
RadTriage: Adversarial AI for Medical Imaging Referral Triage
We built a system that reads doctor handwriting, interprets Medicare billing rules, and uses competing AI agents to achieve 99%+ accuracy in referral eligibility assessment.
Key results: 99.2% triage accuracy · ~25s end-to-end latency · $380K revenue recovered/yr · -73% billing rejections · 40 staff hours saved/wk
Abstract
Medical imaging referral processing in Australia requires accurate interpretation of handwritten referral documents against a complex and frequently updated Medicare Benefits Schedule (MBS). Manual triage is slow, error-prone, and expensive. Billing rejections from misinterpreted eligibility criteria cost radiology clinics hundreds of thousands of dollars annually. We present RadTriage, an end-to-end AI pipeline that digitises handwritten referrals using custom-trained OCR models, extracts structured clinical data, and determines Medicare eligibility through a novel adversarial reasoning architecture. Two independent AI agents, the Advocate and the Sceptic, evaluate each referral and debate their assessments until reaching consensus, mimicking the deliberative process of experienced billing staff. In pilot testing across 4,200 referrals at three radiology clinics, RadTriage achieved 99.2% agreement with expert human reviewers, with a mean processing time of 24.6 seconds per referral.
Introduction
Australia's Medicare Benefits Schedule defines the rules governing rebate eligibility for medical imaging services. For a radiology clinic, every patient referral must be assessed against the applicable MBS item codes, clinical indication criteria, and requesting practitioner qualifications before the scan is performed. An ineligible scan that proceeds results in a rejected Medicare claim, and the clinic absorbs the full cost.
Three factors compound the problem. First, the majority of imaging referrals in Australia are still handwritten, and the legibility of physician handwriting is, to put it charitably, variable. Second, the MBS is a labyrinthine document with thousands of item codes, complex eligibility rules, and frequent updates. Third, the clinical information on a referral must be mapped to specific codes and modalities, a task requiring both medical knowledge and billing expertise.
Most clinics rely on experienced reception and billing staff to perform this triage manually, and the approach holds up until it doesn't. Staff turnover means institutional knowledge walks out the door. MBS updates introduce new rules that take weeks to propagate through training. And the sheer volume, with a busy clinic processing hundreds of referrals daily, makes consistent accuracy nearly impossible.
RadTriage was designed to address this problem comprehensively: a single system that handles the full pipeline from paper referral to eligibility determination, with accuracy that matches or exceeds the best human reviewers.
System Architecture
RadTriage comprises four sequential processing stages, each implemented as an independent service with well-defined input/output contracts. This modularity enables independent scaling, testing, and model updates without system-wide redeployment.
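Conceptually, the orchestration layer is thin: each stage consumes the previous stage's output. The sketch below is illustrative only (the class and service names are ours, not the production interface):

from dataclasses import dataclass
from typing import Any, Protocol

# Illustrative stage contract; names are ours, not the production system's.
@dataclass
class Referral:
    patient: str
    dob: str
    indication: str
    referrer: str
    provider_number: str | None

class Stage(Protocol):
    def run(self, payload: Any) -> Any: ...

def triage(scan: bytes, stages: dict[str, Stage]) -> Any:
    page = stages["preprocess"].run(scan)      # deskew, denoise, segment
    text = stages["ocr"].run(page)             # handwriting recognition
    referral = stages["ner"].run(text)         # -> Referral
    return stages["debate"].run(referral)      # eligibility determination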
2.1 Document Acquisition & Pre-processing
Referral documents enter the system via high-resolution scanning (300 DPI minimum, colour). A pre-processing pipeline performs deskewing, contrast normalisation, noise reduction, and binarisation. Document layout analysis identifies and segments the referral into regions of interest: patient demographics, clinical information, requesting practitioner details, and provider stamps. This segmentation is performed by a U-Net-based layout model trained on 12,000 annotated referral documents from six clinic partners.
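A minimal sketch of this chain using OpenCV (listed in the technology stack); the parameter values here are illustrative, not the production settings:

import cv2
import numpy as np

def preprocess(path: str) -> np.ndarray:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Deskew: fit a minimum-area rectangle around ink pixels and rotate by
    # its angle (OpenCV's angle convention differs across versions, so the
    # correction below may need adjusting).
    coords = np.column_stack(np.where(img < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), borderValue=255)

    # Contrast normalisation (CLAHE) and noise reduction.
    img = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(img)
    img = cv2.fastNlMeansDenoising(img, h=10)

    # Adaptive binarisation copes with uneven scan illumination.
    return cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 15)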
2.2 Handwriting Recognition (Custom OCR)
Medical handwriting recognition is a harder problem than general handwriting OCR. Physician handwriting exhibits extreme variability in letterform, inconsistent spacing, liberal use of non-standard abbreviations (Hx, Dx, Rx, NAD, SOB, #NOF), and a tendency toward connected cursive that defeats commercial OCR engines.
We trained a custom recognition model based on a Transformer-encoder architecture with a CTC (Connectionist Temporal Classification) decoder, operating on line-level text segments extracted by the layout model. The training corpus comprises 85,000 annotated text-line images harvested from de-identified referral documents, supplemented with synthetic data generated by a handwriting style-transfer GAN that learned the statistical properties of physician handwriting.
Critical to performance is the medical vocabulary model, a domain-constrained language model that biases decoding toward medically plausible character sequences. This reduces character-level error rates by approximately 40% compared to vocabulary-agnostic decoding, particularly on abbreviated clinical terms.
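One way to picture the vocabulary bias is as a re-ranking pass over the decoder's beam candidates. This is a minimal sketch under assumptions: medical_lm.log_prob stands in for whatever scoring interface the domain LM exposes, and the fusion weight alpha is illustrative:

def rerank_beam(beam: list[tuple[str, float]], medical_lm,
                alpha: float = 0.6) -> str:
    # beam: (candidate_text, ctc_log_prob) pairs from the CTC beam search.
    # medical_lm.log_prob: log-probability of the text under the domain LM.
    def fused(candidate: tuple[str, float]) -> float:
        text, ctc_log_prob = candidate
        return ctc_log_prob + alpha * medical_lm.log_prob(text)
    return max(beam, key=fused)[0]

# e.g. beam = [("chest pain 7 SOB", -4.1), ("chest pain ? SOB", -4.3)]:
# a medical LM strongly prefers "? SOB" (query: shortness of breath),
# overriding the slightly higher raw CTC score of the misread "7".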
2.3 Clinical Entity Extraction
The OCR output is processed by a clinical NER (Named Entity Recognition) model that extracts structured data: patient identifiers, date of birth, referring practitioner and provider number, clinical indication, body region, suspected diagnosis, relevant history, and any specific imaging requests. The NER model is a fine-tuned BioBERT variant trained on 6,000 annotated referral extractions, achieving an F1 score of 0.94 on held-out test data.
Extracted entities are cross-validated against external data sources where available: practitioner provider numbers are validated against the AHPRA register, and Medicare provider eligibility is confirmed via the HPOS (Health Professional Online Services) API.
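A sketch of how these checks compose. AHPRA and HPOS do not publish clients with these signatures, so ahpra_lookup and hpos_eligible are hypothetical stand-ins for the integration layer:

def validate_referrer(referral: dict, ahpra_lookup, hpos_eligible) -> list[str]:
    # Returns a list of problems; an empty list means the referrer checks out.
    issues = []
    provider = referral.get("provider")
    if not provider:
        # The debate example later in this article hinges on exactly this gap.
        return ["missing provider number"]
    if not ahpra_lookup(referral["referrer"]):   # hypothetical AHPRA client
        issues.append("referrer not found on the AHPRA register")
    if not hpos_eligible(provider):              # hypothetical HPOS client
        issues.append("provider not confirmed Medicare-eligible via HPOS")
    return issues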
Adversarial Reasoning Engine
The core innovation in RadTriage is the adversarial eligibility assessment architecture. Eligibility errors are costly in both directions: a false approval becomes a rejected claim the clinic absorbs, while a false rejection turns away a rebatable scan. Rather than trust a single model with that trade-off, we implemented a dual-agent debate system inspired by adversarial collaboration frameworks in AI safety research.
3.1 The Advocate
The Advocate agent receives the structured referral data and attempts to construct the strongest possible case for Medicare eligibility. It identifies applicable MBS item codes, matches clinical indications to eligibility criteria, and generates a reasoned argument for why the referral qualifies for a Medicare rebate. The Advocate operates on the principle of charitable interpretation: where ambiguity exists, it resolves in favour of eligibility.
3.2 The Sceptic
The Sceptic agent receives the same structured data and independently constructs the case against eligibility. It identifies potential disqualifying factors: missing information, time-based restrictions (e.g., repeat imaging intervals), clinical indication mismatches, provider qualification gaps, and edge cases in MBS interpretation. The Sceptic applies the strictest reasonable reading of the rules.
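Paraphrasing the two roles as system prompts (illustrative only; the production prompts are not published):

ADVOCATE_PROMPT = """You are the Advocate. Given a structured imaging referral
and retrieved MBS schedule text, build the strongest defensible case FOR
Medicare eligibility. Identify applicable item codes. Where genuine ambiguity
exists, resolve it in favour of eligibility. Cite the clauses you rely on."""

SCEPTIC_PROMPT = """You are the Sceptic. Given the same referral and MBS text,
build the case AGAINST eligibility. Apply the strictest reasonable reading:
missing information, repeat-imaging intervals, indication mismatches, and
referrer qualification gaps all count against the claim."""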
3.3 Deliberation Protocol
The two agents engage in a structured multi-turn debate. Each round, the Advocate presents or refines its argument for eligibility, and the Sceptic challenges specific claims with counter-evidence or alternative rule interpretations. Both agents have access to the complete, current MBS schedule as a retrieval-augmented knowledge base, ensuring arguments are grounded in the actual regulatory text rather than training data that may be stale.
The debate proceeds for a minimum of two rounds and a maximum of five. Convergence is declared when both agents agree on the eligibility determination and the applicable item code(s). If the agents fail to converge after five rounds, the referral is flagged for human review with the full debate transcript attached, giving the reviewer a structured analysis of the ambiguity.
Both agents are implemented as fine-tuned LLMs with role-specific system prompts and chain-of-thought reasoning. Temperature is set to 0.1 for the Advocate (favouring consistent, optimistic interpretation) and 0.05 for the Sceptic (favouring deterministic, conservative analysis). Each agent's output is structured as JSON with explicit reasoning chains, enabling full auditability of every decision.
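Stripped of the prompt engineering, the protocol reduces to a bounded loop. The sketch below simplifies under stated assumptions: call_llm stands in for the inference client, and each agent returns JSON with eligible and item_codes fields:

import json

MIN_ROUNDS, MAX_ROUNDS = 2, 5

def deliberate(referral: dict, call_llm) -> dict:
    transcript: list[dict] = []
    for rnd in range(1, MAX_ROUNDS + 1):
        advocate = json.loads(call_llm(role="advocate", temperature=0.1,
                                       referral=referral,
                                       transcript=transcript))
        sceptic = json.loads(call_llm(role="sceptic", temperature=0.05,
                                      referral=referral,
                                      transcript=transcript + [advocate]))
        transcript += [advocate, sceptic]
        agreed = (advocate["eligible"] == sceptic["eligible"]
                  and advocate["item_codes"] == sceptic["item_codes"])
        if agreed and rnd >= MIN_ROUNDS:
            return {"determination": advocate["eligible"],
                    "item_codes": advocate["item_codes"],
                    "rounds": rnd, "transcript": transcript}
    # Non-convergence after five rounds is the escalation path: hand the
    # full transcript to a human reviewer.
    return {"determination": "FLAG_FOR_REVIEW",
            "rounds": MAX_ROUNDS, "transcript": transcript}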
MBS Knowledge System
The Medicare Benefits Schedule is not a static document. Item codes are added, modified, and deprecated. Eligibility criteria change. Fee schedules are updated quarterly. Any system that hardcodes MBS rules will be wrong within months.
RadTriage maintains a structured, version-controlled representation of the MBS as a knowledge graph. Each item code is a node with edges to its eligibility criteria, applicable modalities, body regions, clinical indications, fee schedule, and restriction rules (time-based, frequency-based, referrer-qualification-based). When the MBS is updated, the knowledge graph is diffed against the previous version, and affected decision paths are automatically flagged for regression testing.
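Because every version of the graph is retained, the update check can be a straightforward structural diff. A minimal sketch, with the graph simplified to a dict keyed by item code:

def diff_mbs(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    # Item codes whose rules changed get their decision paths queued for
    # regression testing.
    return {
        "added":    sorted(c for c in new if c not in old),
        "removed":  sorted(c for c in old if c not in new),
        "modified": sorted(c for c in new if c in old and new[c] != old[c]),
    }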
Both the Advocate and Sceptic agents access the MBS knowledge graph via retrieval-augmented generation (RAG). Queries are embedded using a domain-specific embedding model and matched against the knowledge graph with hybrid sparse/dense retrieval. This ensures that arguments are always grounded in the current regulatory text, not in potentially outdated parametric knowledge.
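The retrieval blend can be sketched as a weighted score. Here sparse_score and embed stand in for the real components (e.g. BM25 and the domain embedding model), and beta is a tunable weight; exact item-code and keyword matches matter in regulatory text, which is why the lexical term is kept:

import numpy as np

def hybrid_search(query: str, docs: list[str], sparse_score, embed,
                  beta: float = 0.5, k: int = 5) -> list[str]:
    q = embed(query)
    q = q / np.linalg.norm(q)

    def score(doc: str) -> float:
        d = embed(doc)
        dense = float(q @ (d / np.linalg.norm(d)))   # cosine similarity
        return beta * sparse_score(query, doc) + (1 - beta) * dense

    return sorted(docs, key=score, reverse=True)[:k]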
The four pipeline stages at a glance:

1. Document Acquisition: 300 DPI colour scan; deskew & normalise; noise reduction; layout segmentation. (U-Net layout model · 12K training docs)
2. Handwriting Recognition: line-level segmentation; Transformer-CTC decoder; medical vocabulary bias; style-transfer augmentation. (85K annotated text-lines · GAN synthetic data)
3. Entity Extraction: patient demographics; practitioner & provider; clinical indication; body region & diagnosis. (BioBERT fine-tune · F1: 0.94 · AHPRA validation)
4. Adversarial Reasoning: Advocate agent (FOR); Sceptic agent (AGAINST); 2–5 round debate; consensus or flag. (Fine-tuned LLMs · RAG over MBS knowledge graph)
Worked example: the pipeline applied to a sample referral. The scanned form reads "Pt: Margaret Chen · DOB: 14/03/1958 (67y) · Acute pleuritic chest pain · ?PE r/o pneumothorax · Dr S. Kapoor · Prov: 2466731T". OCR and entity extraction yield the structured record:

{
  "patient": "Margaret Chen",
  "dob": "1958-03-14",
  "indication": "acute pleuritic chest pain",
  "query": "PE, pneumothorax",
  "referrer": "Dr S. Kapoor",
  "provider": "2466731T"
}
Example debate transcript (a different referral: CT Brain, physiotherapist referrer with a GP co-signature):

Advocate (round 1): Referral qualifies under MBS item 57001 (CT Brain without contrast). Clinical indication of "persistent headache 6 weeks, ?SOL" meets the diagnostic imaging criteria for neurological investigation.

Sceptic (round 1): The referring practitioner is a physiotherapist (provider number format confirmed). Under MBS rules, CT Brain requires a referral from a medical practitioner or specialist. Physiotherapists can only request limited imaging; CT Brain is not on the approved list.

Advocate (round 2): Acknowledged; revising assessment. The physiotherapist referral pathway is restricted. However, the referral includes a co-signature from Dr S. Kapoor (AHPRA validated, GP). If the GP is the effective referrer, item 57001 eligibility is restored.

Sceptic (round 2): The co-signature is present but the GP's provider number is not listed on the referral form. Without a valid provider number from the referring medical practitioner, Medicare cannot process the claim regardless of clinical merit.

Advocate (round 3): Conceding that eligibility cannot be established from the current document. Recommend FLAG for human review: the clinical indication is valid and the GP relationship is evidenced, but the referral is technically incomplete. Staff should contact the GP practice to obtain the provider number before proceeding.

Sceptic (round 3): Agreed. FLAG is the correct determination. The referral is not rejectable on clinical grounds but is not processable in its current form. Specific action required: obtain GP provider number.

Outcome: FLAG · Action: contact GP practice for provider number · 3 rounds · 22.4s
MBS knowledge graph excerpt: item nodes 57001 (CT Brain w/o contrast), 57004 (CT Chest), and 57007 (CT Abdomen), linked to criteria nodes for clinical indication (Neurological, Respiratory), referrer type (GP Referral, Specialist), restrictions (12-month rule), and modality (CT).
Results
5.1 Pilot Study Design
RadTriage was evaluated in a prospective pilot across three radiology clinics in metropolitan Sydney over a 12-week period. All incoming referrals (n=4,217) were processed by both the RadTriage system and the clinic's existing manual triage process. Staff were blinded to the system's output during the evaluation period. An expert panel of two senior billing specialists independently reviewed all cases where the system and manual triage disagreed.
5.2 OCR Performance
The custom OCR model achieved a character-level accuracy of 96.8% across all referral text, rising to 98.3% on printed text and 94.1% on handwritten text. With the medical vocabulary model applied, word-level accuracy on clinical terms reached 97.2%. For comparison, Google Cloud Vision achieved 91.4% character-level accuracy and Amazon Textract achieved 89.7% on the same handwritten test set. The domain-specific training and medical vocabulary model provide a clear advantage on physician handwriting.
5.3 Entity Extraction
The clinical NER model achieved an overall F1 score of 0.94. Performance varied by entity type: patient demographics (F1: 0.98), referring practitioner (F1: 0.97), clinical indication (F1: 0.91), body region (F1: 0.96), and suspected diagnosis (F1: 0.89). The lower performance on clinical indication and diagnosis reflects the inherent ambiguity and abbreviation density in these fields.
5.4 Eligibility Determination
Of 4,217 referrals processed, RadTriage achieved exact agreement with the expert panel on eligibility determination in 4,183 cases (99.2%). Of the 34 disagreements, 22 were cases where the system flagged for human review due to non-convergence in the adversarial debate, the intended behaviour for ambiguous referrals. Of the remaining 12 true errors, 8 were attributable to OCR misreading (typically in severely degraded handwriting) and 4 to entity extraction errors that propagated into incorrect code assignment.
The adversarial architecture caught 17 cases that the manual triage process got wrong: referrals that staff approved but that were ineligible, representing approximately $12,400 in avoided billing rejections during the pilot period alone.
5.5 Processing Performance
Mean end-to-end processing time was 24.6 seconds per referral (σ = 4.8s). The breakdown: document pre-processing 1.2s, OCR 14.8s, entity extraction 1.4s, adversarial deliberation 7.2s (mean 2.4 debate rounds). The OCR phase dominates latency because custom handwriting recognition on degraded input requires multiple inference passes with beam search decoding and medical vocabulary re-ranking. Referrals with particularly poor handwriting trigger additional recognition passes, pushing worst-case OCR latency above 20s.
5.6 Financial Impact
Extrapolating from the pilot data: across the three clinics, RadTriage is projected to recover approximately $380,000 per year in previously rejected Medicare claims, while reducing staff triage time by approximately 40 hours per week. The system also identified $49,600 in referrals that would have been incorrectly approved, preventing downstream audit exposure.
Headline metrics: eligibility accuracy 99.2% · handwriting OCR 96.8% (char-level) · clinical NER F1 0.94.

Latency breakdown (mean 24.6s): pre-process 1.2s · OCR 14.8s · NER 1.4s · debate 7.2s.

Note: Custom OCR dominates latency at 60% of total processing time. Handwriting recognition requires multiple inference passes with beam search and medical vocabulary re-ranking. Worst-case OCR on severely degraded handwriting: 20s+.
Discussion
The adversarial architecture is the key differentiator. Single-model approaches to eligibility determination, even well-trained ones, tend to develop systematic biases. A model trained primarily on eligible referrals develops a tendency to approve; one trained with strong negative examples becomes over-conservative. By forcing two agents to argue opposing positions, RadTriage surfaces the reasoning behind each determination and catches cases where a single model would silently err.
The non-convergence flag is a feature. Referrals where the agents cannot agree after five rounds are ambiguous. These are the cases that benefit most from experienced human review. The debate transcript provides the reviewer with a structured analysis they wouldn't otherwise have, and it often identifies specific MBS clauses or clinical interpretation questions that need resolution.
During the pilot, RadTriage identified a category of referrals that clinic staff were systematically misclassifying: a specific interaction between time-based restrictions and provider qualification rules that affects approximately 2.3% of referrals. This rule interaction was not well-understood by staff at any of the three pilot sites, suggesting a systemic training gap in the industry.
Conclusion & Future Work
RadTriage demonstrates that adversarial AI architectures can achieve expert-level performance on complex regulatory interpretation tasks while maintaining full auditability and appropriate deference to human judgment on ambiguous cases.
The system is currently in production deployment at three radiology clinics, with expansion planned to an additional twelve sites in 2026. Ongoing work includes:
- Extension to additional imaging modalities (PET, nuclear medicine) with modality-specific eligibility rules
- Integration with RIS (Radiology Information Systems) for end-to-end workflow automation from referral to billing
- Real-time MBS change monitoring with automated regression testing of affected decision paths
- Federated model updates across clinic sites to improve OCR performance on site-specific handwriting patterns without sharing patient data
- Investigation of the adversarial architecture's applicability to other regulated domains: PBS (Pharmaceutical Benefits Scheme) eligibility, workers' compensation claim assessment, and insurance pre-authorisation
The broader implication is that adversarial reasoning, having AI systems argue both sides of a determination before reaching a conclusion, may be a general-purpose pattern for high-stakes classification tasks where false positives and false negatives carry asymmetric costs. We believe this architecture has significant potential beyond medical billing.
Technology Stack
OCR & Vision
- Custom Transformer-CTC model
- U-Net layout segmentation
- Style-transfer GAN (synthetic data)
- OpenCV pre-processing
NLP & Reasoning
- Fine-tuned BioBERT (NER)
- Fine-tuned LLMs (Advocate/Sceptic)
- RAG with hybrid retrieval
- Medical vocabulary model
Knowledge & Data
- MBS knowledge graph
- AHPRA/HPOS API integration
- Version-controlled rule sets
- PostgreSQL + vector store
Infrastructure
- Python (FastAPI)
- Docker + Kubernetes
- GPU inference (NVIDIA T4)
- On-premise deployment option