The Age of Diagnosis-by-AI Is Coming, And Physicians Are Not Ready
Features And What Still Needs to Be Fixed. Leave Your Comment!
When the Best We Have Isn’t Good Enough
A new diagnostic benchmark posted to arXiv in June 2025 should unsettle every health policy official, medical educator, and clinical leader who still believes that AI belongs somewhere in the distant future of medicine.
In a controlled simulation of real diagnostic reasoning—drawn from the New England Journal of Medicine’s clinicopathological case series—experienced physicians were asked to diagnose 56 difficult but solvable cases using only sequential questioning and test requests. They were not allowed to read the full case up front, use internet searches, or confer with colleagues. They had to think through each case step by step.
Their average diagnostic accuracy was 19.9%.
Let that sink in.
19.9%
The best physician across all cases? 41%.
Most didn’t come close. The average test cost was nearly $3,000.
This isn’t a statistical blip. It’s a systemic failure of unaided diagnostic reasoning under constraint.
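To make the test conditions concrete, here is a minimal sketch of a sequential-diagnosis episode of the kind described above: the clinician sees only an opening summary, requests findings one at a time, pays for each test, and then commits to a diagnosis. The class names, the free-question pricing rule, the exact-match scoring, and the sample case are all assumptions made for illustration. None of them come from the preprint.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Toy case file. Findings stay hidden until requested."""
    opening: str                          # the only text shown up front
    answers: dict[str, str]               # question -> answer
    tests: dict[str, tuple[str, float]]   # test name -> (result, price in USD)
    diagnosis: str                        # ground truth, used only for scoring


@dataclass
class Episode:
    """One diagnostic attempt: tracks spending and the running transcript."""
    case: Case
    cost: float = 0.0
    transcript: list[str] = field(default_factory=list)

    def ask(self, question: str) -> str:
        # Assumption: questions are free in this sketch.
        answer = self.case.answers.get(question, "Not documented.")
        self.transcript.append(f"Q: {question} | A: {answer}")
        return answer

    def order(self, test: str) -> str:
        # Ordering a known test reveals its result and adds its price.
        if test not in self.case.tests:
            result = "Test not available."
        else:
            result, price = self.case.tests[test]
            self.cost += price
        self.transcript.append(f"T: {test} | R: {result}")
        return result

    def commit(self, diagnosis: str) -> bool:
        # Exact-match scoring; a crude stand-in for a real grading procedure.
        return diagnosis.strip().lower() == self.case.diagnosis.strip().lower()


# Invented example data, not taken from any NEJM case.
case = Case(
    opening="A 34-year-old presents with fever and joint pain.",
    answers={"Any recent travel?": "Hiking trip two weeks ago."},
    tests={"Lyme serology": ("Positive", 150.0)},
    diagnosis="Lyme disease",
)
episode = Episode(case)
episode.ask("Any recent travel?")
episode.order("Lyme serology")
is_correct = episode.commit("lyme disease")
print(is_correct, episode.cost)
```

Scoring a run then comes down to two numbers per case: whether the committed diagnosis was correct, and how much was spent getting there.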
What AI Did Differently—and Better
The same study introduced MAI-DxO—an AI system that simulates a virtual panel of doctors. Each “agent” plays a specific role: one tracks hypotheses, one selects tests, one flags cognitive bias, another manages cost, and a fifth ensures internal consistency.
In short: it operationalizes the reasoning disciplines we demand from physicians but seldom enforce.
When run atop OpenAI’s “o3” model:
MAI-DxO achieved 81.9% accuracy, more than 4× that of human physicians.
It cut average diagnostic costs by nearly $3,100 compared to the same model without orchestration.
A budget-constrained variant scored 79.9% accuracy at $2,396—outperforming both doctors and baseline models on both axes.
An ensemble configuration reached 85.5% accuracy, at lower cost than the unstructured version.
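The "more than 4×" figure is easy to check against the accuracy numbers quoted in this article. The short calculation below uses only those numbers; the variable names are illustrative.

```python
# Accuracy figures (percent) as quoted in this article.
physician_avg = 19.9
systems = {
    "MAI-DxO on o3": 81.9,
    "budget-constrained variant": 79.9,
    "ensemble configuration": 85.5,
}

# Ratio to the physician average, and the gap in percentage points.
ratios = {name: acc / physician_avg for name, acc in systems.items()}
gaps = {name: round(acc - physician_avg, 1) for name, acc in systems.items()}

for name in systems:
    print(f"{name}: {ratios[name]:.2f}x the physician average, "
          f"+{gaps[name]:.1f} percentage points")
```

Every configuration clears the 4× mark, though the budget-constrained variant does so only narrowly.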
These weren’t cherry-picked toy problems. Many were NEJM CPC cases published after the model’s training cutoff.
Figure from the preprint.
Why the Gap Exists
Physicians operate under constraints—cognitive, temporal, emotional, systemic. They anchor prematurely, overlook rare diagnoses, and order expensive tests that fail to discriminate. MAI-DxO doesn’t get tired, doesn’t panic, and doesn’t skip steps. It questions its own top hypotheses through role-play and structured dissent.
The system doesn’t understand like a human. It performs like one who never drifts off course.
That’s not magic. It’s architecture.
What’s Blocking Deployment?
Despite the performance, systems like MAI-DxO cannot yet be deployed at scale—not because they’re unproven, but because no regulatory framework exists to evaluate them.
In the U.S.:
The FDA’s 2025 draft guidance only applies to locked algorithms. MAI-DxO is neither locked nor singular—it orchestrates multiple agents in real time.
Medicare and Medicaid reimburse some AI diagnostics, such as diabetic retinopathy detection, but not structured reasoning agents.
There is no legal pathway for dynamic, ensemble-based clinical reasoning systems to obtain clearance.
In the EU:
The AI Act (2024/1689) defines diagnostic AI as “high-risk,” requiring conformity assessments—but offers no provisions for reasoning orchestration or agent role distribution.
Globally:
Liability is unclear. If MAI-DxO is right and ignored, or wrong and followed, who’s accountable?
Auditability is minimal. We still lack tools to trace and explain the steps of complex AI diagnostic chains.
No guidance exists on ensemble voting, synthetic data injection, or continuous learning agents.
This Isn’t About Replacing Doctors, But If They Resist…
It’s about ending the illusion that cognitive heroism is enough.
When physicians guess at a 20% success rate—on gold-standard cases—and AI gets four out of five right, we can no longer pretend the status quo is acceptable. Especially when AI can do it with lower cost and greater consistency.
No, MAI-DxO hasn’t been validated in everyday primary care. No, it doesn’t understand patient emotion, legal nuance, or comorbid messiness. But it’s already proven superior in one crucial domain: diagnostic precision under resource constraint.
If we refuse to use it—not because it failed but because we failed to regulate it—we may be committing a form of institutional malpractice.
Limitations Worth Embracing
SDBench, the sequential diagnosis benchmark built for the study, is a simulation. There were no healthy patients. No vague symptoms. No social variables. No scheduling chaos. But the study does not claim clinical deployment readiness. It shows, decisively, that structured AI is already better than humans at one of the hardest tasks in medicine.
Shouldn’t that trigger urgency—not dismissal?
A Turning Point We Might Just Waste
We can keep telling ourselves that diagnostic AI needs more time. But it doesn’t need time—it needs oversight, regulation, audit standards, and clinical validation at scale.
And we need to stop lying to ourselves about human performance in complex medicine. It’s not superhuman. It’s not sacred. It’s brittle.
The age of diagnosis-by-AI isn’t coming.
It’s already here.
What’s missing isn’t technology.
It’s a clear path toward regulation and accountability.





Please remember that AI is not some ultra-smart alien sent down to Earth in AD 2023 to solve all our problems and let us be lazy, do nothing and thrive.
AI = software. There is nothing more to it. It is a combination of dumb harvesting engines, dumb collating engines and dumb regurgitating engines. Its output appears to be smarter than you-the-user only because of the algorithms (= programming routines) which follow Lego-like matching prescriptions.
The performance of AI machines will depend on the quality of the harvested sources. Take that infamous Alzheimer's research story. A number of fake studies with fake images and fake conclusions were the driving force behind the approaches to Alzheimer's treatment for ages. Let your AI harvest all of them, and it will conclude that because it is a sequence of 10 or 20 studies from the same author, all peer-reviewed and published in renowned journals, they must be the legitimate source of true knowledge.
This is exactly what medical schools and doctors do. They harvest abstracts, if they read anything at all, and without double-checking the published story, they accept the conclusions as the true science.
Considering the advance of medical knowledge, there should be a special board set up to analyse and permanently erase all defective peer-reviewed published studies - every single month. As 90% of the papers published in journals are false, redundant or completely useless (according to the medical research itself - Ioannidis), the board will have a lot of work until the end of the universe. In other words, there will never be good, reliable source material for AI to harvest.
So, yes, we need a clear path toward regulation and accountability. Regulation = ban all fake and unnecessary research. (Including creating false teams of authors all glued to one big name that paves the way to being published.) Accountability = remove all offenders from the profession, permanently, no appeal possible. (The latter has to be done algorithmically, no human intervention possible.)
If these measures are applied to both the past publications and the current and future work, maybe we will have a great database of reliable publications for AI to harvest and use in something like… 300 years?
Until that time… if my chances for a true diagnosis from a professionally trained overpriced doctor are at 1/5, as stated in this article… I’ll go back to my grandma who has always been 100% infallible with her herbs.
And we have to make sure it can recognize and diagnose the full scope of vaccine injury.