By Dr. David PW Rastall.
Every year, millions of patients suffer from misdiagnosis, making it one of the most persistent and underappreciated risks in modern healthcare. Despite decades of research, misdiagnosis remains a major cause of preventable harm, leading to delayed treatment, unnecessary procedures, and even death.1 Unlike errors such as incorrect medication dosing or failure to monitor sepsis, many misdiagnoses go undetected for years.2–5 Without immediate consequences, errors can slip through the cracks, escaping routine quality checks and physician oversight.
Enter Artificial Intelligence (AI)—a tool uniquely suited to tackling this complex challenge. AI can rapidly process vast amounts of patient data, identify subtle patterns that elude human clinicians, and learn from past diagnostic mistakes. However, while AI presents enormous opportunities, it also faces serious obstacles. The greatest challenge is that AI is built on very large datasets, and very large datasets of verified, correct diagnoses do not exist. Most general-purpose AI models are therefore trained on the electronic health record (EHR), where misdiagnoses are rarely labeled.6 Without access to the ground truth of which diagnoses were correct and which were errors, AI is likely to simply learn to repeat existing mistakes and reinforce biases instead of improving care delivery.
In this article, we explore the promise and pitfalls of AI in diagnostic medicine, focusing on the need for high-quality training data and the hidden dangers of undetected AI errors.
AI's Unique Advantages in Tackling Misdiagnosis
AI is particularly well-suited to reducing diagnostic errors for several key reasons:
- Knowledge Pool – Medicine is growing incredibly complex, and no single doctor can know everything. In fact, it was complex enough in the 1800s to spark the invention of medical specialization, with the idea that each specialist could excel in their own area of expertise.7 This model leaves some patients with unclear or rare symptoms falling into the gaps between specialties. AI can hold knowledge of all symptoms and diseases at once, which alone may help many such patients along their journey to a diagnosis.
- Inhuman Pattern Recognition – Certain patterns stand out to humans because of our experiences and the way our brains are wired. For those same reasons, other patterns are hidden from us. AI can examine the same data from an alternative perspective and detect patterns the human brain is built to overlook. It therefore pairs well with a human physician, each covering the other's weaknesses and amplifying the other's strengths.
- Consistent, Unbiased Analysis – While doctors experience fatigue, cognitive biases, and variability in training, AI can provide standardized diagnostic support across different cases and patient populations. If trained correctly, AI could help safeguard against many cognitive biases and promote greater fairness in medicine.
- Fill Gaps in the Diagnostic Journey – Many stages of the diagnostic journey are currently not covered by any physician, and AI could greatly improve them. Examples include outpatient triage, care transitions from hospital to home, care transitions between outpatient specialists, and routing complex cases to the correct specialist.
- Augmenting Physician Decision-Making – AI can enhance doctors' ability to diagnose complex cases by serving as a second opinion, highlighting potential errors and summarizing complex histories into the information that matters most for medical decision making.
However, despite these advantages, significant challenges remain, and the most difficult one stems from the fundamental nature of misdiagnosis itself.
The Missing Data Problem: AI's Biggest Obstacle
Unlike traditional diagnostic AI applications, such as detecting cancer in pathology slides or spotting pneumonia on chest X-rays, an AI system trained to reduce misdiagnosis faces a major hurdle: Misdiagnoses are not systematically recorded in the EHR. In most hospital databases, a patient's diagnosis is simply assumed to be correct unless explicitly revised later. This creates a "dataset ceiling effect", where AI trained on standard medical records will only be as accurate as the current healthcare system—and never better.
Why Can't AI Learn from Misdiagnosis the Way It Learns Other Tasks?
To train an AI model, researchers need large, labeled datasets: collections of patient cases where each diagnosis is confirmed to be correct or incorrect. In real-world medicine, this level of verification rarely happens. For instance, consider diagnosing stroke in the ED (a toy sketch of the resulting training data follows this list):
- If a stroke is correctly diagnosed, an MRI is obtained, the patient receives appropriate treatment, and the diagnosis is correctly recorded in the EHR.
- However, if a stroke is misdiagnosed as a benign condition (e.g., inner ear disease), the patient is frequently discharged without an MRI. Unless they return to the same hospital, and unless the connection is noticed and commented on by a physician, the original misdiagnosis remains invisible in the EHR.
- An AI trained on this dataset could not tell the difference between a missed stroke and a correctly diagnosed inner ear disease.
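To make this concrete, here is a toy sketch in Python of what the training data from this scenario looks like to a model. The records are entirely hypothetical; the point is that the "truth" field does not exist in real EHR data.

```python
# Toy illustration with hypothetical records: what the EHR "sees" in the
# stroke scenario above. A model is trained on ehr_label; "truth" is shown
# only to make the point and is unavailable in real EHR data.
records = [
    # Correctly diagnosed stroke: MRI obtained, label matches reality.
    {"symptoms": "dizziness, ataxia", "ehr_label": "stroke", "truth": "stroke"},
    # Missed stroke: discharged as benign without an MRI, so the label is wrong.
    {"symptoms": "dizziness", "ehr_label": "inner ear disease", "truth": "stroke"},
    # Genuine inner ear disease: the label happens to be right.
    {"symptoms": "dizziness", "ehr_label": "inner ear disease", "truth": "inner ear disease"},
]

# The second and third records are identical from the model's point of view:
# same symptoms, same label. Any model that fits these labels will learn
# "dizziness -> inner ear disease" and faithfully repeat the missed stroke.
for r in records:
    print(r["symptoms"], "->", r["ehr_label"])
```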
As a result, AI systems trained on routine clinical data will learn to reproduce the same diagnostic errors as human doctors, but with even greater efficiency.
The Hidden Risks: When AI's Mistakes Go Unnoticed
Another major challenge is that AI-driven misdiagnoses could be harder to detect than human errors. Consider two real-world AI failures in medicine:
- IBM Watson for Oncology suggested unsafe cancer treatments, but oncologists quickly identified the errors and discontinued its use.8,9
- Epic’s Sepsis Prediction Model failed to detect many cases of sepsis. However, because sepsis leads to rapid inpatient deterioration, doctors quickly noticed and raised concerns.10
By contrast, certain diagnostic AI errors could go unnoticed for years. If an AI system mistakenly classifies a serious condition as benign and the patient is never seen in the hospital again, the error may go undetected. Even if the error results in death, there may be no way to link the outcome to the AI's decision. These "silent failures" pose a significant risk, especially if AI is deployed without ongoing human oversight or without mechanisms to flag potential misdiagnoses for review.
How Can We Make AI Work for Diagnosis?
Given these challenges, what can be done to ensure AI fulfills its promise in diagnostic medicine?
- Create Specially Designed Training Datasets – Instead of relying on flawed EHR data, researchers must develop gold-standard datasets where diagnoses are systematically verified. This means conducting long-term follow-up studies and tracking patient outcomes across multiple healthcare systems.
- Integrate AI with Physician Expertise – AI should function as an assistant, not an authority. Instead of making definitive diagnoses, AI should highlight uncertainties and alternative possibilities, prompting physicians to reconsider potential errors.
- Monitor AI for Hidden Errors – Continuous post-deployment monitoring is essential. Just as we audit medical devices and pharmaceuticals, we need AI surveillance systems that flag patterns of misdiagnosis before they cause widespread harm (a minimal sketch of one such audit follows this list).
- Require AI to Justify Its Decisions – "Black box" AI models that provide no explanation for their diagnoses are dangerous in medicine. AI should provide clear reasoning and supporting evidence for its conclusions, allowing physicians to evaluate and override incorrect recommendations.
- Ensure AI Optimization Prioritizes Patient Outcomes—Not Costs – Many existing healthcare AI models are optimized to reduce costs rather than to improve diagnosis. AI developers must explicitly train systems to maximize diagnostic accuracy and patient outcomes rather than to minimize expenses.
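As one concrete possibility for the monitoring point above, the SPADE framework cited earlier6 pairs a benign symptom diagnosis with the dangerous disease it can mask and looks for statistical excesses. A minimal Python sketch of such a "look-forward" audit might look like the following; the field names, baseline rate, and alert threshold are illustrative assumptions, not a real hospital schema or validated parameters.

```python
# Minimal sketch of a SPADE-style look-forward audit for one symptom-disease
# pair (benign dizziness -> missed stroke). All field names and thresholds
# below are hypothetical, chosen only to illustrate the approach.
from datetime import timedelta

def missed_stroke_proxy_rate(visits, window_days=30):
    """Fraction of ED discharges labeled 'inner ear disease' followed by a
    stroke admission within window_days -- a proxy signal for missed
    strokes, not a confirmed error rate."""
    discharges = [v for v in visits
                  if v["setting"] == "ED" and v["dx"] == "inner ear disease"]
    if not discharges:
        return 0.0
    flagged = sum(
        1 for d in discharges
        if any(v["patient_id"] == d["patient_id"]
               and v["dx"] == "stroke"
               and d["date"] < v["date"] <= d["date"] + timedelta(days=window_days)
               for v in visits)
    )
    return flagged / len(discharges)

# Compare AI-assisted visits against a pre-deployment baseline; a sustained
# excess is a signal for human review, not proof of misdiagnosis.
BASELINE_RATE = 0.004  # assumed historical proxy rate, illustration only

def audit(ai_assisted_visits):
    rate = missed_stroke_proxy_rate(ai_assisted_visits)
    if rate > 2 * BASELINE_RATE:
        print(f"ALERT: proxy missed-stroke rate {rate:.2%} exceeds baseline")
```

An audit like this cannot prove that any individual case was an error; it surfaces statistical excesses that human reviewers can then investigate, which is exactly the kind of safety net silent failures require.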
Conclusion: AI's Promise and Peril in Medicine
AI has the potential to revolutionize diagnostic medicine, reducing misdiagnoses and saving countless lives. However, without carefully designed datasets, physician oversight, and transparency, AI could just as easily amplify existing errors and biases—all while making its mistakes harder to detect!
As AI continues to integrate into healthcare, we must ensure that it serves as a tool for diagnostic improvement and not just automation. The future of AI in medicine will depend not only on its technical sophistication but also on our commitment to rigorous evaluation, responsible oversight, and ethical implementation.
By addressing these challenges head-on, we can harness AI’s full potential to create a safer, smarter, and more reliable future for medical diagnosis.
About the author:
Dr. David PW Rastall is a neurologist and AI researcher working with the Johns Hopkins Center for Diagnostic Excellence. His research focuses on improving diagnostic accuracy, particularly the early diagnosis of complex diseases in outpatient settings. He uses patient-centered approaches to integrate AI into clinical workflows, and his research examines AI alignment, deception detection, and the safe implementation of AI in healthcare. Dr. Rastall also leads efforts to develop structured AI clinical validation methodologies and to establish rigorous safety protocols for trustworthy, transparent AI integration in medicine, so that medical AI, patients, and doctors can form teams that thrive.
Cited Works
- Newman-Toker, D. E. et al. Burden of serious harms from diagnostic error in the USA. BMJ Qual. Saf. 33, 109–120 (2024).
- Faye, F. et al. Time to diagnosis and determinants of diagnostic delays of people living with a rare disease: results of a Rare Barometer retrospective patient survey. Eur. J. Hum. Genet. 32, 1116–1126 (2024).
- Faugno, E. et al. Experiences with diagnostic delay among underserved racial and ethnic patients: a systematic review of the qualitative literature. BMJ Qual. Saf. bmjqs-2024-017506 (2024) doi:10.1136/bmjqs-2024-017506.
- Al-Hashel, J. Y., Ahmed, S. F., Alroughani, R. & Goadsby, P. J. Migraine misdiagnosis as a sinusitis, a delay that can last for many years. J. Headache Pain 14, 97 (2013).
- Wróbel, M., Wielgoś, M. & Laudański, P. Diagnostic delay of endometriosis in adults and adolescence-current stage of knowledge. Adv. Med. Sci. 67, 148–153 (2022).
- Liberman, A. L. & Newman-Toker, D. E. Symptom-Disease Pair Analysis of Diagnostic Error (SPADE): a conceptual framework and methodological approach for unearthing misdiagnosis-related harms using big data. BMJ Qual. Saf. 27, 557–566 (2018).
- Weisz, G. The Emergence of Medical Specialization in the Nineteenth Century. Bull. Hist. Med. 77, 536–574 (2003).
- Ross, C. & Swetlitz, I. IBM’s Watson supercomputer recommended ‘unsafe and incorrect’ cancer treatments, internal documents show. STAT https://www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrect-treatments/ (2018).
- Strickland, E. IBM Watson, heal thyself: How IBM overpromised and underdelivered on AI health care. IEEE Spectr. 56, 24–31 (2019). https://ieeexplore.ieee.org/abstract/document/8678513.
- Habib, A. R., Lin, A. L. & Grant, R. W. The Epic Sepsis Model Falls Short-The Importance of External Validation. JAMA Intern. Med. 181, 1040–1041 (2021).
The opinions expressed here are those of the author and do not necessarily reflect those of The Johns Hopkins University.