New AI Model Beats Doctors at Clinical Reasoning, Diagnosis

April 30, 2026
in Health News

  • Most research testing the medical reasoning abilities of large language models (LLMs) has lacked physician baselines.
  • Across six experiments with human baselines, a sophisticated LLM matched or outperformed physicians and other artificial intelligence tools.
  • The authors emphasized that artificial intelligence will not replace doctors, but clinical trials testing these tools are needed.

A large language model (LLM) matched or exceeded hundreds of expert physicians in diagnostic and management reasoning tasks across six experiments, a new study showed.

The LLM’s advantage was most pronounced in early-stage emergency department triage, where it identified the exact or very close diagnosis in 67.1% of cases, compared to two physicians whose accuracy was 55.3% and 50%, reported Arjun K. Manrai, PhD, of the department of biomedical informatics at Harvard Medical School in Boston, and colleagues in Science.

For this experiment, researchers copied and pasted patients' electronic health records into the LLM with no data curation, to simulate real-world use. But the LLM also performed well in assessments of differential diagnosis, diagnostic test selection, and other tasks.

“Overall, our findings show that LLMs now demonstrate substantial performance in differential diagnosis, diagnostic clinical reasoning, and management reasoning, and exceed both prior model generations and even human clinicians across multiple domains,” the authors wrote.

“This paper just kept surprising me,” Manrai told MedPage Today.

He said that he doesn’t think the study’s findings mean that artificial intelligence (AI) will replace doctors, but that “we’re witnessing a really profound change in technology that will reshape medicine, and that we need to evaluate this technology now in rigorously conducted prospective clinical trials.”

In an accompanying editorial, Ashley Hopkins, PhD, and Erik Cornelisse, both of Flinders University in Adelaide, Australia, agreed that AI use in clinical practice needs to be proven to benefit real-world applications in randomized trials.

“Although evaluation methods are progressing, the deployment of AI systems is outpacing them,” they noted. “Accuracy on a validated task does not guarantee that a deployed system will confine itself to that task.” For instance, patient-facing ChatGPT Health is promoted as being able to address more than 40 million health-related questions each day, but it is not designed for clinical triage. However, it will still do triage tasks without pushback.

While this study shows that AI can perform some diagnostic tasks as well as or better than physicians, the expectation is that AI in healthcare will be collaborative, "with clinicians providing oversight, contextual judgment, and accountability," rather than replacing clinicians, they wrote.

Study Details

Manrai noted that whether computers can think or reason like physicians is a decades-old question, but in recent years AI has become dramatically more sophisticated. Much of the existing literature on the diagnostic and management capabilities of LLMs has lacked human physician baselines and has focused on narrow diagnostic tasks and curated clinical vignettes.

To address this gap, the team systematically evaluated the medical reasoning abilities of the preview version of OpenAI o1, a reasoning model LLM, against physician baselines from previous studies.

Researchers first evaluated o1-preview using 143 clinicopathological conferences published in the New England Journal of Medicine (NEJM) with two physicians evaluating the quality of its differential diagnosis. The LLM included the correct diagnosis in its differential in 78.3% of cases and its first suggested diagnosis was correct in 52% of cases. When expanded to also include potentially helpful or very close diagnoses, o1-preview was accurate in 97.9% of cases.

For a subset of 136 cases, they tested o1-preview’s ability to select the next diagnostic test. In 87.5% of cases the chosen testing plan was deemed correct, in 11% it would have been helpful, and in 1.5% of cases it chose an unhelpful test.

The team also assessed o1-preview with 20 clinical reasoning cases from the NEJM Healer curriculum that were previously evaluated with GPT-4 in a prior study. In 78 of 80 cases, o1-preview achieved a perfect score on the Revised-IDEA instrument, a validated 10-point scale for evaluating four core domains of documenting clinical reasoning. This was significantly better than GPT-4 (47/80), attending physicians (28/80), and resident physicians (16/72) (P<0.0001 for all).

The next test of o1-preview used five clinical vignettes from a previous study, each presented with a series of questions on next steps in management. The median score for o1-preview was 89% per case, which was again better than GPT-4 (median 42%), physicians with access to GPT-4 (median 41%), and physicians with conventional resources (median 34%).

OpenAI o1-preview was also tested using six clinical vignettes from a previous study that compared GPT-4 to generalist physicians. The LLM’s median score per case was 97% — once again higher than GPT-4, physicians with access to GPT-4, and physicians with conventional resources.

Lastly, they tested o1-preview's diagnostic probabilistic reasoning using five cases on primary care topics. The LLM modestly outperformed GPT-4 overall, and clinicians showed substantially wider variability in their estimates than either AI model.

The authors noted that this research reflects only the preview version of OpenAI o1 and that newer models already exist; more research comparing these newer models is still needed. Also, their study focused on emergency medicine and internal medicine, so the findings may not generalize to all specialties. In addition, the experiments were purely text-based, without the auditory or visual information that clinicians would normally use to inform their assessments.



Source link : https://www.medpagetoday.com/practicemanagement/informationtechnology/121049

Author :

Publish date : 2026-04-30 21:00:00

Copyright for syndicated content belongs to the linked Source.

© 2022 NewsHealth.