- Clinical notes generated by artificial intelligence (AI) scored lower on quality than those prepared by humans in this cross-sectional study of five simulated primary care cases.
- The largest difference in quality occurred in a case in which a clinician and patient discussed back pain amid background noise.
- Though AI scribes have “emerged as a promising approach to alleviating documentation burden,” the researchers said that “concerns remain about accuracy, completeness, and style.”
Clinical notes generated by artificial intelligence (AI) scored lower on quality than those prepared by humans, a cross-sectional evaluation of five simulated primary care cases showed.
Notes were rated on 10 attributes (such as being accurate, thorough, organized, and comprehensible) using the modified Physician Documentation Quality Instrument (PDQI-9). The tool rates each attribute on a 5-point scale, for a maximum total score of 50.
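For readers who want the arithmetic concrete, here is a minimal sketch of how a modified PDQI-9 total could be tallied. The attribute names are assumptions pieced together from this article and the standard PDQI-9; the paper's exact modified instrument may differ, and `pdqi_total` is a hypothetical helper, not code from the study.

```python
# Minimal sketch of scoring a note on a modified PDQI-9 as described above:
# 10 attributes, each rated on a 5-point scale, summed to a maximum of 50.
# The attribute list below is an assumption, not the paper's exact instrument.

ATTRIBUTES = [
    "up_to_date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "free_from_bias",
    "free_from_hallucination",
]

def pdqi_total(ratings: dict[str, int]) -> int:
    """Sum the ten 1-5 attribute ratings into a single quality score (max 50)."""
    missing = set(ATTRIBUTES) - ratings.keys()
    if missing:
        raise ValueError(f"missing attribute ratings: {sorted(missing)}")
    for attr in ATTRIBUTES:
        if not 1 <= ratings[attr] <= 5:
            raise ValueError(f"{attr} must be rated 1-5")
    return sum(ratings[attr] for attr in ATTRIBUTES)

# Example: a note rated 4 on every attribute totals 40 of a possible 50.
print(pdqi_total({attr: 4 for attr in ATTRIBUTES}))
```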
The largest quality gap between AI and human notes occurred in a case in which a clinician and patient discussed back pain amid substantial background noise: human notes scored 43.8 versus 20.3 for AI (P≤0.001), Ashok Reddy, MD, MSc, of the Veterans Affairs Puget Sound Health Care System in Seattle, and colleagues reported in Annals of Internal Medicine and at the American College of Physicians meeting in San Francisco.
Humans also performed significantly better than AI in two other cases. One was a chest pain encounter in which both the patient and clinician were masked, where human notes scored 42.2 versus 34.8 for AI (P≤0.05). The other was a nurse care manager encounter with a patient who had heart failure (38.4 vs 32.8, P≤0.05).
Humans scored better in two other cases as well, but the differences weren't statistically significant (human vs AI scores):
- Pharmacy encounter (patient with accented voice): 39.6 vs 36.3
- New patient (patient and clinician with accented voices): 41.8 vs 38.7
The study, which included 11 different AI scribe tools, comes as roughly two-thirds of physicians already use some form of AI, primarily for documentation and messaging, according to recent surveys.
Indeed, ambient AI scribes, which transcribe and summarize clinical visits using large language models, have “emerged as a promising approach to alleviating documentation burden,” Reddy and colleagues noted. However, “concerns remain about accuracy, completeness, and style.”
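As a rough illustration of that two-stage design, the sketch below stubs out the transcription and summarization steps. It is not any vendor's actual implementation; every function name here is hypothetical, and the stubs return canned text so the example runs.

```python
# Hypothetical two-stage ambient-scribe pipeline: speech-to-text, then an LLM
# drafts a structured note from the transcript. Both stages are stubbed with
# canned output; real systems would call ASR and LLM services here.

def transcribe_visit(audio_path: str) -> str:
    # Stub standing in for a diarizing speech-to-text model.
    return "CLINICIAN: What brings you in today?\nPATIENT: My lower back hurts."

def draft_note(transcript: str) -> str:
    # Stub standing in for an LLM prompted to write a SOAP-style note.
    subjective = transcript.replace("\n", " ")
    return f"Subjective: {subjective}\nAssessment/Plan: [clinician to review]"

if __name__ == "__main__":
    print(draft_note(transcribe_visit("visit_audio.wav")))
```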
Findings of the current study “may generate some pause in the growing acceptance of AI scribes,” wrote Aaron Tierney, PhD, and Kristine Lee, MD, both of the Permanente Medical Group in Oakland, California, in an accompanying editorial. “Note quality will need to be balanced against the positive effect AI scribes seem to have on clinician burnout and documentation burden.”
Though simulated assessments in the current study are “thought provoking,” Tierney and Lee added, “future assessments of AI scribes must focus on real-world performance rather than using standardized cases, clearly define the relative importance of the domains of quality, and include patient perspectives.”
Across all 10 attributes measured, AI notes scored lower than human notes, Reddy and colleagues found. The largest differences were for being thorough (–1.2, P≤0.001), organized (–1.1, P≤0.001), and useful (–1.0, P≤0.01), while smaller differences were seen for being free from bias (–0.7, P≤0.05) and free from hallucination (–0.9, P≤0.01).
“Although deficits were modest in absolute terms, the pattern was consistent across domains, suggesting that observed quality gaps were broad rather than isolated,” Reddy and colleagues noted. “These findings highlight that although ambient AI scribes can generate complete notes, the overall quality remains broadly below that of human-authored documentation.”
For their study, the researchers used five audio-recorded clinical scenarios. Eleven AI vendors generated notes from these recordings, and blinded clinical raters benchmarked the quality of those notes against notes prepared by humans (three clinicians per case).
In addition to the use of simulated cases, limitations included the inability to interpret results for specific AI vendors and the small number of standardized primary care cases examined, "which may not capture the full range of clinical complexity," Reddy and colleagues noted. Furthermore, human-generated notes were not prepared under real-world constraints.
“Although ambient AI scribes hold promise for reducing clinician burden, rigorous and ongoing evaluation of their quality is essential to ensure that these tools enhance rather than compromise the quality of clinical care,” they concluded.
Source link : https://www.medpagetoday.com/practicemanagement/practicemanagement/120829
Publish date : 2026-04-17 16:00:00