- Clinical notes generated by artificial intelligence (AI) scored lower on quality than those prepared by humans in this cross-sectional study of five simulated primary care cases.
- The largest difference in quality occurred in a case in which a clinician and patient discussed back pain amid background noise.
- Though AI scribes have “emerged as a promising approach to alleviating documentation burden,” the researchers said that “concerns remain about accuracy, completeness, and style.”
Clinical notes generated by artificial intelligence (AI) scored lower on quality than those prepared by humans, a cross-sectional evaluation of five simulated primary care cases showed.
Notes were rated on 10 attributes (such as being accurate, thorough, organized, and comprehensible) using the modified Physician Documentation Quality Instrument (PDQI-9). The tool rates each attribute on a 5-point scale, for a maximum total score of 50.
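For readers who want the arithmetic concrete, here is a minimal sketch of how a modified PDQI-9 total could be tallied. The attribute names are assumptions pieced together from this article and the standard PDQI-9; the paper's exact modified instrument may differ, and `pdqi_total` is a hypothetical helper, not code from the study.

```python
# Minimal sketch of scoring a note on a modified PDQI-9 as described above:
# 10 attributes, each rated on a 5-point scale, summed to a maximum of 50.
# The attribute list below is an assumption, not the paper's exact instrument.

ATTRIBUTES = [
    "up_to_date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "free_from_bias",
    "free_from_hallucination",
]

def pdqi_total(ratings: dict[str, int]) -> int:
    """Sum the ten 1-5 attribute ratings into a single quality score (max 50)."""
    missing = set(ATTRIBUTES) - ratings.keys()
    if missing:
        raise ValueError(f"missing attribute ratings: {sorted(missing)}")
    for attr in ATTRIBUTES:
        if not 1 <= ratings[attr] <= 5:
            raise ValueError(f"{attr} must be rated 1-5")
    return sum(ratings[attr] for attr in ATTRIBUTES)

# Example: a note rated 4 on every attribute totals 40 of a possible 50.
print(pdqi_total({attr: 4 for attr in ATTRIBUTES}))
```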
The largest quality gap between AI and human notes occurred in a case in which a clinician and patient discussed back pain amid substantial background noise: human notes scored 43.8 versus 20.3 for AI (P≤0.001), Ashok Reddy, MD, MSc, of the Veterans Affairs Puget Sound Health Care System in Seattle, and colleagues reported in Annals of Internal Medicine and at the American College of Physicians meeting in San Francisco.
Humans also performed significantly better than AI in two other cases. One was a chest pain encounter in which both the patient and clinician were masked, where human notes scored 42.2 versus 34.8 for AI (P≤0.05). The other was a nurse care manager encounter with a patient who had heart failure (38.4 vs 32.8, P≤0.05).
Humans scored better in two other cases as well, but the differences weren't statistically significant (human vs AI scores):
- Pharmacy encounter (patient with accented voice): 39.6 vs 36.3
- New patient (patient and clinician with accented voices): 41.8 vs 38.7
The study, which included 11 different AI scribe tools, comes as roughly two-thirds of physicians already use some form of AI, primarily for documentation and messaging, according to recent surveys.
Indeed, ambient AI scribes, which transcribe and summarize clinical visits using large language models, have “emerged as a promising approach to alleviating documentation burden,” Reddy and colleagues noted. However, “concerns remain about accuracy, completeness, and style.”
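As a rough illustration of that two-stage design, the sketch below stubs out the transcription and summarization steps. It is not any vendor's actual implementation; every function name here is hypothetical, and the stubs return canned text so the example runs.

```python
# Hypothetical two-stage ambient-scribe pipeline: speech-to-text, then an LLM
# drafts a structured note from the transcript. Both stages are stubbed with
# canned output; real systems would call ASR and LLM services here.

def transcribe_visit(audio_path: str) -> str:
    # Stub standing in for a diarizing speech-to-text model.
    return "CLINICIAN: What brings you in today?\nPATIENT: My lower back hurts."

def draft_note(transcript: str) -> str:
    # Stub standing in for an LLM prompted to write a SOAP-style note.
    subjective = transcript.replace("\n", " ")
    return f"Subjective: {subjective}\nAssessment/Plan: [clinician to review]"

if __name__ == "__main__":
    print(draft_note(transcribe_visit("visit_audio.wav")))
```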
Findings of the current study “may generate some pause in the growing acceptance of AI scribes,” wrote Aaron Tierney, PhD, and Kristine Lee, MD, both of the Permanente Medical Group in Oakland, California, in an accompanying editorial. “Note quality will need to be balanced against the positive effect AI scribes seem to have on clinician burnout and documentation burden.”
Though simulated assessments in the current study are “thought provoking,” Tierney and Lee added, “future assessments of AI scribes must focus on real-world performance rather than using standardized cases, clearly define the relative importance of the domains of quality, and include patient perspectives.”
Across all 10 attributes measured, AI notes scored lower than human notes, Reddy and colleagues found. The largest differences were for being thorough (–1.2, P≤0.001), organized (–1.1, P≤0.001), and useful (–1.0, P≤0.01), while smaller differences were seen for being free from bias (–0.7, P≤0.05) and free from hallucination (–0.9, P≤0.01).
“Although deficits were modest in absolute terms, the pattern was consistent across domains, suggesting that observed quality gaps were broad rather than isolated,” Reddy and colleagues noted. “These findings highlight that although ambient AI scribes can generate complete notes, the overall quality remains broadly below that of human-authored documentation.”
For their study, the researchers used five audio-recorded clinical scenarios. Eleven AI vendors generated notes from these recordings, and blinded clinical raters benchmarked the quality of those notes against notes prepared by humans (three clinicians per case).
In addition to the use of simulated cases, limitations included the inability to interpret results for specific AI vendors and the small number of standardized primary care cases examined, "which may not capture the full range of clinical complexity," Reddy and colleagues noted. Furthermore, human-generated notes were not prepared under real-world constraints.
“Although ambient AI scribes hold promise for reducing clinician burden, rigorous and ongoing evaluation of their quality is essential to ensure that these tools enhance rather than compromise the quality of clinical care,” they concluded.
Source link : https://www.medpagetoday.com/practicemanagement/practicemanagement/120829
Publish date : 2026-04-17 16:00:00