AI Generally · Monday, 04 May 2026 · 3 min read

Peer-Reviewed Science Study: OpenAI's o1 Correctly Diagnosed 67% of ER Triage Cases, Ahead of Two Attending Physicians

A peer-reviewed study in Science found OpenAI's o1 model correctly diagnosed 67% of emergency-room triage cases versus 55% and 50% for two attending physicians — but researchers say prospective trials are essential before any clinical deployment.

A peer-reviewed study published in Science has found that OpenAI's o1 model reached the correct diagnosis in 67% of emergency-room triage cases when given unedited electronic health records. That figure surpassed both attending physicians in the same study, who reached accurate diagnoses in 55% and 50% of cases respectively.

The research, led by teams at Harvard Medical School and Beth Israel Deaconess Medical Center with Stanford collaborators, represents one of the most methodologically rigorous clinical assessments of a large language model in a real diagnostic setting to date. It is already generating significant debate about what it means for medicine, liability, and the future of clinical AI.

How the Study Was Designed

The researchers presented 76 real emergency-room cases — drawn directly from Beth Israel's electronic health record system — to both OpenAI's o1 model and two internal medicine attending physicians. Crucially, the case data was presented in its raw, uncleaned form: exactly as it appeared in the EHR, without preprocessing or curation. Two independent physicians then assessed each set of diagnoses blind, without knowing whether a given assessment had come from the AI or a human.

The o1 model performed at or above the physician baseline at every stage of the diagnostic process — initial triage, first physician contact, and admission decisions — with its strongest advantage showing at initial triage, where information is most limited. The model proved particularly effective on rare and complex cases where physicians tend to anchor early on common diagnoses.

Arjun Manrai, a senior co-author at Harvard's Blavatnik Institute, noted that the field has effectively run out of multiple-choice benchmarks to test these systems on: "We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we're already at the ceiling." The triage study was an attempt to move evaluation into messier clinical territory.

What the Researchers Are — and Are Not — Claiming

The authors are careful to draw a line between their findings and a recommendation for deployment. Adam Rodman, a senior author and Beth Israel physician, called explicitly for controlled clinical trials before any real-world use. The study captures diagnostic accuracy in a structured retrospective review; it does not capture the full clinical encounter, which includes physical examination, patient affect, the emotional weight of delivering a serious diagnosis, and the legal accountability that attaches to a licensed physician.

Rodman also pointed to a finding that suggests AI might augment physicians rather than replace them: a separate December 2025 study found that 67% of physicians changed their treatment recommendations after seeing AI-generated assessments. That suggests an augmentation model, in which AI flags cases for review or offers a differential diagnosis for physician consideration, may be more immediately viable than autonomous AI practice.

Emergency physician Kristen Panthagani offered one of the sharpest methodological critiques: the study compared o1 against internal medicine attendings rather than emergency medicine specialists. ER clinicians, she argued, are trained primarily to rule out immediately life-threatening conditions, a different task from naming the exact correct diagnosis. The choice of comparison group may therefore have depressed the physician scores relative to what emergency specialists would have shown.

The Accountability Gap

Beyond the diagnostic numbers, the study surfaces a problem the healthcare system has not yet solved. "There's no formal framework right now for accountability" around AI diagnoses, Rodman told reporters. If an AI-assisted diagnosis leads to patient harm, the question of who bears responsibility — the physician who accepted the AI's suggestion, the hospital that deployed the tool, or the model developer — remains legally unresolved in most jurisdictions.

Thomas Buckley of Harvard Medical School noted that the o1 model achieved "nearly optimal diagnosis" on cases drawn from a benchmark that has been used in clinical education since 1959, which gives a sense of the depth of the reasoning the model is now capable of producing. But as Peter Brodeur, a clinical fellow involved in the research, observed, diagnostic reasoning is likely an easier task than management reasoning — deciding what to do once a diagnosis is established — and the study did not evaluate the latter.

The Harvard findings land in a healthcare AI landscape that is moving fast regardless of academic caution. AI diagnostic tools are already being piloted in radiology, pathology, and dermatology departments across the United States and Europe. The Science publication is likely to accelerate both deployment discussions and regulatory attention, particularly in the EU where the AI Act classifies medical AI systems as high-risk.
