AI Delivers More Accurate ER Diagnoses Than Doctors, Harvard Study Finds

Despite the promising results, the study does not conclude that AI is ready for real-world decision-making in emergency rooms.

Image credit: Chetan Jha / MIT Sloan Management Review Middle East

    In high-stakes emergency rooms, where seconds shape outcomes, a new study suggests machines may be closing in on one of medicine’s most human strengths: diagnosis.

    A study by Harvard University, conducted with Beth Israel Deaconess Medical Center and published in Science, finds that advanced AI models can outperform human physicians in diagnosing patients in emergency settings. The research shows that systems like OpenAI’s o1 delivered more accurate triage decisions than doctors in certain scenarios—raising a critical question: if AI can diagnose better under pressure, what does that mean for the future of clinical judgment?

    Published in late April, the study measured how OpenAI's o1 and 4o models performed compared with human physicians. One experiment focused on 76 patients who came into the Beth Israel emergency room; the diagnoses made by two internal medicine attending physicians were compared with those generated by OpenAI's o1 and 4o models.

    “At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o,” the study said, noting that the differences “were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision.”

    The diagnoses were then assessed by two other attending physicians, who were not told which were made by humans and which by AI.

    Harvard Medical School, in a press release for the study, stated that the researchers did not “pre-process the data at all”: the models were presented with the same information available in the electronic medical records at the time of each diagnosis.

    Given that same information, the o1 model offered “the exact or very close diagnosis” in 67% of triage cases, compared with one physician who gave the exact or close diagnosis 55% of the time and another who hit the mark 50% of the time.

    “We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said Arjun Manrai, head of an AI lab at Harvard Medical School and co-lead author of the study.

    Despite the promising results, the study does not conclude that AI is ready for real-world decision-making in emergency rooms. Rather, it notes an “urgent need for prospective trials to evaluate these technologies in real-world patient care settings.”

    It further clarifies that it only examined how models perform with text-based information, and that “existing studies suggest that current foundation models are more limited in reasoning over nontext inputs.”

    Kristen Panthagani, an emergency physician, posted about the study and called for specialty-matched comparisons. “If we’re going to compare AI tools to physicians’ clinical ability, we should start by comparing them to physicians who actually practice that specialty,” Panthagani said. “I would not be surprised if an LLM could beat a dermatologist at a neurosurgery board exam, [but] that’s not a particularly helpful thing to know.”

    “As an ER doctor seeing a patient for the first time, my primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you,” she added.

    So far, AI in healthcare has focused on improving patient outcomes and operational efficiency through early disease diagnosis (radiology, oncology), accelerated drug discovery, and personalized treatment plans. It also enables real-time patient monitoring, virtual health assistants, robotic surgery, and administrative automation, such as the generation of clinical notes.
