Emergency rooms gave OpenAI’s o1 a much harder diagnostic test than a medical exam question.
In a blinded study using patient records from actual hospital visits, the AI system diagnosed emergency-room cases more accurately than two attending physicians. The findings, published in Science by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center, suggest medical AI testing is starting to move closer to the messy conditions of real care.
For hospitals, researchers, and regulators, the study raises a bigger question: how much should diagnostic accuracy count when safety, testing decisions, and clinician judgment still shape what happens to a patient?
The result researchers didn’t expect
Adam Rodman, one of the study’s senior authors and a Harvard Medical School assistant professor of medicine at Beth Israel Deaconess, went into the emergency-room experiment with low expectations.
“I thought it was going to be a fun experiment but that it wouldn’t work that well. That was not at all what happened,” Rodman said.
The ER experiment included 76 cases from Beth Israel Deaconess Medical Center and compared OpenAI’s o1 with GPT-4o and a physician baseline. The biggest gap appeared at the earliest point in care. At initial triage, when patient information was most limited, o1 identified the exact or a very close diagnosis in 67.1% of cases. The two attending physicians scored 55.3% and 50%.
The system also stayed ahead later in the ER process, including after the physician encounter and at admission to the medical floor or ICU.
Real records, real uncertainty
The ER portion of the study left polished medical case studies behind and tested the model under conditions closer to the way diagnosis unfolds in a hospital.
“To better understand real-world performance, we needed to test performance early in the patient course, when clinical data is sparse,” said co-first author Thomas Buckley.
The emergency department cases were drawn directly from electronic health records, and Harvard said the team did not preprocess the data before feeding it into the model. The AI had to work through incomplete, unstructured patient information at the same early moments when clinicians are still trying to figure out what is happening.
Not a green light for AI doctors
The researchers are not arguing that AI should diagnose patients on its own. The Science paper says the findings make the case for prospective clinical trials, along with more study of how clinicians and AI systems should work together in real care settings.
Diagnostic accuracy is only one piece of emergency care. Doctors still have to weigh risks, order tests, decide whether a patient should be admitted, and act on details that may never appear cleanly in a record.
“A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” said Peter Brodeur, co-first author of the study and a Harvard Medical School clinical fellow in medicine at Beth Israel Deaconess. “Humans should be the ultimate baseline when it comes to evaluating performance and safety.”
The next test will be whether systems like o1 can improve care when doctors are using them, not just whether they can beat doctors on a diagnostic scorecard.