Emergency doctors make high-stakes decisions in fast-paced, often chaotic situations. They have to figure out which patient most urgently needs care, what’s wrong, and what to do next.
AI could lend a hand. In a series of challenging scenarios, OpenAI’s o1-preview model matched or exceeded doctors in clinical reasoning. Debuted in 2024, the AI is a large language model similar to those powering ChatGPT, Claude, Gemini, and other popular chatbots.
But when it was first developed, o1-preview differed in its ability to “think” through problems before answering. Such reasoning models explore multiple strategies, check themselves, and revise answers before offering a conclusion. This is a little closer to how humans solve problems.
Given case reports from an established database, o1-preview reached the correct diagnosis, or one very close to it, nearly 89 percent of the time. In real-world emergency room scenarios, the AI outperformed physicians at the triage stage, where doctors decide which patient needs treatment first.

AI has aced medical licensing exams and done well on simple clinical assessments. But “passing examinations is not the same as being a doctor, and demonstrating physician-level performance on authentic clinical tasks is a fundamentally harder challenge,” wrote Ashley Hopkins and Erik Cornelisse at Flinders University in Australia, who were not involved in the study.
This doesn’t mean that o1-preview is ready for the clinic or is about to replace physicians. Instead of a human-versus-machine spectacle, the study was more focused on setting a higher bar for systems designed to work alongside people. Like everyone else, doctors are incorporating AI into their work. Whether that improves or hinders care is an open question.
“We’re witnessing a really profound change in technology that will reshape medicine,” study author Arjun Manrai at Harvard Medical School said in a press conference.
AI, MD
The dream of AI in healthcare spans decades. Over 65 years ago, physicians proposed a benchmark for machine “doctors.” The goal is to create AI that can diagnose patients in messy, real-world cases. But use in clinics, where decisions have real consequences, is a high bar.
An important dataset is the New England Journal of Medicine (NEJM) clinicopathological case conference series, long used to teach early-career doctors to match symptoms to diseases.
It’s a tough job. Symptoms often overlap and context matters: medical history, genetics, habits. Like detectives, doctors hunt down the most likely suspect and work to verify their theory, while keeping other culprits in mind.
As a test of diagnostic ability, the NEJM dataset has thwarted generations of computer systems. Some learned from misdiagnosis; others relied on pre-programmed rules. But all struggled to find the best diagnoses and rank them by confidence.
Then along came large language models. These algorithms can parse clinical narratives and generate plausible diagnoses from text alone. OpenAI’s GPT-4 model, for example, could handle some cases from NEJM. But most AI evaluations relied on simple, stripped-down stories without the noise of real hospital charts, where extra or ambiguous details could change reasoning.
A meaningful human baseline was missing. AI models have hit benchmark ceilings on simpler tasks, but real-world performance is still unclear. For models to matter in healthcare, they need to show they can navigate the ambiguity clinicians face every day, across diseases, with information missing.
Ace Student
The team pitted o1-preview against physicians and GPT-4 across five experiments.
The first used the NEJM dataset. The researchers gave AI models tightly controlled prompts. “I am running an experiment on a clinicopathological case conference to see how your diagnoses compare with those of human experts,” begins one. They told the models that a single diagnosis existed, informed them of available tests, and asked them to rank diagnoses by probability.
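As a rough illustration of what such a tightly controlled prompt might look like, here is a minimal Python sketch. Only the opening sentence is quoted from the study's prompt as reported above. The function name `build_case_prompt`, its parameters, and the wording of every other instruction line are assumptions made for illustration, not the researchers' actual text.

```python
# Opening sentence as quoted in the article. Everything else is hypothetical.
OPENING_LINE = (
    "I am running an experiment on a clinicopathological case conference "
    "to see how your diagnoses compare with those of human experts."
)


def build_case_prompt(case_text: str, available_tests: list[str],
                      max_diagnoses: int = 5) -> str:
    """Assemble a controlled prompt with the three elements the article describes:
    a single diagnosis exists, which tests are available, and a request for a
    probability-ranked list of diagnoses. The phrasing is illustrative only."""
    if not case_text.strip():
        raise ValueError("case_text must not be empty")

    tests = ", ".join(available_tests) if available_tests else "none listed"
    parts = [
        OPENING_LINE,
        "Assume the case has a single final diagnosis.",  # hypothetical wording
        f"Available tests: {tests}.",  # hypothetical wording
        f"List up to {max_diagnoses} candidate diagnoses, "
        "ranked from most to least probable.",  # hypothetical wording
        "Case:",
        case_text.strip(),
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up case text, for demonstration only.
    print(build_case_prompt(
        "A 45-year-old presents with fever.",
        ["chest X-ray", "blood culture"],
        max_diagnoses=3,
    ))
```

Keeping the instructions fixed and swapping in only the case text is one simple way to make sure every model sees the same framing, which is the point of a controlled prompt.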
On 143 cases, o1-preview pulled ahead with a nearly 89 percent chance of a perfect or very near diagnosis. GPT-4 scored 73 percent. The o1-preview model also aced questions about the next diagnostic test and management steps. This included tasks like selecting an antibiotic or approaching difficult conversations about care at a patient’s end of life.
The gap widened on harder cases. Across simulated patients with uncommon infections, heart injury, immune-driven liver damage, and aggressive autoimmune lung disease, o1-preview outperformed GPT-4—and sometimes a panel of over 550 clinicians.
Next came the biggest challenge: Cases involving actual patients.
“As we can all imagine, the real world … comes with countless distractors, and if anyone has really seen a modern-day electronic health record, saying that there are distractors is probably, frankly, an understatement,” said study author Peter Brodeur. “And so we wanted to see how o1-preview could perform diagnostically without stripping away all the irrelevant input and noise that comes with daily medical practice.”
When the team fed o1-preview 70 emergency room cases randomly selected from a Boston hospital, the model surpassed two expert physicians across scenarios: triage, exams, chart review, and admit-or-discharge decisions. In a blinded review, evaluators couldn’t reliably distinguish the AI’s output from the physicians’. Importantly, o1-preview could explain the reasoning behind its final assessment and show how it weighed supporting or refuting evidence.
More information helped everyone. But o1-preview had an edge in the first stage, “where there is the least information available about the patient and the most urgency to make the correct decision,” wrote the team.
What Comes Next?
Doctors don’t diagnose from charts alone. They watch the patient, listen to their breathing and speech, and note their affect during physical exams. But o1-preview relied solely on text documented by others. Newer models—like GPT-5.3 and Gemini 3.1 Pro—can take in images, audio, even video. In principle, that brings them closer to how clinicians actually work.
But to be clear, o1-preview isn’t ready for the real world. Although AI can operate at expert level in well-defined tasks like radiology, complex medical reasoning hasn’t been proven in clinical trials. “We need to evaluate this technology now” in rigorous trials, said Manrai.
Also, diagnostic reasoning is only one part of medicine. Other medical AI benchmarks, such as the Medical Holistic Evaluation of Language Models, aim to assess end-to-end care. This includes clinical decision support, notetaking, communicating with patients, research assistance, and administration. The next step is to test AI systems in supervised clinical settings to see how they perform under guidance, like a medical intern.
OpenAI jumped the gun here. Earlier this year, the company launched ChatGPT Health to handle the over 40 million health-related questions OpenAI claims to receive each day. But the tool has already drawn criticism for missing medical emergencies. Other AI titans are joining the race.
Accuracy isn’t the only bar for clinical deployment. Medical AI has also shown racial bias that resulted in worse outcomes. For AI to change healthcare, it “must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring,” wrote Hopkins and Cornelisse.








