There are more AI health tools than ever—but how well do they work?


Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.

Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out almost a year ago and is now outdated. 

Earlier this month, Google released a study that meets Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE’s diagnoses were just as accurate as physicians’, and none of the conversations raised major safety concerns for researchers. 

Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.

Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There’s lots of reasons that the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”

The key there is “third party.” No matter how extensively companies evaluate their own products, it’s tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.

OpenAI’s Singhal says he’s strongly in favor of external evaluation. “We try our best to support the community,” he says. “Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.” 

Given how expensive it is to produce a high-quality evaluation, he says, he’s skeptical that any individual academic laboratory would be able to produce what he calls “the one evaluation to rule them all.” But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites, such as Stanford’s MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI’s GPT-5 holds the highest MedHELM score.


