Humanity’s Last Exam Stumps Top AI Models—and That’s a Good Thing


How do you translate a Roman inscription found on a tombstone? How many pairs of tendons are supported by one bone in hummingbirds? Here is a chemical reaction that requires three steps: What are they? Based on the latest research on Tiberian pronunciation, identify all syllables ending in a consonant sound from this Hebrew text.

These are just a few example questions from the latest attempt to measure the capabilities of large language models, the algorithms that power ChatGPT and Gemini. They’re getting “smarter” in specific domains—math, biology, medicine, programming—and developing a sort of common sense.

Much as schools rely on the dreaded standardized tests we all endured, researchers have long relied on benchmarks to track AI performance. But as cutting-edge algorithms now regularly score over 90 percent on such tests, older benchmarks are increasingly obsolete.

An international team has now developed a new kind of SAT for language models. Dubbed Humanity’s Last Exam (HLE), the test has 2,500 challenging questions spanning math, the humanities, and the natural sciences. A human expert crafted and carefully vetted each question so the answers are unambiguous and can’t easily be found online.

Although the test captures some general reasoning in models, it measures task performance, not “intelligence.” The exam focuses on expert-level academic problems, which are a far cry from the messy scenarios and decisions we face daily. But as AI increasingly floods many research fields, the HLE benchmark offers an objective way to measure how the models are improving.

“HLE no doubt offers a useful window into today’s AI expertise,” wrote MIT’s Katherine Collins and Joshua Tenenbaum, who were not involved in the study. “But it is by no means the last word on humanity’s thinking or AI’s capacity to contribute to it.”

Moving Scale

It seems that AI has steadily become smarter over the past few years. But what exactly does “smart” mean for an algorithm?

A common way to measure AI “smarts” is to challenge different AI models—or upgraded versions of the same model—with standardized benchmarks. These collections of questions cover a wide range of topics and can’t be answered with a simple web search. They require both an extensive representation of the world and, more importantly, the ability to use it to answer questions. It’s like taking a driver’s license test: You can memorize the entire handbook of rules and regulations but still need to figure out who has the right of way in any given scenario.

However, benchmarks are only useful if they still stump AI. And the models have become expert test takers. Cutting-edge large language models are posting near-perfect scores across benchmark tests, making the tests less effective at detecting genuine advances.

The problem “has grown worse because as well as being trained on the entire internet, current AI systems can often search for information online during the test,” essentially learning to cheat, wrote Collins and Tenenbaum.

Working with the nonprofit Center for AI Safety and Scale AI, the HLE Contributors Consortium designed a new benchmark tailor-made to stump AI. They asked thousands of experts from 50 countries to submit graduate-level questions in specific fields. The questions come in two formats: exact-match, where a model’s answer must precisely reproduce the reference solution, and multiple-choice. Both formats make it easy to score test results automatically.
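To illustrate why these two formats lend themselves to automatic scoring, here is a minimal Python sketch. The field names, normalization, and grading logic are illustrative assumptions, not the benchmark’s actual code.

```python
# Minimal sketch of automatic grading for two answer formats (exact-match and
# multiple-choice). Field names and normalization are illustrative assumptions.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting quirks aren't counted as errors."""
    return " ".join(text.lower().split())

def grade(question: dict, model_answer: str) -> bool:
    """Return True if the model's answer counts as correct."""
    if question["answer_type"] == "multiple_choice":
        # Multiple choice: compare the chosen option letter directly.
        return model_answer.strip().upper() == question["answer"].strip().upper()
    # Exact match: the response must reproduce the reference solution.
    return normalize(model_answer) == normalize(question["answer"])

def score(questions: list[dict], model_answers: list[str]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(grade(q, a) for q, a in zip(questions, model_answers))
    return correct / len(questions)

# Example: one multiple-choice item and one exact-match item.
qs = [
    {"answer_type": "multiple_choice", "answer": "B"},
    {"answer_type": "exact_match", "answer": "12"},
]
print(score(qs, ["b", "13"]))  # 0.5: the first answer is correct, the second is not
```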

Notably, the team avoided questions requiring longer or open-ended answers, such as writing a scientific paper or a legal brief, where there isn’t a single correct answer or a clear way to gauge whether a response is right.

They chose questions in a multi-step process to gauge difficulty and originality. Roughly 70,000 submissions were tested on multiple AI models. Only those that stumped models advanced to the next stage, where experts judged their usefulness for AI evaluation using strict guidelines.
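As a rough sketch of that first filtering stage, the loop below keeps only submissions that every tested model gets wrong. The model names and query helper are placeholders rather than the consortium’s actual pipeline, and grading reuses the grade function from the earlier sketch.

```python
# Rough sketch of the first filtering stage: only submissions that stump every
# tested model advance to expert review. Model names and ask_model() are
# placeholders; grade() is the scoring sketch shown earlier.

FRONTIER_MODELS = ["model_a", "model_b", "model_c"]  # stand-ins for real systems

def ask_model(model: str, question: dict) -> str:
    """Placeholder: query a model and return its answer. Replace with a real API call."""
    return ""  # an empty stub so the sketch runs end to end

def stumps_all_models(question: dict) -> bool:
    """A submission advances only if every tested model answers it incorrectly."""
    return all(not grade(question, ask_model(m, question)) for m in FRONTIER_MODELS)

def first_stage_filter(submissions: list[dict]) -> list[dict]:
    """From the ~70,000 submissions, keep only those no model could answer."""
    return [q for q in submissions if stumps_all_models(q)]
```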

The team has released 2,500 questions from the HLE collection. They’ve kept the rest private to prevent AI systems from gaming the test by doing well on questions they’ve already seen.

When the team first released the test in early 2025, leading AI models from Google, OpenAI, and Anthropic scored in the single digits. The test subsequently caught the eye of AI companies, many of which adopted it to demonstrate the performance of new releases. Newer algorithms have shown some improvement, though even leading models still struggle. OpenAI’s GPT-4o scored a measly 2.7 percent, whereas GPT-5’s success rate rose to 25 percent.

A New Standard?

Like IQ tests and standardized college admission exams, HLE has come under fire. Some people object to the test’s bombastic name, which could lead the general public to misunderstand an AI’s capabilities compared to human experts.

Others question what the test actually measures. Expertise across a wide range of academic fields and model improvement are obvious answers. However, HLE’s current curation inherently limits “the most challenging and meaningful questions that human experts engage with,” which require thoughtful responses, often across disciplines, that can hardly be captured with short answers or multiple-choice questions, wrote Collins and Tenenbaum.

Expertise also involves far more than answering existing questions. Beyond solving a given problem, experts can also evaluate whether the question makes sense—for example, if it has answers the test-maker didn’t consider—and gauge how confident they are of their answers.

“Humanity is not contained in any static test, but in our ability to continually evolve both in asking and answering questions we never, in our wildest dreams, thought we would—generation after generation,” Subbarao Kambhampati, former president of the Association for the Advancement of Artificial Intelligence, who was not involved in the study, wrote on X.

And although an increase in HLE score could be due to fundamental advances in a model, it could also be because model-makers gave an algorithm extra training on the public dataset—like studying the previous year’s exam questions before a test. In this case, the exam mainly reflects the AI’s test performance, not that it has gained expertise or “intelligence.”

The HLE team embraces these criticisms and is continuing to improve the benchmark. Others are developing completely different scales. Using human tests to benchmark AI has been the norm, but researchers are looking into other approaches that could better capture an AI’s scientific creativity or its collaborative thinking with humans in the real world. What counts as AI intelligence, and how to measure it, remains a hot topic of debate.

Despite its shortcomings, HLE is a useful way to measure AI expertise. But looking forward, “as the authors note, their project will ideally make itself obsolete by forcing the development of innovative paradigms for AI evaluation,” wrote Collins and Tenenbaum.


