Debates over AI benchmarking have reached Pokémon


Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. In other words, comparing models as they’re released is unlikely to get any easier.




