Forget Code: AI Is Learning to Hack Society

AI’s hacking skills are big news at the moment, but finding vulnerabilities in code may be the least of our worries. A new study suggests AI models can discover potentially damaging loopholes in the rules and regulations underpinning society.

Modern AI systems are powerful optimizers. Give them a goal, and they’ll pursue it relentlessly, quickly discovering solutions that would take a human years to find. But they are also incredibly literal in the way they approach a problem. They will do exactly what you tell them and are incapable of reading between the lines in the ways a human would.

This tendency leads to a recurring problem known as “reward hacking,” where an AI finds some loophole to maximize its performance on the metric used to measure success without actually achieving what its designers intended. The classic example is the AI that discovered it could win a boat racing videogame by looping around in circles collecting power-ups rather than completing the course.

The problem is partly due to humans being bad at specifying their goals. And unfortunately, it seems this weakness exists in the rules and regulations used to run society. When researchers let popular large language models loose in 72 simulated regulatory environments, the models found 60 percent of known loopholes and even identified some entirely new exploits.

“Within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery,” the authors write in a non-peer-reviewed paper published on arXiv. “Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent.”

The regulatory environments the researchers created were primarily based on rules governing things like pharmaceutical patents, NBA salary caps, and deep-sea mining. In each case, Alibaba’s Qwen3 model was given the relevant rules, an explanation of its task, a predefined set of actions it could take, and the system used to score different outcomes.

A more powerful model, Google’s Gemini-3-flash, then simulated the consequences of different actions Qwen3 took and judged if and when it had found a way to exploit the rules of the game. When that occurred, the larger model patched the loophole by adding new rules, and the smaller model was set loose again. Over many iterations, the models to discover increasingly subtle workarounds.

When building their regulatory environments, the researchers omitted real-world fixes that regulators had used to close known loopholes. Over many trials, Qwen3 rediscovered more than 60 percent of these exploits. In a simulation of pharmaceutical patent regulations, the two models ended up replaying the same sequence of loophole discovery and regulatory reform that occurred in the real world.

Crucially, their behavior emerged spontaneously without the researchers asking the algorithms to cheat the system. This is a byproduct of the popular reinforcement learning approach the researchers used, where a model is rewarded for getting closer to a specific, numerically-defined goal.

Worryingly, the team found that existing safety measures offered little protection. Both models are designed to refuse prompts featuring harmful language, but loophole-seeking behavior slipped under the radar. When asked to self-critique their own behavior, the models identified fewer than 40 percent of their own exploits.

The researchers note that the same capabilities could be used more proactively to scour proposed regulations for loopholes before enactment. But lead author Wei Liu, a PhD student at King’s College London, says there are always likely to be gaps. “In the real world,” he told Science, “society is a huge, complicated reward function that can’t ever be patched to a perfect status.”

Adding to the concern, the models used in this study were far from the frontier, suggesting that more powerful AI could be even more adept at regulatory hacking. Whether our existing institutions can adapt quickly enough to this emerging threat is an open question.

Source link