UK Institute Is Hunting for Dangers Lurking in AI

On a recent Tuesday in an Edwardian government building along Parliament Square in London, four artificial intelligence experts were busy tricking an A.I. chatbot into sharing instructions for making the deadly bioweapon anthrax.

In various ways, the experts asked the chatbot to give a list of needed ingredients. When the system declined — “I’m sorry I can’t help with that” — they used a custom algorithm to bombard the A.I. tool with thousands of automated questions and prompts.

Eventually, the A.I. caved. It provided a detailed list of materials and equipment, along with a step-by-step recipe for making the lethal mixture at home. (The New York Times agreed to withhold the name of the A.I. system for safety reasons.)

“There are some questions that you definitely don’t want the model to give the answer to,” said Xander Davies, a 25-year-old American who leads what is known as a red team at Britain’s A.I. Security Institute. “We try really hard to get the answers out.”

Mr. Davies and his red team, who simulate attacks on A.I. systems, also recently broke through the safeguards on OpenAI’s newest ChatGPT chatbot, coaxing it into providing hacking tips in about six hours. After finding problems, they share results with the companies.

“They try to fix it, report something back to us,” said Mr. Davies, a computer scientist who chose to work at the institute instead of in a tech job in San Francisco after attending Harvard. “They actually strengthen their system with us.”

A mix of weapons inspectors, epidemiologists and code breakers, the A.I. Security Institute is one of the world’s largest and best-funded government efforts dedicated to probing the technology’s potentially catastrophic risks.

The institute’s roughly 100 employees — drawn from British intelligence agencies, academia and tech companies — have found major safety gaps in every leading A.I. model they have tested, including Anthropic’s Claude and Google’s Gemini. Created nearly three years ago, the organization said it had coaxed A.I. systems into sharing instructions for making chemical and biological weapons, and for planning and executing cyberattacks. It publishes its research and also works with Britain’s national security agencies to identify and prepare for emerging threats.

Now, the institute’s work is becoming a blueprint for other governments as concerns about A.I. safety grow. The Trump administration is considering rules for vetting A.I. models that have some similarities to the approach pioneered by the British group. With many governments lacking the technical understanding to police the technology and reliant on big tech firms to self-regulate, the institute may offer a different path, one in which A.I. experts bring real technological know-how into government decision-making.

“Companies can’t be left to mark their own homework,” Rishi Sunak, the former British prime minister who created the institute, said in an interview. “That is the job of democratic institutions.”

In April, Anthropic announced a new A.I. model, Mythos, which it did not make public because of fears it could find and exploit cybersecurity flaws in global networks. The British institute was the only non-American government organization to receive access to the model for safety testing. Its findings, released six days after Mythos was announced, were widely cited by security experts.

The United States has its own A.I. safety group, the Center for A.I. Standards and Innovation. But the British version, backed by 360 million pounds of government money, equal to about $480 million, is larger and better funded than its U.S. counterpart, which will receive about $10 million this year. Australia, Canada, China, France, India, Japan and Singapore have formed similar institutes.

Even so, global investment in A.I. safety has paled against the vast sums for building and commercializing the technology. OpenAI, Anthropic and Google have teams working on safety controls, but outside researchers regularly find dangerous gaps. Academics in Italy recently tricked an A.I. model into providing bomb-related instructions using poetry.

Governments have largely not created systems dedicated to reviewing A.I. for safety and security risks, as they have for industries such as drug development or car manufacturing.

“The thing that keeps me up at night is the relative speed of the technology compared to the institutions like governments that have to respond,” said Jade Leung, an A.I. adviser for Prime Minister Keir Starmer and the chief technology officer of the A.I. Security Institute.

The British security institute originated from a 2023 meeting at 10 Downing Street between Mr. Sunak and three of the world’s highest-profile A.I. leaders — OpenAI’s Sam Altman, Anthropic’s Dario Amodei and Google DeepMind’s Demis Hassabis. Mr. Sunak recalled them saying that A.I.’s abilities were accelerating, with profound implications for government, jobs and national security.

“The pace of development was surprising even to them,” he said.

In November 2023, Mr. Sunak announced the creation of the institute at a summit of world leaders on A.I. safety at Bletchley Park, where Alan Turing and others broke German encryption codes during World War II.

The institute has become a template for others, said Olivia Shen, director of the strategic technologies program at the United States Studies Center, an Australian think tank at the University of Sydney. Last year, Ms. Leung of the British institute traveled to Australia to meet with government leaders. This year, Australia opened its own A.I. security center.

“Governments need to play catch-up,” said Ms. Shen, who helped organize the visit. “At the pace of where the technology is coming, governments are losing pace every day.”

The British institute works on the most serious potential risks from advanced A.I.: cyberthreats, chemical and biological weapons, and the manipulation of human behavior. In recent weeks, it found that A.I. models from Anthropic and OpenAI could complete a complex, 32-step corporate network attack far more quickly than a skilled human hacker, who would usually need 20 hours.

Another research area is studying whether A.I. models recognize when they are being tested and alter their behavior, a development that would signal A.I.’s level of awareness and capacity to deceive.

Adam Beaumont, the A.I. Security Institute’s interim director, said a major fear was the technology’s mimicry of human behavior. Last year, the institute published a study that found that chatbots can swing people’s political opinions.

“A lot of people in this building are looking at each of those things,” said Mr. Beaumont, a former top A.I. officer at GCHQ, Britain’s intelligence, security and cyber agency.

Many fear the institute’s work is insufficient. The British group has no regulatory power, and its researchers do not receive information about how top A.I. models are trained and created. It keeps a lot of its research private, sharing it only with certain government agencies and companies.

Recruiting is also a challenge. Other than senior leaders, its workers can earn up to £145,000 a year, or about $195,000. Many have walked away from multimillion-dollar pay packages at A.I. companies to do what some called a government “tour of duty.”

Ian Hogarth, a tech investor who co-founded the institute, was an early backer of Anthropic. To avoid a conflict of interest, he sold his Anthropic stake after he joined. The A.I. start-up could soon be worth $900 billion, up from about $4 billion at the beginning of 2023.

“I’ve got a mortgage, so it wasn’t a trivial decision at all,” said Mr. Hogarth, 44, who is now chair of the institute. He added that it was an “expensive” choice, but the right one.

“I believe in the importance of getting the technology right and believe the government has a role to play,” he said.
