Ahead of an artificial intelligence conference held last April, peer reviewers considered papers written by “Carl” alongside other submissions. What the reviewers did not know was that, unlike the other authors, Carl wasn’t a scientific researcher but an AI system built by the tech company Autoscience Institute, which says the model can accelerate artificial intelligence research. And at least according to the humans involved in the review process, the papers were good enough for the conference: In the double-blind peer review, three of the four papers authored by Carl (with varying levels of human input) were accepted.
Carl joins a growing group of so-called “AI scientists,” which include Robin and Kosmos, research agents developed by the San Francisco-based nonprofit research lab FutureHouse, and The AI Scientist, introduced by the Japanese company Sakana AI, among others. AI scientists are built from multiple large language models, but they differ from chatbots: Carl, for example, is devised to generate and test ideas and produce findings, said Eliot Cowan, co-founder of Autoscience Institute. Companies say these AI-driven systems can review literature, devise hypotheses, conduct experiments, analyze data, and produce novel scientific findings with varying degrees of autonomy.
The goal, said Cowan, is to develop AI systems that can increase efficiency and scale up the production of science. And other companies, like Sakana AI, have said they believe AI scientists are unlikely to replace human ones.
Still, the automation of science has stirred a mix of concern and optimism among the AI and scientific communities. “You start feeling a little bit uneasy, because, hey, this is what I do,” said Julian Togelius, a professor of computer science at New York University who works on artificial intelligence. “I generate hypotheses, read the literature.”
Critics of these systems, including scientists who themselves study artificial intelligence, worry that AI scientists could displace the next generation of researchers, flood the system with low-quality or untrustworthy data, and erode trust in scientific findings. The advancements also raise a question about where AI fits into the inherently social and human scientific enterprise, said David Leslie, director of ethics and responsible innovation research at The Alan Turing Institute in London. “There’s a difference between the full-blown shared practice of science and what’s happening with a computational system.”
In the last five years, automated systems have already led to important scientific advances. For example, AlphaFold, an AI system developed by Google DeepMind, was able to predict the three-dimensional structures of proteins with high resolution more quickly than scientists in the lab. The developers of AlphaFold, Demis Hassabis and John Jumper, won a 2024 Nobel Prize in Chemistry for their protein prediction work.
Now companies have expanded to integrate AI into other aspects of scientific discovery, creating what Leslie calls computational Frankensteins. The term, he says, refers to the convergence of various generative AI infrastructure, algorithms, and other components used “to produce applications that attempt to simulate or approximate complex and embodied social practices (like practices of scientific discovery).” In 2025 alone, at least three companies and research labs—Sakana AI, Autoscience Institute, and FutureHouse (which launched a commercial spinoff called Edison Scientific in November)—have touted their first “AI-generated” scientific results. Some US government scientists have also embraced artificial intelligence: Researchers at three federal labs, Argonne National Laboratory, Oak Ridge National Laboratory, and Lawrence Berkeley National Laboratory, have developed AI-driven, fully automated materials laboratories.
Indeed, these AI systems, like large language models, could potentially be used to synthesize literature and mine vast amounts of data to identify patterns. They may be particularly useful in materials science, where AI systems can design or discover new materials, and in understanding the physics of subatomic particles.
Systems can “basically make connections between millions, billions, trillions of variables” in ways that humans can’t, said Leslie. “We don’t function that way, and so just in virtue of that capacity, there are many, many opportunities.” For example, FutureHouse’s Robin mined literature and identified a potential therapeutic candidate for a condition that causes vision loss, proposed experiments to test the drug, and then analyzed the data.
But researchers have also raised red flags. While Nihar Shah, a computer scientist at Carnegie Mellon University, is “more on the optimistic side” about how AI systems can enable new discoveries, he also worries about AI slop, or the overflow of the scientific literature with AI-generated studies of poor quality and little innovation. Researchers have also pointed out other important caveats regarding the peer review process.
In a recent study that is yet to be peer reviewed, Shah and colleagues tested two AI models that aid in the scientific process: Sakana’s AI Scientist-v2 (an updated version of the original) and Agent Laboratory, a system developed by AMD, a semiconductor company, in collaboration with Johns Hopkins University, to perform research assistant tasks. Shah’s goal with the study was to examine where these systems might be failing.
One AI system, the AI Scientist-v2, reported 95 and sometimes even 100 percent accuracy on a specified task, which was impossible given that the researchers had intentionally introduced noise into the dataset. Both systems also appeared to sometimes fabricate synthetic datasets to run their analyses on, while stating in the final report that the analysis was done on the original dataset. To address this, Shah and his team developed an algorithm to flag the methodological pitfalls they identified, such as cherry-picking favorable datasets and selectively reporting positive results.
Some research suggests generative AI systems have also failed to produce innovative ideas. One study concluded that the generative AI chatbot ChatGPT-4 can only produce incremental discoveries, while a study published last year in Science Immunology found that, despite being able to synthesize the literature accurately, AI chatbots failed to generate insightful hypotheses or experimental proposals in the field of vaccinology. (Sakana AI and FutureHouse did not respond to requests for comment.)
Even if these systems continue to be used, humans’ place in the lab will likely not disappear, Shah said. “Even if AI scientists become super-duper duper capable, still there’ll be a role for people, but that itself is not entirely clear,” said Shah, “as to how capable will AI scientists be and how much would still be there for humans?”
Historically, science has been a deeply human enterprise, which Leslie described as an ongoing process of interpretation, world-making, negotiation, and discovery. Importantly, he added, that process is dependent on the researchers themselves and the values and biases they hold.
A computational system trained to predict the best answer, in contrast, is categorically distinct, Leslie said. “The predictive model itself is just getting a small slice of a very complex and deep, ongoing practice, which has got layers of institutional complexity, layers of methodological complexity, historical complexity, layers of discrimination that have arisen from other injustices that define who gets to do science, who doesn’t get to do science, and what science has done for whom, and what science has not done because people aren’t sending to have their questions answered.”
Rather than as a substitute for scientists, some experts see AI scientists as an additional, augmentative tool for researchers to help draw out insights, much like a microscope or a telescope. Companies also say they do not intend to replace scientists. “We do not believe that the role of a human scientist will be diminished. If anything, the role of a scientist will change and adapt to new technology, and move up the food chain,” Sakana AI wrote when the company announced its AI Scientist.
Now researchers are beginning to ponder what the future of science might look like alongside AI systems, including how to vet and validate their output. “We need to be very reflective about how we classify what’s actually happening in these tools, and if they’re harming the rigor of science as opposed to enriching our interpretive capacity by functioning as a tool for us to use in rigorous scientific practice,” said Leslie.
Going forward, Shah proposed, journals and conferences should vet AI research output by auditing log traces of the research process and generated code to both validate the findings and identify any methodological flaws. And companies, such as Autoscience Institute, say they are building systems to make sure that experiments hold to the same ethical standards as “an experiment run by a human at an academic institution would have to meet,” said Cowan. Some of the standards baked into Carl, Cowan noted, include preventing false attribution and plagiarism, facilitating reproducibility, and not using human subjects or sensitive data, among others.
While some researchers and companies are focused on improving the AI models, others are stepping back to ask how the automation of science will affect the people currently doing the research. Now is a good time to begin grappling with such questions, said Togelius. “We got the message that AI tools that make us better at doing science, that’s great. Automating ourselves out of the process is terrible,” he added. “How do we do one and not the other?”
This article was originally published on Undark. Read the original article.