All major AI models risk encouraging dangerous science experiments


Scientific laboratories can be dangerous places

PeopleImages/Shutterstock

The use of AI models in scientific laboratories risks enabling dangerous experiments that could cause fires or explosions, researchers have warned. Such models offer a convincing illusion of understanding but are susceptible to missing basic and vital safety precautions. In tests of 19 cutting-edge AI models, every single one made potentially deadly mistakes.

Serious accidents in university labs are rare but certainly not unheard of. In 1997, chemist Karen Wetterhahn was killed by dimethylmercury that seeped through her protective gloves; in 2016, an explosion cost one researcher her arm; and in 2014, a scientist was partially blinded.

Now, AI models are being pressed into service in a variety of industries and fields, including research laboratories where they can be used to design experiments and procedures. AI models designed for niche tasks have been used successfully in a number of scientific fields, such as biology, meteorology and mathematics. But large general-purpose models are prone to making things up and answering questions even when they have no access to data necessary to form a correct response. This can be a nuisance if researching holiday destinations or recipes, but potentially fatal if designing a chemistry experiment.

To investigate the risks, Xiangliang Zhang at the University of Notre Dame in Indiana and her colleagues created a test called LabSafety Bench that can measure whether an AI model identifies potential hazards and harmful consequences. It includes 765 multiple-choice questions and 404 pictorial laboratory scenarios that may include safety problems.

In multiple-choice tests, some AI models, such as Vicuna, scored almost as low as would be seen with random guesses, while GPT-4o reached as high as 86.55 per cent accuracy and DeepSeek-R1 as high as 84.49 per cent accuracy. When tested with images, some models, such as InstructBlip-7B, scored below 30 per cent accuracy. The team tested 19 cutting-edge large language models (LLMs) and vision language models on LabSafety Bench and found that none scored more than 70 per cent accuracy overall.

Zhang is optimistic about the future of AI in science, even in so-called self-driving laboratories where robots work alone, but says models are not yet ready to design experiments. “Now? In a lab? I don’t think so. They were very often trained for general-purpose tasks: rewriting an email, polishing some paper or summarising a paper. They do very well for these kinds of tasks. [But] they don’t have the domain knowledge about these [laboratory] hazards.”

“We welcome research that helps make AI in science safe and reliable, especially in high-stakes laboratory settings,” says an OpenAI spokesperson, pointing out that the researchers did not test its leading model. “GPT-5.2 is our most capable science model to date, with significantly stronger reasoning, planning, and error-detection than the model discussed in this paper to better support researchers. It’s designed to accelerate scientific work while humans and existing safety systems remain responsible for safety-critical decisions.”

Google, DeepSeek, Meta, Mistral and Anthropic did not respond to a request for comment.

Allan Tucker at Brunel University of London says AI models can be invaluable when used to assist humans in designing novel experiments, but that there are risks and humans must remain in the loop. “The behaviour of these [LLMs] are certainly not well understood in any typical scientific sense,” he says. “I think that the new class of LLMs that mimic language – and not much else – are clearly being used in inappropriate settings because people trust them too much. There is already evidence that humans start to sit back and switch off, letting AI do the hard work but without proper scrutiny.”

Craig Merlic at the University of California, Los Angeles, says he has run a simple test in recent years, asking AI models what to do if you spill sulphuric acid on yourself. The correct answer is to rinse with water, but Merlic says he has found AIs always warn against this, incorrectly adopting unrelated advice about not adding water to acid in experiments because of heat build-up. However, he says, in recent months models have begun to give the correct answer.

Merlic says that instilling good safety practices in universities is vital, because there is a constant stream of new students with little experience. But he’s less pessimistic about the place of AI in designing experiments than other researchers.

“Is it worse than humans? It’s one thing to criticise all these large language models, but they haven’t tested it against a representative group of humans,” says Merlic. “There are humans that are very careful and there are humans that are not. It’s possible that large language models are going to be better than some percentage of beginning graduates, or even experienced researchers. Another factor is that the large language models are improving every month, so the numbers within this paper are probably going to be completely invalid in another six months.”

Topics:



Source link : https://www.newscientist.com/article/2511098-all-major-ai-models-risk-encouraging-dangerous-science-experiments/?utm_campaign=RSS%7CNSNS&utm_source=NSNS&utm_medium=RSS&utm_content=home

Author :

Publish date : 2026-01-15 10:36:00

Copyright for syndicated content belongs to the linked Source.
Exit mobile version