Biotech firm aims to create ‘ChatGPT of biology’ – will it work?


Basecamp researchers gathering genetic data in Malta (Credit: Greg Funnell)

A British biotech firm called Basecamp Research has spent the past few years collecting troves of genetic data from microbes living in extreme environments around the world, identifying more than a million species and nearly 10 billion genes new to science. It claims that this massive database of the planet’s biodiversity will help train a “ChatGPT of biology” that will answer questions about life on Earth – but there’s no guarantee this will work.

Jörg Overmann at the Leibniz Institute DSMZ in Germany, which houses one of the world’s most diverse collections of microbial cultures, says increasing known genetic sequences is valuable, but may not result in useful findings for things like drug discovery or chemistry without more information about the organisms from which they were collected. “I’m not convinced that in the end the understanding of really novel functions will be accelerated by this brute-force increase in the sequence space,” he says.

Recent years have seen researchers develop a number of machine learning models trained to identify patterns and predict relationships amid vast amounts of biological data. The most famous of these is AlphaFold, which can predict the 3D structure of a protein based only on genetic data, and earned its creators at Google DeepMind the 2024 Nobel prize in chemistry.

While such “generative biology” models have grown ever more complex since, they haven’t gotten much better, says Frances Ding at the University of California, Berkeley. One reason could be a lack of biodiverse data. “Current models in biology are trained on datasets that disproportionately represent well-studied species (e.g., E. coli, mice, humans), and these models are worse at predicting properties about sequences from other parts of the tree of life,” she says.

Researchers at Basecamp set out to address this biodiversity gap. The company’s growing database now contains samples from more than 120 sites in 26 countries, according to a report the company posted. Jonathan Finn, the company’s chief science officer, says the collection efforts focused on extreme environments that hadn’t yet been widely sampled, ranging from the frigid water beneath Arctic sea ice to jungle hot springs. “Most of the samples that we’ve been going after are prokaryotic samples: bacteria, microbes and their viruses,” says Finn. “I know we’ve got some fungi in there.”

Genetic analysis of these samples revealed differences in genes shared nearly universally across the tree of life – based on this, the company estimates the data contains information from more than 1 million species that don’t occur in public genomic datasets used to train AI biology models. These collectively contain around 9.8 billion newly identified genes, a 10-fold increase in the total number of known genes, each of which encodes a potentially useful protein, the researchers say.

“By showing these models a large piece of nature, they should have a better understanding of how biology works,” says Finn. “We’re trying to build a ChatGPT of biology.”

By some estimates, Earth hosts as many as a trillion microbial species, almost none of which are well characterised. So, it’s not hugely surprising the company identified so much new life. “It’s almost inevitable that if you explore more you get more different gene variants,” says Leopold Parts at the Wellcome Sanger Institute, UK.

But Basecamp is banking on the idea that all the new material could be valuable – and it’s not alone. “This is one of the most exciting things I’ve seen in a long time,” says Nathan Frey, a machine learning researcher at Genentech, a biotech firm in the US. In general, he says, work on AI models for biology has focused on improving algorithms or generating more data in labs rather than actually going out in the world and collecting samples.

However, there is reason to be sceptical that the database will lead to the radically improved models the company wants. For one, it remains unclear to what extent this new diversity of proteins represents valuable new functions, such as plastic-eating enzymes or proteins that could be repurposed for gene editing. “They have to show that this novelty is useful in some way,” says Parts.

Further, if the new genes really are substantially different from those we already know, Overmann doesn’t see how existing tools can easily predict their functions, or how the data can be used for training a new model. “You don’t have any clue what the majority of the genes do,” he says. The company could well have assembled a treasure trove of new biology, but without more old-fashioned laboratory work to understand what’s there it may remain mysterious, even to the most powerful AI.

Source link : https://www.newscientist.com/article/2484323-biotech-firm-aims-to-create-chatgpt-of-biology-will-it-work/?utm_campaign=RSS%7CNSNS&utm_source=NSNS&utm_medium=RSS&utm_content=home

Publish date : 2025-06-17 20:13:00

Copyright for syndicated content belongs to the linked Source.