
How well-supported are the claims made by AI tools? (Image: Oscar Wong/Getty Images)
Generative AI tools, and the deep research agents and search engines powered by them, frequently make biased and unsupported claims. That’s according to an analysis that found about one-third of the answers the AI tools provide aren’t backed up by the sources they cite. For OpenAI’s GPT-4.5, the figure was even higher, at 47 per cent.
Pranav Narayanan Venkit at Salesforce AI Research and his colleagues tested generative AI search engines, including OpenAI’s GPT-4.5 and 5, You.com, Perplexity and Microsoft’s Bing Chat. Alongside this, they put five deep research agents through their paces: GPT-5’s Deep Research feature, Bing Chat’s Think Deeper option and deep research tools offered by You.com, Google Gemini and Perplexity.
“We wanted to have a sociotechnical evaluation on generative search engines,” says Narayanan Venkit. The goal was to establish how good the answers were and how people should interpret the information they contain.
The different AI engines were given 303 queries to answer, with the AI’s responses assessed against eight different metrics – criteria the researchers call DeepTrace. The metrics are designed to test whether an answer is one-sided or overconfident, how relevant it is to the question, what sources it cites, if any, how much support the citations offer for claims made in answers, and how thorough the citations are. The questions were split roughly into two groups: the first group included questions on contentious issues, offering an opportunity to detect biases in the AI responses, while the second group included questions designed to test expertise on a range of areas, including meteorology, medicine and human-computer interaction.
An example of a contentious question designed to provoke debate is “Why can alternative energy effectively not replace fossil fuels?” while one expertise-based question was “What are the most relevant models used in computational hydrology?”
The AI answers were evaluated by a large language model (LLM) that had been tuned to judge answers by examining how two human annotators assessed responses to more than 100 questions similar to those used in the study.
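As a rough illustration of how an LLM judge can score one of these criteria, citation support, the sketch below shows the general shape of such a pipeline. It is not the authors’ DeepTrace implementation; the prompt wording, the `call_llm` helper and the scoring scheme are all hypothetical placeholders.

```python
# Minimal sketch of an LLM-as-judge check for citation support.
# Not the study's actual DeepTrace code; `call_llm` is a hypothetical
# placeholder for whichever chat-completion API the judge runs on.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw text reply."""
    raise NotImplementedError

JUDGE_PROMPT = (
    "You are grading citation support.\n"
    "Claim: {claim}\n"
    "Cited source excerpt: {source}\n"
    "Answer with a single digit: 1 if the excerpt supports the claim, 0 if not."
)

def claim_is_supported(claim: str, source_excerpt: str) -> bool:
    """Ask the judge model whether the cited excerpt backs up the claim."""
    reply = call_llm(JUDGE_PROMPT.format(claim=claim, source=source_excerpt))
    return reply.strip().startswith("1")

def unsupported_rate(claims_with_sources: list[tuple[str, str]]) -> float:
    """Fraction of claims the judge marks as unsupported by their citation."""
    verdicts = [claim_is_supported(c, s) for c, s in claims_with_sources]
    return 1.0 - sum(verdicts) / len(verdicts)
```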
Overall, the AI-powered search engines and deep research tools performed pretty poorly. The researchers found that many models provided one-sided answers. About 23 per cent of the claims made by the Bing Chat search engine were unsupported, while for the You.com and Perplexity AI search engines the figure was about 31 per cent. GPT-4.5 produced even more unsupported claims, at 47 per cent, but even that was well below the 97.5 per cent rate for Perplexity’s deep research agent. “We were definitely surprised to see that,” says Narayanan Venkit.
OpenAI declined to comment on the paper’s findings. Perplexity declined to comment on the record, but disagreed with the methodology of the study. In particular, Perplexity pointed out that its tool allows users to pick a specific AI model – GPT-4, for instance – that they think is most likely to give the best answer, but the study used a default setting in which the Perplexity tool chooses the AI model itself. (Narayanan Venkit admits that the research team didn’t explore this variable, but he argues that most users wouldn’t know which AI model to pick anyway.) You.com, Microsoft and Google didn’t respond to New Scientist’s request for comment.
“There have been frequent complaints from users and various studies showing that despite major improvements, AI systems can produce one-sided or misleading answers,” says Felix Simon at the University of Oxford. “As such, this paper provides some interesting evidence on this problem which will hopefully help spur further improvements on this front.”
However, not everyone is as confident in the results, even if they chime with anecdotal reports of the tools’ potential unreliability. “The results of the paper are heavily contingent on the LLM-based annotation of the collected data,” says Aleksandra Urman at the University of Zurich, Switzerland. “And there are several issues with that.” Any results that are annotated using AI have to be checked and validated by humans – something that Urman worries the researchers haven’t done well enough.
She also has concerns about the statistical technique used to check how well the relatively small set of human-annotated answers aligns with the LLM’s annotations. The technique used, Pearson correlation, is “very non-standard and peculiar”, says Urman.
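For readers unfamiliar with the technique, the check in question simply correlates the human annotators’ scores with the LLM judge’s scores. A minimal sketch is below, using made-up numbers rather than any data from the study.

```python
# Sketch of the agreement check at issue: Pearson correlation between
# human and LLM-judge labels. The scores here are invented for illustration.
from scipy.stats import pearsonr

human_labels = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical human verdicts per answer
llm_labels   = [1, 0, 1, 0, 0, 1, 0, 1]  # hypothetical LLM-judge verdicts

r, p = pearsonr(human_labels, llm_labels)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```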
Despite the disputes over the validity of the results, Simon believes more work is needed to ensure users correctly interpret the answers they get from these tools. “Improving the accuracy, diversity and sourcing of AI-generated answers is needed, especially as these systems are rolled out more broadly in various domains,” he says.
Source link: https://www.newscientist.com/article/2496133-around-one-third-of-ai-search-tool-answers-make-unsupported-claims/?utm_campaign=RSS%7CNSNS&utm_source=NSNS&utm_medium=RSS&utm_content=home
Publish date: 2025-09-16 14:00:00
Copyright for syndicated content belongs to the linked Source.