Google's Gemini 3 model keeps the AI hype train going – for now

Google’s latest chatbot, Gemini 3, has made significant leaps on a raft of benchmarks designed to measure AI progress, according to the company. These achievements may be enough to allay fears of an AI bubble bursting for the moment, but it is unclear how well these scores translate to real-world capabilities.

What’s more, persistent factual inaccuracies and hallucinations that have become a hallmark of all large language models show no signs of being ironed out, which could prove problematic for any uses where reliability is vital.

In a blog post announcing the new model, Google bosses Sundar Pichai, Demis Hassabis and Koray Kavukcuoglu write that Gemini 3 has “PhD-level reasoning”, a phrase that competitor OpenAI also used when it announced its GPT-5 model. As evidence for this, they list scores on several tests designed to assess “graduate-level” knowledge, such as Humanity’s Last Exam, a set of 2500 research-level questions from maths, science and the humanities. Gemini 3 scored 37.5 per cent on this test, outclassing the previous record holder, a version of OpenAI’s GPT-5, which scored 26.5 per cent.

Jumps like this can indicate that a model has become more capable in certain respects, says Luc Rocher at the University of Oxford, but we need to be careful about how we interpret these results. “If a model goes from 80 per cent to 90 per cent on a benchmark, what does it mean? Does it mean that a model was 80 per cent PhD level and now is 90 per cent PhD level? I think it’s quite difficult to understand,” they say. “There is no number that we can put on whether an AI model has reasoning, because this is a very subjective notion.”

Benchmark tests have many limitations, such as requiring a single answer or multiple choice answers for which models don’t need to show their working. “It’s very easy to use multiple choice questions to grade [the models],” says Rocher, “but if you go to a doctor, the doctor will not assess you with a multiple choice. If you ask a lawyer, a lawyer will not give you legal advice with multiple choice answers.” There is also a risk that the answers to such tests were hoovered up in the training data of the AI models being tested, effectively letting them cheat.

The real test for Gemini 3 and the most advanced AI models – and whether their performance will be enough to justify the trillions of dollars that companies like Google and OpenAI are spending on AI data centres – will be in how people use the model and how reliable they find it, says Rocher.

Google says the model’s improved capabilities will make it better at producing software, organising email and analysing documents. The firm also says it will improve Google search by supplementing AI-generated results with graphics and simulations.

It is likely that the real improvements will be for people who use AI tools to autonomously write code, a process called agentic coding, says Adam Mahdi at the University of Oxford. “I think we’re hitting the upper limit of what a typical chatbot can do, and the real benefits of Gemini 3 Pro [the standard version of Gemini 3] will probably be in more complex, potentially agentic workflows, rather than everyday chatting,” he says.

Initial reactions online have included people praising Gemini’s coding capabilities and ability to reason, but as with all new model releases, there have also been posts highlighting failures at apparently simple tasks, such as tracing hand-drawn arrows pointing to different people or passing simple visual reasoning tests.

Google admits, in Gemini 3’s technical specifications, that the model will continue to hallucinate and produce factual inaccuracies some of the time, at a rate that is roughly comparable with other leading AI models. The lack of improvement in this area is a big concern, says Artur d’Avila Garcez at City St George’s, University of London. “The problem is that all AI companies have been trying to reduce hallucinations for more than two years, but you only need one very bad hallucination to destroy trust in the system for good,” he says.

Source link : https://www.newscientist.com/article/2505039-googles-gemini-3-model-keeps-the-ai-hype-train-going-for-now/?utm_campaign=RSS%7CNSNS&utm_source=NSNS&utm_medium=RSS&utm_content=home

Publish date : 2025-11-19 15:38:00

Copyright for syndicated content belongs to the linked Source.

