
AIs are getting better at maths problems (Image: Andresr/Getty Images)
Experimental AI models from Google DeepMind and OpenAI have achieved gold medal-level performance at the International Mathematical Olympiad (IMO) for the first time.
The companies are hailing the moment as an important milestone for AIs that might one day solve hard scientific or mathematical problems, but mathematicians are more cautious because details of the models’ results and how they work haven’t been made public.
The IMO, one of the world’s most prestigious competitions for young mathematicians, has long been seen by AI researchers as a litmus test for the kind of mathematical reasoning that AI systems tend to struggle with.
After last year’s competition, held in Bath, UK, Google DeepMind announced that AI systems it had developed, called AlphaProof and AlphaGeometry, had together achieved a silver medal-level performance, but its entries weren’t graded by the competition’s official markers.
Before this year’s contest, which was held in Queensland, Australia, companies including Google, Huawei and TikTok-owner ByteDance, as well as academic researchers, approached the organisers to ask whether their AI models’ performance could be officially graded, says Gregor Dolinar, the IMO’s president. The IMO agreed, with the proviso that the companies wait until 28 July, after the IMO’s closing ceremonies had been completed, to announce their results.
OpenAI also asked if it could participate in the competition, but after it was informed about the official scheme, it didn’t respond or register an entry, says Dolinar.
On 19 July, OpenAI announced that a new AI it had developed had achieved a gold medal-level score, graded by three former IMO medallists separately from the official competition. The AI answered five of the six questions correctly within the same 4.5-hour time limit as the human contestants, OpenAI said.
Two days later, Google DeepMind also announced that its AI system, called Gemini Deep Think, had achieved gold with the same score and time limit. Dolinar confirmed that this result was graded by the IMO’s official markers.
Unlike Google’s AlphaProof and AlphaGeometry systems, which were built especially for the competition and worked with questions and answers written in a computer programming language called Lean, both Google’s and OpenAI’s models this year worked entirely in natural language.
Working in Lean meant the AI’s output could be checked for correctness instantly, but it is harder for non-experts to read. Thang Luong at Google, who worked on Gemini Deep Think, says the natural-language approach could produce more understandable answers, as well as being applicable to more generally useful AI systems.
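To see the trade-off, here is a toy example of the kind of formally checkable statement Lean works with. It is invented for illustration and is not taken from AlphaProof or any IMO problem: the point is only that the proof is verified mechanically by Lean’s checker, so a reader doesn’t need to trust it, but the notation is less approachable than a written-out argument.

```lean
-- A toy illustration, not from AlphaProof or this year's systems:
-- a formally stated theorem whose proof Lean checks mechanically.
-- If the proof term were wrong, the file would simply fail to compile.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```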
Luong says the ability to verify solutions within a large language model has been made possible by progress in reinforcement learning, a training method in which an AI is told what success looks like and left to figure out the rules and how to succeed solely through trial and error. This method was key to Google’s previous success with its game-playing AIs, such as AlphaZero.
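As a rough illustration of that trial-and-error idea, the minimal sketch below (which bears no relation to how Gemini Deep Think or OpenAI’s model is actually trained) shows an agent that is only told whether each attempt succeeded, and over many tries learns to favour the action that works; the action names are invented for the example.

```python
# Minimal trial-and-error (epsilon-greedy bandit) sketch of reinforcement
# learning: the agent never sees the true odds, only a success/failure
# signal after each attempt, yet learns which action pays off.
import random

success_prob = {"guess": 0.1, "check_work": 0.8}   # true odds, hidden from the agent
value = {action: 0.0 for action in success_prob}   # the agent's learned estimates
counts = {action: 0 for action in success_prob}

for step in range(1000):
    # Explore occasionally; otherwise pick the action that has worked best so far.
    if random.random() < 0.1:
        action = random.choice(list(success_prob))
    else:
        action = max(value, key=value.get)
    reward = 1.0 if random.random() < success_prob[action] else 0.0
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]  # running average

print(value)  # "check_work" ends up with a much higher estimated value
```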
Google’s model also considers multiple candidate solutions at once, in a mode called parallel thinking, and was trained on a dataset of maths problems specifically relevant to the IMO, says Luong.
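In spirit, parallel thinking resembles drafting several candidate answers at the same time and keeping the one that scores best. The sketch below is purely hypothetical: attempt_solution and score_solution are stand-in functions invented for illustration, not part of any Google system, and how the real model generates and selects candidates has not been made public.

```python
# Hypothetical best-of-n sketch: draft several candidates concurrently,
# then keep the one a scoring step rates highest.
from concurrent.futures import ThreadPoolExecutor
import random

def attempt_solution(problem: str, seed: int) -> str:
    # Stand-in for generating one candidate solution.
    return f"candidate solution {seed} for: {problem}"

def score_solution(candidate: str) -> float:
    # Stand-in for checking or scoring a candidate.
    return random.random()

def solve_in_parallel(problem: str, n_candidates: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=n_candidates) as pool:
        candidates = list(pool.map(lambda s: attempt_solution(problem, s),
                                   range(n_candidates)))
    return max(candidates, key=score_solution)

print(solve_in_parallel("an IMO-style inequality"))
```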
OpenAI has released few details about its system, beyond saying that it also uses reinforcement learning and “experimental research methods”.
“The progress is promising, but not performed in a controlled scientific fashion, and so I will not be able to assess it at this stage,” says Terence Tao at the University of California, Los Angeles. “Perhaps once the companies involved release some papers with more data, and hopefully enough access to the model for others to replicate the results, one can say something more definitive, but, for now, we largely have to trust the companies themselves for the claimed results.”
Geordie Williamson at the University of Sydney in Australia agrees. “I think it is remarkable that this is where we’re at. It is frustrating how little detail outsiders are provided with regarding internals,” says Williamson.
While systems working in natural language could be useful for non-mathematicians, they could also present a problem if models produce long proofs that are hard to check, says Joseph Myers, one of the organisers of this year’s IMO. “If AIs are ever to produce solutions to significant unsolved problems that might plausibly be correct but might also have a few subtle but fatal errors hidden accidentally, or potentially deliberately from a misaligned AI, having those AIs also generate a formal proof is key to having confidence in the correctness of a long AI output before attempting to read it.”
Both companies say that, in the coming months, they will offer these systems for testing to mathematicians at first, before releasing them to the wider public. The models could soon help with harder scientific research problems, says Junehyuk Jung at Google, who worked on Gemini Deep Think. “There are going to be many, many unsolved problems within reach,” he says.
Source link : https://www.newscientist.com/article/2489248-deepmind-and-openai-claim-gold-in-international-mathematical-olympiad/?utm_campaign=RSS%7CNSNS&utm_source=NSNS&utm_medium=RSS&utm_content=home
Publish date : 2025-07-22 17:05:00