Move Over, Mathematicians, Here Comes AlphaProof

AlphaGeometry 2 solved Problem 4 of the Olympiad in 19 seconds, after receiving the formalized statement of the problem in AlphaGeometry’s own language. (Google DeepMind via The New York Times)

At the headquarters of Google DeepMind, an artificial intelligence laboratory in London, researchers have a long-standing ritual for announcing momentous results: They bang a big ceremonial gong.

In 2016, the gong sounded for AlphaGo, an AI system that excelled at the game Go. In 2017, the gong reverberated when AlphaZero conquered chess. On each occasion, the algorithm had beaten human world champions.

Last week the DeepMind researchers got out the gong again to celebrate what Alex Davies, a lead of Google DeepMind’s mathematics initiative, described as a “massive breakthrough” in mathematical reasoning by an AI system. A pair of Google DeepMind models tried their luck with the problem set of the 2024 International Mathematical Olympiad, or IMO, held from July 11 to July 22 about 100 miles west of London at the University of Bath. The event is billed as the premier math competition for the world’s “brightest mathletes,” according to a promotional post on social media.

The human problem-solvers — 609 high school students from 108 countries — won 58 gold medals, 123 silver and 145 bronze. The AI performed at the level of a silver medalist, solving four of the six problems for a total of 28 points. It was the first time that an AI had achieved a medal-worthy performance on an Olympiad’s problems.

“It’s not perfect, we didn’t solve everything,” Pushmeet Kohli, Google DeepMind’s vice president of research, said in an interview. “We want to be perfect.”

Nonetheless, Kohli described the result as a “phase transition” — a transformative change — “in the use of AI in mathematics and the ability of AI systems to do mathematics.”

The lab asked two independent experts to adjudicate the AI’s performance: Timothy Gowers, a mathematician at the University of Cambridge in England and a Fields medalist, who has been interested in the math-AI interplay for 25 years; and Joseph Myers, a software developer in Cambridge. Both won IMO gold in their day. Myers was chair of this year’s problem selection committee and at previous Olympiads served as a coordinator, judging human solutions. “I endeavored to assess the AI attempts consistently with how human attempts were judged this year,” he said.

Gowers added in an email: “I was definitely impressed.” The lab had discussed its Olympiad ambitions with him a couple of weeks beforehand, so “my expectations were quite high,” he said. “But the program met them, and in one or two instances significantly surpassed them.” The program found the “magic keys” that unlocked the problems, he said.

Hitting the gong

After months of rigorous training, the students sat for two exams, three problems per day — in algebra, combinatorics, geometry and number theory.

The AI counterpart beavered away roughly in tandem at the lab in London. (The students were not aware that Google DeepMind was competing, in part because the researchers did not want to steal the spotlight.) Researchers moved the gong into the room where they had gathered to watch the system work. “Every time the system solved a problem, we hit the gong to celebrate,” David Silver, a research scientist, said.

Haojia Shi, a student from China, ranked No. 1 and was the only competitor to earn a perfect score — 42 points for six problems; each problem is worth seven points for a full solution. The U.S. team won first place with 192 points; China placed second with 190.

The Google system earned its 28 points for fully solving four problems — two in algebra, one in geometry and one in number theory. (It flopped at two combinatorics problems.) The system was allowed unlimited time; for some problems, it took up to three days. The students were allotted only 4.5 hours per exam.

For the Google DeepMind team, speed is secondary to overall success, as it “is really just a matter of how much compute power you’re prepared to put into these things,” Silver said.

“The fact that we’ve reached this threshold, where it’s even possible to tackle these problems at all, is what represents a step-change in the history of mathematics,” he added. “And hopefully it’s not just a step-change in the IMO, but also represents the point at which we went from computers only being able to prove very, very simple things toward computers being able to prove things that humans can’t.”

Algorithmic ingredients

Applying AI to mathematics has been part of DeepMind’s mission for several years, often in collaboration with world-class research mathematicians.

“Mathematics requires this interesting combination of abstract, precise and creative reasoning,” Davies said. In part, he noted, this repertoire of abilities is what makes math a good litmus test for the ultimate goal: reaching so-called artificial general intelligence, or AGI, a system with capabilities ranging from emerging to competent to virtuoso to superhuman. Companies such as OpenAI, Meta AI and xAI are tracking similar goals.

Olympiad math problems have come to be considered a benchmark for measuring an AI system’s mathematical reasoning.

In January, a Google DeepMind system named AlphaGeometry solved a sampling of Olympiad geometry problems at nearly the level of a human gold medalist. “AlphaGeometry 2 has now surpassed the gold medalists in solving IMO problems,” Thang Luong, the principal investigator, said in an email.

Riding that momentum, Google DeepMind intensified its multidisciplinary Olympiad effort, with two teams: one led by Thomas Hubert, a research engineer in London, and another led by Luong and Quoc Le in Mountain View, each with some 20 researchers. For his “superhuman reasoning team,” Luong said he recruited a dozen IMO medalists — “by far the highest concentration of IMO medalists at Google!”

The lab’s effort at this year’s Olympiad deployed the improved version, AlphaGeometry 2. Not surprisingly, the model fared rather well on the geometry problem, polishing it off in 19 seconds.

Hubert’s team developed a new model that is comparable but more generalized. Named AlphaProof, it is designed to engage with a broad range of mathematical subjects. All told, AlphaGeometry and AlphaProof made use of a number of different AI technologies.

One approach was an informal reasoning system, expressed in natural language. This system leveraged Gemini, Google’s large language model, and was trained on an English-language corpus of published problems, proofs and the like.

The informal system excels at identifying patterns and suggesting what comes next; it is creative and talks about ideas in an understandable way. Of course, large language models are inclined to make things up — which may (or may not) fly for poetry but definitely not for math. In this context, though, the LLM seems to have displayed restraint; it wasn’t immune to hallucination, but the frequency was reduced.

Another approach was a formal reasoning system, based on logic and expressed in code. It used Lean, a theorem prover and proof assistant, which guarantees that if the system says a proof is correct, then it is indeed correct. “We can exactly check that the proof is correct or not,” Hubert said. “Every step is guaranteed to be logically sound.”
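
To give a sense of what that looks like in practice, here is a minimal, illustrative sketch of Lean 4 code with two toy statements (not Olympiad problems) and their machine-checked proofs; anything logically unsound would simply be rejected by the proof assistant.

    -- A toy, machine-checked proof in Lean 4 (an illustration, not an Olympiad solution).
    theorem add_comm_example (a b : Nat) : a + b = b + a := by
      exact Nat.add_comm a b   -- cite the library lemma Nat.add_comm

    -- The same guarantee applies to proofs built from scratch, here by induction.
    theorem zero_add_example (n : Nat) : 0 + n = n := by
      induction n with
      | zero => rfl
      | succ n ih => rw [Nat.add_succ, ih]

Because a Lean proof either checks or it doesn’t, there is no room for the plausible-sounding but invalid arguments a language model might otherwise produce.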

Another crucial component was a reinforcement learning algorithm in the AlphaGo and AlphaZero lineage. This type of AI learns by itself and can scale indefinitely, said Silver, who is Google DeepMind’s vice president of reinforcement learning. Since the algorithm doesn’t require a human teacher, it can “learn and keep learning and keep learning until ultimately it can solve the hardest problems that humans can solve,” he said. “And then maybe even one day go beyond those.”

Hubert added, “The system can rediscover knowledge for itself.” That’s what happened with AlphaZero: It started with zero knowledge, Hubert said, “and by just playing games, and seeing who wins and who loses, it could rediscover all the knowledge of chess. It took us less than a day to rediscover all the knowledge of chess, and about a week to rediscover all the knowledge of Go. So, we thought, let’s apply this to mathematics.”
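
As a rough, hypothetical illustration of that self-play idea (a short sketch in Python, not DeepMind’s actual training code), the example below has an agent play a trivial take-away game against a copy of itself, starting from zero knowledge and improving its move values using only win-or-lose feedback.

    # Toy self-play learning in the spirit of AlphaZero (illustration only).
    # Game: players alternately remove 1-3 stones from a pile of 10;
    # whoever takes the last stone wins.
    import random
    from collections import defaultdict

    values = defaultdict(float)          # learned value of (pile, move) for the player to move
    EPSILON, LEARNING_RATE = 0.1, 0.2

    def choose(pile):
        """Pick a move: usually the best-known one, occasionally a random one."""
        moves = [m for m in (1, 2, 3) if m <= pile]
        if random.random() < EPSILON:
            return random.choice(moves)
        return max(moves, key=lambda m: values[(pile, m)])

    def self_play_game():
        """Play one game against a copy of itself; return each side's moves and the winner."""
        pile, player, history = 10, 0, {0: [], 1: []}
        while True:
            move = choose(pile)
            history[player].append((pile, move))
            pile -= move
            if pile == 0:
                return history, player   # the player who took the last stone wins
            player = 1 - player

    for _ in range(20000):
        history, winner = self_play_game()
        for side, moves in history.items():
            reward = 1.0 if side == winner else -1.0
            for state_action in moves:   # nudge each move's value toward the game outcome
                values[state_action] += LEARNING_RATE * (reward - values[state_action])

    # With enough games, the learned policy tends toward the known optimal strategy:
    # leave the opponent a multiple of 4 stones whenever possible.
    print({pile: max((1, 2, 3), key=lambda m: values[(pile, m)]) for pile in range(5, 11)})

The same basic loop, scaled up enormously and applied to constructing formal proofs rather than taking stones, is the flavor of learning Silver describes.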

Gowers doesn’t worry — too much — about the long-term consequences. “It is possible to imagine a state of affairs where mathematicians are basically left with nothing to do,” he said. “That would be the case if computers became better, and far faster, at everything that mathematicians currently do.”

“There still seems to be quite a long way to go before computers will be able to do research-level mathematics,” he added. “It’s a fairly safe bet that if Google DeepMind can solve at least some hard IMO problems, then a useful research tool can’t be all that far away.”

A really adept tool might make mathematics accessible to more people, speed up the research process and nudge mathematicians outside the box. Eventually it might even pose novel ideas that resonate.

c.2024 The New York Times Company