PUTNAMBENCH: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition

15 Jul 2024 | George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, Swarat Chaudhuri
PUTNAMBENCH is a new multilingual benchmark designed to evaluate the capabilities of neural theorem-provers on competition-level mathematics problems. The benchmark consists of 1697 hand-constructed formalizations of 640 theorems from the William Lowell Putnam Mathematical Competition, a premier undergraduate-level mathematics competition in North America. All theorems are formalized in Lean 4 and Isabelle, with a substantial subset also formalized in Coq, making PUTNAMBENCH unique among benchmarks in covering all three popular formal proof languages. The problems require a broad range of mathematical knowledge and skills, spanning topics such as analysis, abstract algebra, and number theory.

The benchmark is evaluated using several established neural and symbolic theorem-provers, including GPT-4, COPRA, Sledgehammer, and CoqHammer. Despite the advanced capabilities of these methods, they solve only a handful of PUTNAMBENCH problems, highlighting the benchmark's difficulty and establishing it as a challenging open problem for the neural theorem-proving community. The paper discusses the design and features of PUTNAMBENCH, the formalization process, and the experimental evaluation using various theorem-proving approaches. While some problems can be solved, the benchmark as a whole remains far from solved, indicating the need for substantial advances in automated mathematical reasoning.
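To illustrate the kind of formalization the paper describes, here is a hedged sketch of how a competition-style statement might be rendered as a Lean 4 theorem. This is a hypothetical example constructed for illustration, not an actual PUTNAMBENCH entry; the theorem name and the use of a `sorry` placeholder for the proof are assumptions about the general style of such benchmarks, which typically provide statements and ask provers to supply the proofs.

```lean
import Mathlib

-- Hypothetical example (not from PUTNAMBENCH): a competition-style
-- identity — the sum of the first n odd numbers equals n² — stated
-- as a Lean 4 theorem. The proof is left as `sorry`, the placeholder
-- a theorem-prover would be asked to replace with a valid proof.
theorem putnam_style_example (n : ℕ) :
    ∑ k in Finset.range n, (2 * k + 1) = n ^ 2 := by
  sorry
```

A prover evaluated on such a statement must produce a complete proof term or tactic script that Lean's kernel accepts, which is what makes formal benchmarks like this one machine-checkable.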