Researchers are grappling with a crucial question: can AI truly understand mathematical concepts, or does it merely excel at recognizing patterns in data it has already seen? A group of 30 mathematicians gathered at Harvard took up the question by testing leading AI systems on novel mathematical problems that the systems had never been exposed to.
The initiative, dubbed First Proof, Second Batch, brought the scholars together at Harvard's Center of Mathematical Sciences and Applications in early June 2026. Their mission was straightforward but ambitious: to evaluate AI-generated solutions to 10 original, unpublished mathematics problems drawn from active research. The findings, announced on June 10, paint a mixed picture that fits neither the pessimistic nor the overly optimistic view of AI's mathematical abilities.
Why are unpublished problems significant in evaluating AI?
The entire study rests on one key design choice. Each problem was sourced from ongoing, unpublished research, so none had appeared in textbooks or accessible databases that an AI system could have drawn on. That makes it possible to assess the genuine problem-solving capabilities of these systems rather than their recall of known material.
The team behind the project included leading mathematicians such as Mohammed Abouzaid of Stanford and Lauren Williams of Harvard, bringing a high level of expertise to the assessment process.
What were the results of the AI evaluation?
Four major AI systems were tested, including prominent models from OpenAI and Google. Across those systems, expert evaluators awarded passing scores on seven of the ten problems. Preliminary assessments, by contrast, had indicated that the AI systems solved only two problems correctly. The jump from the early trials to the final evaluation suggests that the systems may have benefited from repeated attempts or varied prompting techniques, although under the blind grading system evaluators judged each submitted solution solely on its own merits.
Continuing to build on initial assessments
This recent evaluation builds on an earlier round conducted in February 2026. The First Proof initiative was conceived as an ongoing investigation rather than a one-off test. By introducing fresh problems in each round, the project seeks to track the progress of AI capabilities in advanced mathematics and to determine whether improvements are genuine or only a temporary surge in performance following rigorous benchmarks.
Traditional math benchmarks, while challenging, are increasingly within reach of the latest AI systems. Research-level mathematics is a different challenge altogether: its problems often lack known solutions and established methods. That difficulty is what makes testing AI on advanced mathematical reasoning both essential and intriguing.