hub Canonical reference

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning · 2024 · cs.AI · arXiv 2411.04872

Canonical reference. 70% of citing Pith papers cite this work as background.

26 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 26 citing papers arXiv PDF

abstract

We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 dataset 3

citation-polarity summary

background 7 use dataset 3

representative citing papers

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

FVSpec: Real-World Property-Based Tests as Lean Challenges

cs.SE · 2026-05-31 · conditional · novelty 7.0

A new benchmark of 9,415 Lean 4 specifications derived from 2,772 scraped Python property-based tests, plus a three-agent LLM transpilation pipeline and proof-generation baselines.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

MathDuels: Evaluating LLMs as Problem Posers and Solvers

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

Fine-Tuning Small Reasoning Models for Quantum Field Theory

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

cs.MS · 2026-04-08 · accept · novelty 7.0

k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.

DeonticBench: A Benchmark for Reasoning over Rules

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.

RMA: an Agentic System for Research-Level Mathematical Problems

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

cs.AI · 2026-03-27 · unverdicted · novelty 6.0

XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

cs.CL · 2025-09-19 · unverdicted · novelty 6.0

CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.

STAR-P\'olyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision

cs.MA · 2026-05-19 · unverdicted · novelty 5.0

STAR-PólyaMath introduces a multi-agent framework with meta-strategic supervision and state-machine orchestration that reports state-of-the-art and perfect scores on eight top math competition benchmarks.

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

cs.AI · 2026-04-06 · unverdicted · novelty 5.0

A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models

q-fin.TR · 2026-05-21 · unverdicted · novelty 4.0

MadEvolve uses LLMs for evolutionary optimization of trading strategies and reports significant backtest improvements on Bitcoin tasks including signal feature evolution and joint strategy optimization.

Artificial Intelligence and the Structure of Mathematics

cs.AI · 2026-04-07 · unverdicted · novelty 4.0

AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.

AI for Mathematics: Progress, Challenges, and Prospects

math.HO · 2026-01-19 · unverdicted · novelty 4.0

AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

citing papers explorer

Showing 26 of 26 citing papers.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 15 · 2 links · internal anchor
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems quant-ph · 2025-10-23 · accept · full · ref 4 · internal anchor
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark cs.AI · 2025-09-30 · unverdicted · none · ref 49 · internal anchor
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
FVSpec: Real-World Property-Based Tests as Lean Challenges cs.SE · 2026-05-31 · conditional · none · ref 19 · internal anchor
A new benchmark of 9,415 Lean 4 specifications derived from 2,772 scraped Python property-based tests, plus a three-agent LLM transpilation pipeline and proof-generation baselines.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 47 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics cs.AI · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
MathDuels: Evaluating LLMs as Problem Posers and Solvers cs.CL · 2026-04-23 · unverdicted · none · ref 39 · internal anchor
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
Fine-Tuning Small Reasoning Models for Quantum Field Theory cs.LG · 2026-04-21 · unverdicted · none · ref 3 · internal anchor
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems cs.AI · 2026-04-13 · unverdicted · none · ref 34 · internal anchor
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture cs.MS · 2026-04-08 · accept · none · ref 21 · internal anchor
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
DeonticBench: A Benchmark for Reasoning over Rules cs.CL · 2026-04-06 · unverdicted · none · ref 12 · internal anchor
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
RMA: an Agentic System for Research-Level Mathematical Problems cs.AI · 2026-05-20 · unverdicted · none · ref 38 · internal anchor
RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 49 · internal anchor
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
Agentic Frameworks for Reasoning Tasks: An Empirical Study cs.AI · 2026-04-17 · unverdicted · none · ref 55 · internal anchor
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation cs.AI · 2026-03-27 · unverdicted · none · ref 4 · internal anchor
XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics cs.CL · 2025-09-19 · unverdicted · none · ref 19 · internal anchor
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
STAR-P\'olyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision cs.MA · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
STAR-PólyaMath introduces a multi-agent framework with meta-strategic supervision and state-machine orchestration that reports state-of-the-art and perfect scores on eight top math competition benchmarks.
Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis cs.AI · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 213 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 19 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models q-fin.TR · 2026-05-21 · unverdicted · none · ref 3 · internal anchor
MadEvolve uses LLMs for evolutionary optimization of trading strategies and reports significant backtest improvements on Bitcoin tasks including signal feature evolution and joint strategy optimization.
Artificial Intelligence and the Structure of Mathematics cs.AI · 2026-04-07 · unverdicted · none · ref 45 · internal anchor
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.
AI for Mathematics: Progress, Challenges, and Prospects math.HO · 2026-01-19 · unverdicted · none · ref 58 · internal anchor
AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 163 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
Riemann-Bench: A Benchmark for Moonshot Mathematics cs.AI · 2026-04-08 · unreviewed · ref 9 · internal anchor
Automated Conjecture Resolution with Formal Verification cs.LG · 2026-04-04 · unreviewed · ref 21 · internal anchor

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer