Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
hub Canonical reference
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
A new benchmark of 9,415 Lean 4 specifications derived from 2,772 scraped Python property-based tests, plus a three-agent LLM transpilation pipeline and proof-generation baselines.
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
STAR-PólyaMath introduces a multi-agent framework with meta-strategic supervision and state-machine orchestration that reports state-of-the-art and perfect scores on eight top math competition benchmarks.
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
MadEvolve uses LLMs for evolutionary optimization of trading strategies and reports significant backtest improvements on Bitcoin tasks including signal feature evolution and joint strategy optimization.
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.
AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
-
Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
-
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.
-
FVSpec: Real-World Property-Based Tests as Lean Challenges
A new benchmark of 9,415 Lean 4 specifications derived from 2,772 scraped Python property-based tests, plus a three-agent LLM transpilation pipeline and proof-generation baselines.
-
Forecasting Scientific Progress with Artificial Intelligence
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
-
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
-
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
DeonticBench: A Benchmark for Reasoning over Rules
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
-
RMA: an Agentic System for Research-Level Mathematical Problems
RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.
-
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
-
Agentic Frameworks for Reasoning Tasks: An Empirical Study
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
-
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.
-
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
-
STAR-P\'olyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision
STAR-PólyaMath introduces a multi-agent framework with meta-strategic supervision and state-machine orchestration that reports state-of-the-art and perfect scores on eight top math competition benchmarks.
-
Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models
MadEvolve uses LLMs for evolutionary optimization of trading strategies and reports significant backtest improvements on Bitcoin tasks including signal feature evolution and joint strategy optimization.
-
Artificial Intelligence and the Structure of Mathematics
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.
-
AI for Mathematics: Progress, Challenges, and Prospects
AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
- Riemann-Bench: A Benchmark for Moonshot Mathematics
- Automated Conjecture Resolution with Formal Verification