hub Canonical reference

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning · 2024 · cs.AI · arXiv 2411.04872

Canonical reference. 70% of citing Pith papers cite this work as background.

45 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 45 citing papers arXiv PDF

abstract

We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 dataset 3

citation-polarity summary

background 7 use dataset 3

representative citing papers

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

ISOSCI benchmark finds 91.3% of reasoning-mode accuracy gains in LLMs on science problems depend on domain knowledge rather than invariant logical structure.

Beyond Shapley: Efficient Computation of Asymmetric Shapley Values

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

Polynomial-time exact ASV computation for rooted trees via topological equivalence classes, plus sampling approximation for arbitrary causal DAGs.

Automated Proving of Shannon-Type Entropy Inequalities via Fine-Tuned Language Models and Guided Tree Search

cs.IT · 2026-06-04 · unverdicted · novelty 7.0

Fine-tuned 0.6B LLMs with beam search achieve 85% success on 60 test Shannon entropy inequalities (n=10-15), outperforming GPT-5.5 (1.7%) and Psitip (33.3%).

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

GTBench is a new curriculum-grounded benchmark showing GPT-5 performs strongly on basic graph theory tasks but all models, including it, struggle more on advanced proofs with notable evaluator disagreements.

FVSpec: Real-World Property-Based Tests as Lean Challenges

cs.SE · 2026-05-31 · conditional · novelty 7.0

A new benchmark of 9,415 Lean 4 specifications derived from 2,772 scraped Python property-based tests, plus a three-agent LLM transpilation pipeline and proof-generation baselines.

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

MathDuels: Evaluating LLMs as Problem Posers and Solvers

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

Fine-Tuning Small Reasoning Models for Quantum Field Theory

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

cs.MS · 2026-04-08 · accept · novelty 7.0

k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.

DeonticBench: A Benchmark for Reasoning over Rules

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.

Data and Evaluation Closed-Loop for Model Capability Enhancement

cs.AI · 2026-06-26 · unverdicted · novelty 6.0

Proposes capability slices with dual taxonomies and mapping rules to form a closed loop converting benchmark failures into targeted data interventions, validated via two opposing case studies on BBH and math reasoning.

Theorist Toolbox: Tools for Agent Based LLM-assisted economic theory Research

econ.TH · 2026-06-21 · unverdicted · novelty 6.0

External verification structures, not model capability, determine the reliability of LLM-assisted economic theory, as shown in attempts to design an incentive mechanism for a grade inflation model where adversarial checks caught false claims.

Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model

cs.LG · 2026-06-21 · unverdicted · novelty 6.0

Boundary-aware Curriculum RL raises average pass@256 by 9.8 points over base models and 10.3 points over vanilla RLVR on Qwen, Llama, and DeepSeek families.

Learning the ARTS of Search for Automated Discovery

cs.AI · 2026-06-20 · unverdicted · novelty 6.0

ARTS improves automated scientific discovery by using reasoning LMs with test-time training to separate hypothesis merit from execution quality in tree search, achieving 15.3% relative gains on 22 MLGym and MLEBench tasks.

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

cs.AI · 2026-05-31 · conditional · novelty 6.0

LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 15 · 2 links · internal anchor
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs cs.CL · 2026-07-01 · unverdicted · none · ref 5 · internal anchor
ISOSCI benchmark finds 91.3% of reasoning-mode accuracy gains in LLMs on science problems depend on domain knowledge rather than invariant logical structure.
MathDuels: Evaluating LLMs as Problem Posers and Solvers cs.CL · 2026-04-23 · unverdicted · none · ref 39 · internal anchor
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
DeonticBench: A Benchmark for Reasoning over Rules cs.CL · 2026-04-06 · unverdicted · none · ref 12 · internal anchor
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
Self-Improving Language Models with Bidirectional Evolutionary Search cs.CL · 2026-05-27 · unverdicted · none · ref 6 · internal anchor
Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics cs.CL · 2025-09-19 · unverdicted · none · ref 19 · internal anchor
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
How reliable are LLMs when it comes to playing dice? cs.CL · 2026-06-05 · unverdicted · none · ref 9 · internal anchor
LLMs score 0.96 on standard probability exercises but 0.59 on counterintuitive ones and drop further with biased wording or misleading cues, indicating they are not genuine probabilistic reasoners.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 163 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer