pith. machine review for the scientific record. sign in

hub

Frontiermath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

hub tools

citation-role summary

background 3

citation-polarity summary

years

2026 13 2025 2

roles

background 3

polarities

background 3

clear filters

representative citing papers

MathDuels: Evaluating LLMs as Problem Posers and Solvers

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

DeonticBench: A Benchmark for Reasoning over Rules

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Automated Conjecture Resolution with Formal Verification

cs.LG · 2026-04-04 · unverdicted · novelty 6.0

An AI framework combining informal reasoning and formal verification resolves an open commutative algebra problem and produces a Lean 4-checked proof with minimal human input.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

Artificial Intelligence and the Structure of Mathematics

cs.AI · 2026-04-07 · unverdicted · novelty 4.0

AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.

citing papers explorer

Showing 1 of 1 citing paper after filters.