Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
Thomas, and Charles Vial
3 Pith papers cite this work. Polarity classification is still indexing.
Citation summary:
- Years: 2026 (3)
- Roles: background (2)
- Polarities: background (2)
citing papers explorer
- Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
  Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
- Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation
  This paper introduces a taxonomy of four LLM failure modes on research math proofs and empirically shows premise smuggling in all eight audited Gemini outputs, with a new audit instrument achieving 100% precision.
- Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
  ProofRank benchmark shows substantial differences in LLM proof quality not captured by correctness, with trade-offs between quality metrics and accuracy.