pith. machine review for the scientific record.

arxiv: 2605.09063 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Akari Asai, Akshelin R, Alexander B. Ivanov, Boboev Muhammadjon, Catherine Arnett, Chaeyoung Han, Christian Stump, Dmitrii Karp, Dohyun Kwon, DoYong Kwon, Duk-Soon Oh, Giovanni Resta, Graham Neubig, Greta Panova, Guijin Son, Hanearl Jung, Huiyun Noh, Hyein Lee, Hyeonah Kang, Hyungryul Baik, Hyungsun Bae, Hyunwoo Ko, Inomov Mashrafdzhon, Jeewon Kim, Jiang Longxi, Jiaqi Liu, Jieui Kang, Ji Eun Lee, Jimin Kim, Jin Yun, Jon-Lark Kim, JungYup Lee, Junseo Yoon, Junwoo Jo, Kibeom Kim, Kiwoon Kwon, Kyungmin Lee, Mario Kummer, Max Mercer, Minjun Kim, Nahyun Lee, Ng Ze-An, Rafał Marcin Łochowski, Raphaël Lachièze-Rey, Ruichen Zhang, Sam Yoosuk Kim, Sang Park, Sean Welleck, Sejin Park, Seonguk Seo, Seunghyeok Hong, Seungjae Lee, Seungone Kim, Seungyeop Yi, Shinae Shin, Shin Jaehoon, Sunatullo, SunHye Bok, Sunyoung Shin, Taewoong Eom, Yeachan Park, Yonghoon Ji, Yongseok Jang, Youchan Oh, Youngjae Yu, Youngtaek Kim, Zhaoyang Wang, Zoltán Kovács

Pith reviewed 2026-05-12 01:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · mathematical reasoning · research-level math · refusal capability · AI benchmarks

The pith

A benchmark of 439 original research-level math problems shows that frontier LLMs solve only about 30 percent of them and recognize ill-posed questions less than half the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Soohak, a collection of 439 mathematics problems written from scratch by 64 mathematicians to test whether large language models can perform the kind of reasoning that advances mathematical knowledge. The benchmark splits into a challenge set of solvable problems and a refusal set of deliberately ill-posed ones that require models to recognize when no justified answer exists. Top closed models reach only 30.4 percent on the challenge problems while open-weight models stay below 15 percent, and no model exceeds 50 percent on the refusal problems. This establishes measurable headroom and identifies refusal as a distinct capability that current training does not directly target.
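To make the two-subset design concrete, here is a minimal scoring sketch in Python. The `Problem` fields, the graders, and `query_model` are illustrative assumptions, not the paper's actual protocol, which this page does not specify.

```python
# Hypothetical scoring harness for a challenge/refusal split benchmark.
# Field names, graders, and query_model() are illustrative assumptions,
# not the paper's evaluation protocol.
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    subset: str          # "challenge" or "refusal"
    answer: str | None   # None for deliberately ill-posed (refusal) problems

def query_model(statement: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError

def is_correct(response: str, answer: str) -> bool:
    """Placeholder grader for challenge problems (e.g., expert or rubric check)."""
    raise NotImplementedError

def is_refusal(response: str) -> bool:
    """Placeholder check that the model flags the problem as ill-posed."""
    raise NotImplementedError

def evaluate(problems: list[Problem]) -> dict[str, float]:
    """Return per-subset accuracy: answer accuracy on challenge, refusal rate on refusal."""
    scores: dict[str, list[bool]] = {"challenge": [], "refusal": []}
    for p in problems:
        response = query_model(p.statement)
        if p.subset == "challenge":
            scores["challenge"].append(is_correct(response, p.answer))
        else:
            # Credit only when the model declines rather than answering.
            scores["refusal"].append(is_refusal(response))
    return {k: sum(v) / len(v) for k, v in scores.items() if v}
```

The point of the sketch is that the two subsets reward opposite behaviors: confident answering on the challenge split, deliberate non-answering on the refusal split.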

Core claim

Soohak comprises 439 problems newly authored from scratch by 64 mathematicians. On the Challenge subset, models such as Gemini-3-Pro reach 30.4 percent accuracy, GPT-5 26.4 percent, and Claude-Opus-4.5 10.4 percent. On the refusal subset, which tests recognition of ill-posed problems, no model exceeds 50 percent accuracy. This demonstrates substantial remaining headroom in research-level mathematical reasoning and identifies refusal as an unoptimized capability.

What carries the argument

The refusal subset, a collection of deliberately ill-posed problems that requires models to pause rather than produce confident but unjustified answers.

If this is right

  • Model development must target both higher accuracy on novel problems and the ability to decline ill-posed queries.
  • Refusal becomes an explicit optimization target rather than an incidental byproduct of training.
  • Benchmarks that separate solvable research problems from ill-posed ones will replace olympiad-style tests as the next standard for measuring progress.
  • Open-weight models will require additional work to close the gap with closed frontier systems on these tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training signals that reward refusal on bad problems could be added without harming performance on well-posed ones.
  • The benchmark could serve as a filter to test whether models can help explore open mathematical questions instead of only solving closed ones.
  • Persistent low refusal rates suggest current systems optimize for answer generation even when evidence is insufficient.

Load-bearing premise

The 439 problems are genuinely original, uncontaminated by training data, and correctly classified as research-level by the mathematicians who authored them.

What would settle it

A model scoring above 70 percent on either the challenge or the refusal subset, or discovery that a substantial fraction of the problems already appears in public training corpora.

Figures

Figures reproduced from arXiv:2605.09063 by the paper's authors (full author list above).

Figure 1. Item-flow through the SH2 collection pipeline. Each candidate item passes through submission under an originality and copyright agreement, automated screening with model-gated routing and similarity checks, manual review by two human reviewers, contributor-controlled opt-in, and final inclusion. The figure reports candidate counts at each stage. Banned creators denote contributors found to have submitted A…

Figure 2. Compute scaling on Challenge and Refusal and unsolved counts. Left: Pass@3 across the Qwen3 family (0.6B to 32B) on Challenge (blue) and Refusal (orange). Middle: test-time scaling on the same two splits for GPT-OSS-120B (solid) at three settings (medium-reasoning at 16,384 tokens, hard-reasoning at 16,384 tokens, and hard-reasoning at 81,920 tokens) and for Qwen3-235B-A22B-thinking-2507 (dashed) at two set…

Figure 3. Model and human-team accuracy on the 79-problem human-evaluation set. The left panel shows closed and open-weight models. The right panel shows individual human teams A through E plus their combined coverage. Only Gemini-3-Pro exceeds combined-human coverage at 50.6%. The strongest single team is Math Major with IMO experience. …

Figure 4. Model rankings across per-subset Pass@3 and the three composite scores. Lower is better, with rank 1 at top. To the right of the dotted separator, models that are good at reasoning but careless on Refusal drop in rank. Models that are careful but mid-capability rise. GLM-5 rises 3 ranks from Capability to Avg-R. Kimi-2.5 drops 3 ranks. GPT-5 takes the top Avg-R rank from Gemini-3-Pro despite Gemini's highe…
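Figure 2 reports Pass@3. One standard way to estimate pass@k from n sampled attempts with c correct is the unbiased combinatorial estimator sketched below; whether Soohak uses this estimator or simply takes the best of exactly three samples is not stated on this page, so treat the formulation as an assumption.

```python
# Unbiased pass@k estimator over n sampled attempts with c correct answers.
# This is a common formulation, not necessarily the paper's exact Pass@3 protocol.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k draws without replacement from n attempts,
    of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Worked example: 2 correct out of 8 attempts, evaluated at k = 3:
# 1 - C(6,3)/C(8,3) = 1 - 20/56 ≈ 0.643
print(pass_at_k(8, 2, 3))
```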
read the original abstract

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Soohak, a 439-problem benchmark for evaluating research-level mathematical capabilities of LLMs, curated from scratch by 64 mathematicians. It consists of a Challenge subset and a refusal subset. Frontier models achieve 30.4% (Gemini-3-Pro), 26.4% (GPT-5), and 10.4% (Claude-Opus-4.5) on the Challenge subset, with open models below 15%, while no model exceeds 50% on the refusal subset. The authors argue this leaves substantial headroom and identifies refusal as a new optimization target. The dataset will be released in late 2026 to prevent contamination, with evaluations available upon request.

Significance. Should the benchmark problems prove to be original, uncontaminated, and genuinely research-level as claimed, Soohak would represent a significant contribution by providing a scalable alternative to smaller benchmarks like Riemann Bench and FrontierMath-Tier 4. The explicit focus on refusal capabilities addresses an important gap in current LLM evaluation, as recognizing ill-posed problems is intrinsic to research mathematics. The mathematician curation by 64 experts is a strength that enhances credibility.

major comments (2)
  1. [Abstract] The manuscript does not detail the evaluation protocol, the process by which the 64 mathematicians verified the problems as research-level, or any statistical measures of significance for the reported accuracies (e.g., 30.4% on Challenge). These omissions are load-bearing because the central claims about model performance and headroom depend on the reliability of these numbers.
  2. [Abstract] Withholding the full 439-problem dataset until late 2026 prevents independent verification that the problems are original and uncontaminated by training data, and that the refusal subset contains genuinely ill-posed questions. This makes the empirical results non-reproducible in the current manuscript and weakens the assertion of a new optimization target.
minor comments (1)
  1. Consider including one or two example problems from each subset in the main text to better illustrate the distinction between olympiad-style and research-level problems.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and have revised the paper accordingly to strengthen the presentation of the evaluation details and reproducibility measures.

read point-by-point responses
  1. Referee: [Abstract] The manuscript does not detail the evaluation protocol, the process by which the 64 mathematicians verified the problems as research-level, or any statistical measures of significance for the reported accuracies (e.g., 30.4% on Challenge). These omissions are load-bearing because the central claims about model performance and headroom depend on the reliability of these numbers.

    Authors: We agree that additional methodological details are necessary to support the reliability of the reported results. In the revised manuscript, we have added a new subsection under Methods that describes the evaluation protocol in full, including the multi-stage verification process used by the 64 mathematicians to confirm that problems are research-level (involving independent review, discussion of mathematical novelty, and consensus). We also now include 95% confidence intervals computed via bootstrap resampling (a sketch of such a bootstrap follows this exchange) for all accuracy figures on the Challenge subset to quantify statistical significance and variability. revision: yes

  2. Referee: [Abstract] Withholding the full 439-problem dataset until late 2026 prevents independent verification that the problems are original and uncontaminated by training data, and that the refusal subset contains genuinely ill-posed questions. This makes the empirical results non-reproducible in the current manuscript and weakens the assertion of a new optimization target.

    Authors: We recognize the tension between long-term benchmark integrity and immediate reproducibility. The delayed public release is explicitly motivated by the need to minimize contamination risk from rapidly evolving training corpora, a practice adopted by other recent math benchmarks. To address this in the interim, the revised manuscript now provides: (1) a detailed account of the curation workflow that establishes originality (each problem authored by a mathematician with no prior public dissemination), (2) anonymized examples from the refusal subset illustrating ill-posedness, and (3) an explicit invitation for researchers to request the full evaluation logs and problem statements under a data-use agreement. We maintain that these steps allow the core empirical claims to be evaluated while preserving the benchmark's future utility. revision: partial
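For the 95% confidence intervals mentioned in response 1, a percentile bootstrap over per-problem correctness is one plausible implementation. The sketch below assumes correctness is available as a 0/1 vector per model and subset; it is not the authors' code, and the example counts are hypothetical.

```python
# Percentile bootstrap for a 95% CI on benchmark accuracy, assuming per-problem
# correctness is recorded as a 0/1 vector. A plausible reading of the rebuttal's
# procedure, not the authors' implementation.
import random

def bootstrap_ci(correct: list[int], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over a 0/1 correctness vector."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical usage: 38 of 125 challenge problems solved (about 30.4%);
# the actual subset sizes are not reported on this page.
if __name__ == "__main__":
    correct = [1] * 38 + [0] * 87
    print(bootstrap_ci(correct))
```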

Circularity Check

0 steps flagged

No circularity; empirical benchmark with no derivations or self-referential reductions

full rationale

The paper introduces Soohak as a new 439-problem benchmark authored from scratch by 64 mathematicians and reports direct empirical accuracies (e.g., 30.4% on the Challenge subset for Gemini-3-Pro). No equations, fitted parameters, predictions derived from inputs, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the text. The central claims rest on the stated originality of the problems and the withheld dataset's future release rather than any internal derivation chain that reduces to its own inputs by construction. This is a standard empirical benchmark paper with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivation, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5911 in / 1066 out tokens · 30929 ms · 2026-05-12T01:49:00.166940+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 11 internal anchors

  1. [1] Abouzaid, M., Blumberg, A. J., Hairer, M., Kileel, J., Kolda, T. G., Nelson, P. D., Spielman, D., Srivastava, N., Ward, R., Weinberger, S., et al. (2026). First proof. arXiv preprint arXiv:2602.05192.

  2. [2] Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R. K., Bai, Y., Baker, B., Bao, H., et al. (2025). gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.

  3. [3] Alexeev, B., Putterman, M., Sawhney, M., Sellke, M., and Valiant, G. (2026a). Short proofs in combinatorics and number theory. arXiv preprint arXiv:2603.29961.

  4. [4] Alexeev, B., Putterman, M., Sawhney, M., Sellke, M., and Valiant, G. (2026b). Short proofs in combinatorics, probability and number theory II. arXiv preprint arXiv:2604.06609.

  5. [5] An, S., Cai, X., Cao, X., Li, X., Lin, Y., Liu, J., Lv, X., Ma, D., Wang, X., Wang, Z., and Zhou, S. (2025). AMO-Bench: Large language models still struggle in high school math competitions. arXiv preprint arXiv:2510.26768.

  6. [6] Anthropic (2025). Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5. Accessed: 2026-05-04.

  7. [7] Art of Problem Solving (2025). American Invitational Mathematics Examination (AIME). https://artofproblemsolving.com/wiki/index.php/AIME. Accessed: 2026-01-24.

  8. [8] Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., and Vechev, M. (2025). MathArena: Evaluating LLMs on uncontaminated math competitions.

  9. [9] Burnham, G. (2025). Less than 70% of FrontierMath is within reach for today's models. Epoch AI, Gradient Updates. Accessed: 2026-02-24.

  10. [10] ByteDance-Seed (2025). BeyondAIME: Advancing math reasoning evaluation beyond high school olympiads. https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME.

  11. [11] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  12. [12] Feng, T., Trinh, T., Bingham, G., Kang, J., Zhang, S., Kim, S.-h., Barreto, K., Schildkraut, C., Jung, J., Seo, J., et al. (2026). Semi-autonomous mathematics discovery with Gemini: A case study on the Erdős problems. arXiv preprint arXiv:2601.22401.

  13. [13] Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Xu, R., Tang, Z., Wang, B., Zan, D., Quan, S., Zhang, G., Sha, L., Zhang, Y., Ren, X., Liu, T., and Chang, B. (2025). Omni-MATH: A universal olympiad level mathematic benchmark for large language models. In The Thirteenth International Conference on Learning Representations.

  14. [14] Garre, S., Knutsen, E., Mehta, S., and Chen, E. (2026). Riemann-Bench: A benchmark for moonshot mathematics. arXiv preprint arXiv:2604.06802.

  15. [15] Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., Santos, E. d. O., et al. (2024). FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872.

  16. [16] Google DeepMind (2026). Gemini 3.1 Pro. https://deepmind.google/models/gemini/pro/. Accessed: 2026-05-04.

  17. [17] Guha, E., Marten, R., Keh, S., Raoof, N., Smyrnis, G., Bansal, H., Nezhurina, M., Mercat, J., Vu, T., Sprague, Z., et al. (2025). OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.

  18. [18] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  19. [19] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

  20. [20] HMMT (2025). HMMT. https://www.hmmt.org/. Accessed: 2026.

  21. [21] Ko, H., Son, G., and Choi, D. (2025). Understand, solve and translate: Bridging the multilingual mathematical reasoning gap. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 78–95.

  22. [22] Ma, J., Wang, G., Feng, X., Liu, Y., Hu, Z., and Liu, Y. (2026). EternalMath: A living benchmark of frontier mathematics that evolves with human discovery. arXiv preprint arXiv:2601.01400.

  23. [23] Ministry of Science and ICT (MSIT) (2025). "Proprietary AI foundation model" project enters full-scale launch. Accessed: 2026-02-15.

  24. [24] Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., Zhang, C. B. C., Shaaban, M., Ling, J., Shi, S., et al. (2025). Humanity's Last Exam. arXiv preprint arXiv:2501.14249.

  25. [25] Schmitt, J., Bérczi, G., Dekoninck, J., Feusi, J., Gehrunger, T., Appenzeller, R., Bryan, J., Canova, N., de Wolff, T., Gaia, F., et al. (2025). ImProofBench: Benchmarking AI on research-level mathematical proof generation. arXiv preprint arXiv:2509.26076.

  26. [26] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. (2025). OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.

  27. [27] Skarlinski, M., Laurent, J., Bou, A., and White, A. (2025). About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong. FutureHouse, Research Announcement. Accessed: 2026-02-24.

  28. [28] Stump, C. (2025). Math ScienceBench: Challenge the newest AI models with your hardest PhD-level exercises. https://math.science-bench.ai/. Accessed: 2026-02.

  29. [29] Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S., Cao, Y., Charles, Y., Che, H., Chen, C., Chen, G., et al. (2026). Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276.

  30. [30] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  31. [31] Z.ai (2026). GLM-5.1: Towards long-horizon tasks. https://z.ai/blog/glm-5.1. Accessed: 2026-05-04.

  32. [32] Zhai, W., Wang, Z., Wang, J., Yang, B., Li, X., Xu, X., Wang, B., Wang, P., Wu, X., Li, A., et al. (2026). HLE-Verified: A systematic verification and structured revision of Humanity's Last Exam. arXiv preprint arXiv:2602.13964.

  33. [33] Zhang, J., Petrui, C., Nikolić, K., and Tramèr, F. (2025). RealMath: A continuous benchmark for evaluating language models on research-level mathematics. arXiv preprint arXiv:2505.12575.