arXiv: 2509.26574 · v4 · submitted 2025-09-30 · 💻 cs.AI · cond-mat.other · cs.CL · hep-th · quant-ph

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Pith reviewed 2026-05-18 11:45 UTC · model grok-4.3

classification: 💻 cs.AI · cond-mat.other · cs.CL · hep-th · quant-ph
keywords: AI reasoning · physics benchmark · large language models · frontier research · LLM evaluation · scientific discovery · research challenges · physics subfields

The pith

The best base large language model reaches only 5.7 percent average accuracy on full-scale frontier physics research challenges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the CritPt benchmark to test whether current LLMs can handle the kind of complex, open-ended reasoning required in actual physics research. The benchmark includes 71 new challenges created by working physicists across many subfields, each broken down into simpler checkpoints. Results show that even the best base model averages only 5.7 percent accuracy on the full challenges, rising to around 10 percent with coding tools. This matters because it reveals a gap between AI performance on math contests and its ability to contribute to real scientific discovery. If the result holds, significant advances are needed before AI can reliably assist physicists with novel research problems.

Core claim

The paper establishes that while LLMs demonstrate early promise on isolated reasoning checkpoints drawn from physics research, they fall short when faced with complete, composite research-scale challenges. Specifically, the highest average accuracy on the 71 full challenges is 5.7 percent for the top base model, increasing to around 10 percent when coding tools are provided. This evaluation uses problems newly created by over 50 active physicists to ensure they reflect real demands and are verifiable by machine.

What carries the argument

The CritPt benchmark, a collection of 71 composite research challenges and 190 checkpoints designed to simulate entry-level full-scale physics research projects.

If this is right

  • LLMs equipped with coding tools show moderate improvement but remain unreliable on complete research challenges.
  • The benchmark spans condensed matter, quantum physics, astrophysics, high energy physics and other subfields to give a broad view of current capabilities.
  • Automated grading pipelines customized for advanced physics output formats enable consistent, scalable evaluation of research-style answers.
  • The results provide a standardized foundation for tracking whether future models can assist with realistic scientific projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks could be built for other scientific domains to test whether the same performance gap appears outside physics.
  • Training approaches that emphasize competition-style problems may need supplementation with research-like tasks to improve outcomes.
  • Integrating LLMs with domain-specific simulation or literature tools might help close the gap on challenges involving nonlinear dynamics or biophysics.

Load-bearing premise

The 71 composite challenges, newly created by active physicists and hand-curated to be guess-resistant and machine-verifiable, accurately represent the demands of frontier physics research.

What would settle it

If a future model achieves over 50 percent accuracy on the full set of 71 composite challenges without external tools, this would indicate the claimed large disconnect has been closed.

Figures

Figures reproduced from arXiv: 2509.26574 by Aaron W. Young, Allen Zang, Anjun Chu, Anqi Mu, Chang Liu, Chi Xue, Chris Akers, Christopher Wilson, Daniel Inafuku, Di Luo, Dmytro Bandak, Eli Chertkov, Eliu Huerta, Farshid Jafarpour, Hao Peng, Hao Tong, Indranil Das, Jarrod T. Reilly, Jessie Shelton, Jiabin Yu, Jinchao Zhao, Jinchen He, John Bartolotta, John Drew Wilson, Juntai Zhou, Junyi Cao, Kevin Zhou, Liang Yang, Lifan Yuan, Luyu Gao, Marvin Qi, Minhui Zhu, Minyang Tian, Nathan Brooks, Nicolas Chia, Ofir Press, Peixue Wu, Peizhi Mai, Penghao Zhu, Qingzhi Chen, Shengyan Liu, Tianci Zhou, Victor Colussi, Weizhen Jia, Wenbo Fu, Wenchao Xu, Xiaocheng Yang, Xinan Chen, Xuefei Guo, Xueying Wang, Yaïr Hein, Yang Lyu, Yifan Su, Yikun Jiang, Yonatan Kahn, Yong Zhao, Yubo Yang, Yue Sun, Yufeng Du, Yujie Zhang, Yunkai Wang, Ze-Min Huang, Ze Yang, Ziming Ji.

Figure 1. CritPt’s challenges (left) and checkpoints (right) cover three flavors of physics research – theoretical, experimental, and computational – encountered by physics researchers. As shown in …
Figure 2. A schematic overview of the two-step generation process and the grading system.
Figure 3. A comparison of 10 models’ performance on 70 test …
Figure 4. Schematic of the two experimental setups for evaluating sequential checkpoints within a multi-turn …
Figure 5. A comparison of 10 models’ performance on the 187 test …
Figure 6. Percentage of CritPt problems consistently solved by models. A problem is considered consistently solved if at least four out of five independent runs yield the correct final answer. (a) Percentage of challenges consistently solved. (b) Percentage of checkpoints consistently solved.
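The Figure 6 caption defines a problem as consistently solved when at least four of five independent runs give the correct final answer. The sketch below contrasts that criterion with a plain average accuracy. It is only an illustration: the data layout, the function names, the toy results, and the assumption that average accuracy is the mean of per-problem success rates are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical layout: problem id -> correctness of each independent run.
RunResults = Dict[str, List[bool]]


def consistently_solved_rate(results: RunResults, min_correct: int = 4) -> float:
    """Fraction of problems with at least `min_correct` correct runs."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(1 for runs in results.values() if sum(runs) >= min_correct)
    return solved / len(results)


def average_accuracy(results: RunResults) -> float:
    """Mean of per-problem success rates (an assumed definition)."""
    if not results:
        raise ValueError("results must not be empty")
    per_problem = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_problem) / len(per_problem)


# Made-up outcomes for four problems, five runs each (not real benchmark data).
toy_results: RunResults = {
    "challenge_a": [True, True, True, True, False],     # 4 of 5 correct
    "challenge_b": [True, False, False, True, False],   # 2 of 5 correct
    "challenge_c": [False, False, False, False, False], # never correct
    "challenge_d": [True, True, True, True, True],      # always correct
}

print(consistently_solved_rate(toy_results))
print(average_accuracy(toy_results))
```

In this toy data, a problem solved in two of five runs raises the average accuracy but does not count as consistently solved, which is why the two metrics can diverge.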
Original abstract

While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the CritPt benchmark for testing LLMs on unpublished, research-level physics reasoning tasks across subfields including condensed matter, quantum physics, astrophysics, high energy physics, and others. It comprises 71 composite challenges created by 50+ active physicists from their own research, decomposed into 190 simpler checkpoint tasks. All problems are hand-curated for guess-resistant, machine-verifiable answers and evaluated via a customized automated grading pipeline for advanced physics output formats. The central empirical result is that current SOTA LLMs show limited performance on full challenges, with the best base-model average accuracy at 5.7% (GPT-5 high) and rising only to around 10% when equipped with coding tools, indicating a substantial gap between model capabilities and realistic frontier physics research demands.

Significance. If the 71 challenges prove representative of entry-level frontier research and the grading pipeline is robust, this benchmark would offer a valuable, contamination-resistant tool for measuring progress toward AI-assisted physics research. The involvement of active physicists in problem creation and the focus on composite, open-ended tasks rather than isolated math or coding problems represent a clear advance over existing benchmarks. However, the absence of the full manuscript prevents confirmation of these strengths or assessment of whether the reported accuracies genuinely reflect capability limits.

major comments (2)
  1. [Abstract] The central claim that LLMs 'remain far from being able to reliably solve full research-scale challenges' rests on the reported 5.7% and ~10% accuracies. These figures depend entirely on the physics-specific automated grading pipeline and the hand-curation criteria for guess-resistance and machine-verifiability, none of which are described or exemplified in the provided text. Without these details it is impossible to determine whether the low scores arise from model limitations or from benchmark construction choices.
  2. [Abstract] The assertion that the benchmark 'broadly covers modern physics research areas', together with the implicit premise that the 71 composite challenges accurately represent the demands of frontier physics research, is load-bearing for the benchmark's claimed utility. The text provides no information on curation criteria, subfield balance, difficulty calibration, or how the 190 checkpoints were derived, leaving open the possibility that performance gaps reflect selection artifacts rather than general research demands.
minor comments (1)
  1. [Abstract] The parenthetical expansion of the acronym CritPt ('Complex Research using Integrated Thinking - Physics Test') should be verified for consistency with the title phrasing 'Probing the Critical Point (CritPt)'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments and for acknowledging the potential value of the CritPt benchmark in testing LLMs on realistic frontier physics tasks. We note that the review was performed on the abstract alone, as the full manuscript (which details the grading pipeline, curation process, and problem construction) was not provided to the referee. This explains the absence of methodological specifics in the reviewed text. We address the major comments point by point below and are open to revisions that improve clarity without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract] The central claim that LLMs 'remain far from being able to reliably solve full research-scale challenges' rests on the reported 5.7% and ~10% accuracies. These figures depend entirely on the physics-specific automated grading pipeline and the hand-curation criteria for guess-resistance and machine-verifiability, none of which are described or exemplified in the provided text. Without these details it is impossible to determine whether the low scores arise from model limitations or from benchmark construction choices.

    Authors: We agree the abstract lacks these details due to length constraints. The full manuscript includes a dedicated methods section describing the customized automated grading pipeline for advanced physics output formats (e.g., handling symbolic expressions, diagrams, and multi-step derivations) and the hand-curation criteria applied by active physicists to ensure guess-resistance and machine-verifiability. These design choices were deliberate to produce a contamination-resistant benchmark focused on verifiable research outputs rather than open-ended generation. We will revise the abstract to include a concise reference to the pipeline and curation approach.

    revision: partial

  2. Referee: [Abstract] The assertion that the benchmark 'broadly covers modern physics research areas', together with the implicit premise that the 71 composite challenges accurately represent the demands of frontier physics research, is load-bearing for the benchmark's claimed utility. The text provides no information on curation criteria, subfield balance, difficulty calibration, or how the 190 checkpoints were derived, leaving open the possibility that performance gaps reflect selection artifacts rather than general research demands.

    Authors: The full manuscript provides these details: the 71 challenges were created by 50+ active physicists directly from their unpublished research, decomposed into 190 checkpoints for granular evaluation, and selected to span entry-level frontier work across condensed matter, quantum physics, astrophysics, high energy physics, and the other listed subfields with attention to balance and realistic composite structure. Difficulty was calibrated by the contributing researchers to reflect actual research demands rather than artificial selection. We will add a brief summary of subfield distribution and curation criteria to the abstract or include a supporting table in revisions.

    revision: yes
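The exchange above turns on what a "guess-resistant and machine-verifiable" answer looks like in practice, which the abstract does not spell out. As a rough illustration only, the sketch below grades a multi-component numeric answer against a reference within a relative tolerance. The function name `grade_numeric`, the tolerance values, and the all-components-must-match rule are assumptions made here; this is not the paper's grading pipeline, which reportedly also handles other physics-specific output formats.

```python
import math
from typing import Sequence


def grade_numeric(
    submitted: Sequence[float],
    reference: Sequence[float],
    rel_tol: float = 1e-3,   # assumed tolerance, not from the paper
    abs_tol: float = 1e-12,  # guards comparisons against a reference of 0
) -> bool:
    """Return True only if every component matches the reference."""
    # A wrong number of components is graded as incorrect, not as an error.
    if len(submitted) != len(reference):
        return False
    return all(
        math.isclose(s, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for s, r in zip(submitted, reference)
    )


# Hypothetical two-component answer, e.g. two coefficients of a derived result.
reference_answer = [1.0, 2.0]

print(grade_numeric([1.0005, 2.0], reference_answer))  # within tolerance
print(grade_numeric([1.01, 2.0], reference_answer))    # first component off
print(grade_numeric([1.0], reference_answer))          # missing a component
```

Requiring every component of a multi-part answer to match is one simple way to make lucky guesses unlikely, at the cost of giving no partial credit.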

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent problem creation and direct accuracy measurement

full rationale

The paper introduces a new benchmark of 71 hand-curated research challenges created by 50+ physicists from their own work, decomposed into 190 checkpoints, and evaluates LLMs via an automated grading pipeline. The central result (5.7% average accuracy for the best base model) is a direct empirical measurement on this freshly constructed test set. No derivations, equations, fitted parameters, or predictions are presented that could reduce to inputs by construction. No self-citations or uniqueness theorems are invoked in the available text. The evaluation is self-contained against the stated benchmark; representativeness is an external validity question, not a circularity issue within the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; problem selection and curation choices function as implicit free parameters, while the assumption that the selected tasks represent frontier research acts as a domain assumption.

pith-pipeline@v0.9.0 · 6080 in / 1128 out tokens · 36055 ms · 2026-05-18T11:45:54.292365+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  2. ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

    cs.CR 2026-05 conditional novelty 6.0

    Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

  3. Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

    physics.comp-ph 2026-03 unverdicted novelty 6.0

    QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...

  4. From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

    cs.SE 2026-04 unverdicted novelty 5.0

    Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 4 Pith papers · 14 internal anchors

  1. [1]

    P. W. Anderson. More is different: broken symmetry and the nature of the hierarchical structure of science.Science, 177(4047):393–396, 1972

  2. [2]

    Sinatra, P

    R. Sinatra, P. Deville, M. Szell, D. Wang, and A.-L. Barabási. A century of physics.Nature Physics, 11(10):791–796, 2015

  3. [3]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  4. [4]

    Devlin, M.-W

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. pages, 4171–4186, 2019

  5. [5]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  6. [6]

    F. M. Delgado-Chaves, M. J. Jennings, A. Atalaia, J. Wolff, R. Horvath, Z. M. Mamdouh, J. Baumbach, and L. Baumbach. Transforming literature screening: The emerging role of large language models in systematic reviews.Proceedings of the National Academy of Sciences, 122(2): e2411962122, 2025

  7. [7]

    Scherbakov, N

    D. Scherbakov, N. Hubig, V . Jansari, A. Bakumenko, and L. A. Lenert. The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review. Journal of the American Medical Informatics Association, 32(6):1071–1086, 2025

  8. [8]

    Pramanick, R

    S. Pramanick, R. Chellappa, and S. Venugopalan. SPIQA: A dataset for multimodal question answering on scientific papers.Advances in Neural Information Processing Systems, 37:118807– 118833, 2024

  9. [9]

    T. Gao, H. Yen, J. Yu, and D. Chen. Enabling large language models to generate text with citations. InThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  10. [10]

    Y . Wang, Q. Guo, W. Yao, H. Zhang, X. Zhang, Z. Wu, M. Zhang, X. Dai, Q. Wen, W. Ye, et al. Autosurvey: Large Language Models can automatically write surveys.Advances in neural information processing systems, 37:115119–115145, 2024

  11. [11]

    A. Asai, J. He, R. Shao, W. Shi, A. Singh, J. C. Chang, K. Lo, L. Soldaini, S. Feldman, M. D’arcy, et al. Openscholar: Synthesizing scientific literature with retrieval-augmented LMs.arXiv preprint arXiv:2411.14199, 2024

  12. [12]

    M. D. Skarlinski, S. Cox, J. M. Laurent, J. D. Braza, M. Hinks, M. J. Hammerling, M. Ponnapati, S. G. Rodriques, and A. D. White. Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

  13. [13]

    H. Cui, Z. Shamsi, G. Cheon, X. Ma, S. Li, M. Tikhanovskaya, P. C. Norgaard, N. Mudur, M. B. Plomecka, P. Raccuglia, et al. CURIE: evaluating LLMs on multitask scientific long- context understanding and reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  14. [14]

    Introducing GPT-5, 2025

    OpenAI. Introducing GPT-5, 2025. https://openai.com/index/introducing-gpt-5/

  15. [15]

    Introducing OpenAI o3 and o4-mini, 2025

    OpenAI. Introducing OpenAI o3 and o4-mini, 2025. https://openai.com/index/introducing-o3- and-o4-mini/

  16. [16]

    Gemini 2.5: Our most intelligent AI model, 2025

    Google. Gemini 2.5: Our most intelligent AI model, 2025. https://blog.google/technology/google- deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

  17. [17]

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081): 633–638, 2025

  18. [18]

    Introducing Claude 4, 2025

    Anthropic. Introducing Claude 4, 2025. https://www.anthropic.com/news/claude-4. 24

  19. [19]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024

  20. [20]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025

    Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

  21. [21]

    OpenAI o1 System Card

    A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. OpenAI o1 system card.arXiv preprint arXiv:2412.16720, 2024

  22. [22]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced rea- soning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  23. [23]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  24. [24]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  25. [25]

    Schick, J

    T. Schick, J. Dwivedi-Yu, R. Dessí, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Can- cedda, and T. Scialom. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

  26. [26]

    X. Wang, Y . Chen, L. Yuan, Y . Zhang, Y . Li, H. Peng, and H. Ji. Executable code actions elicit better LLM agents. InProceedings of the International Conference on Machine Learning, 2024

  27. [27]

    L. Yuan, Y . Chen, X. Wang, Y . Fung, H. Peng, and H. Ji. CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets. InThe Twelfth International Conference on Learning Representations, 2024

  28. [28]

    Lewis, E

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InProceedings of the 34th International Conference on Neural Information Processing Systems, 2020

  29. [29]

    X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

  30. [30]

    C. V . Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  31. [31]

    Hazra, G

    R. Hazra, G. Venturato, P. Z. Dos Martires, and L. De Raedt. Have large language models learned to reason? a characterization via 3-sat. InSecond Conference on Language Modeling, 2025

  32. [32]

    Gandhi, A

    K. Gandhi, A. K. Chakravarthy, A. Singh, N. Lile, and N. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STaRs. InSecond Conference on Language Modeling, 2025

  33. [33]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  34. [34]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  35. [35]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InThe Twelfth International Conference on Learning Representations, 2024. 25

  36. [36]

    Advanced version of Gemini with deep think officially achieves gold-medal standard at the International Mathematical Olympiad, 2025

    Google DeepMind. Advanced version of Gemini with deep think officially achieves gold-medal standard at the International Mathematical Olympiad, 2025

  37. [37]

    Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

    A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Dohan, F. Song, H. Lightman, I. Clavera, J. Pachocki, et al. Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

  38. [38]

    MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    M. Balunovi´c, J. Dekoninck, I. Petrov, N. Jovanovi´c, and M. Vechev. MathArena: Evaluating llms on uncontaminated math competitions.arXiv preprint arXiv:2505.23281, 2025

  39. [39]

    C. He, R. Luo, Y . Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y . Huang, Y . Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. pages, 3828–3850, 2024

  40. [40]

    N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025

  41. [41]

    S. Qiu, S. Guo, Z.-Y . Song, Y . Sun, Z. Cai, J. Wei, T. Luo, Y . Yin, H. Zhang, Y . Hu, et al. PHYBench: Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025

  42. [42]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  43. [43]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Mea- suring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

  44. [44]

    Y . Wang, X. Ma, G. Zhang, Y . Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  45. [45]

    X. Wang, Z. Hu, P. Lu, Y . Zhu, J. Zhang, S. Subramaniam, A. R. Loomba, S. Zhang, Y . Sun, and W. Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. pages, 50622–50649. PMLR, 2024

  46. [46]

    X. Xu, Q. Xu, T. Xiao, T. Chen, Y . Yan, J. ZHANG, S. Diao, C. Yang, and Y . Wang. UGPhysics: A comprehensive benchmark for undergraduate physics reasoning with large language models. In Forty-second International Conference on Machine Learning, 2025

  47. [47]

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024

  48. [48]

    M. Tian, L. Gao, S. Zhang, X. Chen, C. Fan, X. Guo, R. Haas, P. Ji, K. Krongchon, Y . Li, et al. Scicode: A research coding benchmark curated by scientists.Advances in Neural Information Processing Systems, 37:30624–30650, 2024

  49. [49]

    FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

    E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gunning, C. F. Olsson, J.-S. Denain, A. Ho, E. d. O. Santos, et al. FrontierMath: a benchmark for evaluating advanced mathematical reasoning in AI.arXiv preprint arXiv:2411.04872, 2024

  50. [50]

    D. J. Chung, Z. Gao, Y . Kvasiuk, T. Li, M. Münchmeyer, M. Rudolph, F. Sala, and S. C. Tadepalli. Theoretical physics benchmark (TPBench)–a dataset and study of AI reasoning capabilities in theoretical physics.arXiv preprint arXiv:2502.15815, 2025

  51. [51]

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam.arXiv preprint arXiv:2501.14249, 2025

  52. [52]

    H. Wang, T. Fu, Y . Du, W. Gao, K. Huang, Z. Liu, P. Chandak, S. Liu, P. Van Katwyk, A. Deac, et al. Scientific discovery in the age of artificial intelligence.Nature, 620(7972):47–60, 2023. 26

  53. [53]

    Z. Wu, L. Qiu, A. Ross, E. Akyürek, B. Chen, B. Wang, N. Kim, J. Andreas, and Y . Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. pages, 1819–1862, 2024

  54. [54]

    Balepur, A

    N. Balepur, A. Ravichander, and R. Rudinger. Artifacts or abduction: How do LLMs answer multiple-choice questions without the question? pages, 10308–10330, 2024

  55. [55]

    C. Deng, Y . Zhao, X. Tang, M. Gerstein, and A. Cohan. Investigating data contamination in modern benchmarks for large language models. pages, 8698–8711, 2024

  56. [56]

    S. Ott, A. Barbosa-Silva, K. Blagec, J. Brauner, and M. Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence.Nature Communications, 13(1):6793, 2022

  57. [57]

    Y . Li, Y . Guo, F. Guerin, and C. Lin. An open-source data contamination report for large language models. pages, 528–541, 2024

  58. [58]

    Balepur, R

    N. Balepur, R. Rudinger, and J. L. Boyd-Graber. Which of these best describes multiple choice evaluation with LLMs? A) forced B) flawed C) fixable D) all of the above. pages, 3394–3418. Association for Computational Linguistics, 2025

  59. [59]

    Dodge, M

    J. Dodge, M. Sap, A. Marasovi ´c, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. pages, 1286–1305, 2021

  60. [60]

    S. Golchin and M. Surdeanu. Time travel in LLMs: Tracing data contamination in large language models. In The Twelfth International Conference on Learning Representations, 2024

  61. [61]

    M. Roberts, H. Thakur, C. Herlihy, C. White, and S. Dooley. To the cutoff... and beyond? A longitudinal perspective on LLM data contamination. In The Twelfth International Conference on Learning Representations, 2023

  62. [62]

    P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, L. Kong, Q. Liu, T. Liu, et al. Large language models are not fair evaluators. Pages 9440–9450, 2024

  63. [63]

    J. Ye, Y. Wang, Y. Huang, D. Chen, Q. Zhang, N. Moniz, T. Gao, W. Geyer, C. Huang, P.-Y. Chen, et al. Justice or prejudice? Quantifying biases in LLM-as-a-judge. In International Conference on Learning Representations, 2025

  64. [64]

    M. T. R. Laskar, S. Alqahtani, M. S. Bari, M. Rahman, M. A. M. Khan, H. Khan, I. Jahan, A. Bhuiyan, C. W. Tan, M. R. Parvez, E. Hoque, S. Joty, and J. Huang. A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations. Pages 13785–13816. Association for Computational Linguistics, 2024

  65. [65]

    A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, Kumar, et al. SymPy: symbolic computing in Python. PeerJ Computer Science, 3:e103, 2017

  66. [66]

    PhySH – Physics Subject Headings. https://physh.org/about. Accessed: August 18, 2025

  67. [67]

    A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  68. [68]

    D. Gottesman. Quantum fault tolerance in small experiments. arXiv preprint arXiv:1610.03507, 2016

  69. [69]

    C. Vuillot. Is error detection helpful on IBM 5Q chips? Quantum Inf. Comput., 18(11):0949, 2017

  70. [70]

    N. M. Linke, M. Gutierrez, K. A. Landsman, C. Figgatt, S. Debnath, K. R. Brown, and C. Monroe. Fault-tolerant quantum error detection. Sci. Adv., 3(10):e1701074, 2017

  71. [71]

    R. Harper and S. T. Flammia. Fault-tolerant logical gates in the IBM quantum experience. Phys. Rev. Lett., 122:080504, 2019

  72. [72]

    P. Komar, E. M. Kessler, M. Bishof, L. Jiang, A. S. Sørensen, J. Ye, and M. D. Lukin. A quantum network of clocks. Nature Physics, 10(8):582–587, 2014

  73. [73]

    Z. Zhang and Q. Zhuang. Distributed quantum sensing. Quantum Science and Technology, 6(4):043001, 2021

  74. [74]

    A. Zang, A. Kolar, A. Gonzales, J. Chung, S. K. Gray, R. Kettimuthu, T. Zhong, and Z. H. Saleem. Quantum advantage in distributed sensing with noisy quantum networks. arXiv preprint arXiv:2409.17089, 2024

  75. [75]

    A. Zang, T.-X. Zheng, P. C. Maurer, F. T. Chong, M. Suchara, and T. Zhong. Enhancing noisy quantum sensing by GHZ state partitioning. arXiv preprint arXiv:2507.02829, 2025

  76. [76]

    A. H. Guth. Inflationary universe: A possible solution to the horizon and flatness problems. Phys. Rev. D, 23:347–356, 1981

  77. [77]

    A. Linde. A new inflationary universe scenario: A possible solution of the horizon, flatness, homogeneity, isotropy and primordial monopole problems. Physics Letters B, 108(6):389–393, 1982

  78. [78]

    A. Albrecht and P. J. Steinhardt. Cosmology for grand unified theories with radiatively induced symmetry breaking. Phys. Rev. Lett., 48:1220–1223, 1982

  79. [79]

    A. Linde. Chaotic inflation. Physics Letters B, 129(3):177–181, 1983

  80. [80]

    K. Freese, J. A. Frieman, and A. V. Olinto. Natural inflation with pseudo Nambu-Goldstone bosons. Phys. Rev. Lett., 65:3233–3236, 1990
