Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Aaron W. Young; Allen Zang; Anjun Chu; Anqi Mu; Chang Liu; Chi Xue; Chris Akers; Christopher Wilson; Daniel Inafuku; Di Luo

arXiv: 2509.26574 · v4 · submitted 2025-09-30 · 💻 cs.AI · cond-mat.other · cs.CL · hep-th · quant-ph


Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Lifan Yuan, Penghao Zhu, Eli Chertkov, Shengyan Liu


Yufeng Du, Ziming Ji, Indranil Das, Qingzhi Chen, Junyi Cao, Jiabin Yu, Peixue Wu, Jinchen He, Yifan Su, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Yunkai Wang, Farshid Jafarpour, Yong Zhao, Xinan Chen, Jessie Shelton, Aaron W. Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Christopher Wilson, Xuefei Guo, Juntai Zhou, Daniel Inafuku, Chi Xue, Luyu Gao, Ze Yang, Yaïr Hein, Yonatan Kahn, Kevin Zhou, Di Luo, John Drew Wilson, Jarrod T. Reilly, Dmytro Bandak, Ofir Press, Liang Yang, Xueying Wang, Hao Tong, Nicolas Chia, Eliu Huerta, Hao Peng


Pith reviewed 2026-05-18 11:45 UTC · model grok-4.3

classification 💻 cs.AI · cond-mat.other · cs.CL · hep-th · quant-ph

keywords AI reasoning · physics benchmark · large language models · frontier research · LLM evaluation · scientific discovery · research challenges · physics subfields


The pith

Even the best base large language model reaches only 5.7 percent average accuracy on full-scale frontier physics research challenges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the CritPt benchmark to test whether current LLMs can handle the kind of complex, open-ended reasoning required in actual physics research. The benchmark includes 71 new challenges created by working physicists across many subfields, each broken down into simpler checkpoints. Results show that the best base model averages only 5.7 percent on the full challenges, rising to around 10 percent with coding tools. A sympathetic reader would care because this reveals a gap between AI performance on math contests and its ability to contribute to real scientific discovery. If the results hold, significant advances are needed before AI can reliably assist physicists with novel research problems.

Core claim

The paper establishes that while LLMs demonstrate early promise on isolated reasoning checkpoints drawn from physics research, they fall short when faced with complete, composite research-scale challenges. Specifically, the highest average accuracy on the 71 full challenges is 5.7 percent for the top base model, increasing to around 10 percent when coding tools are provided. This evaluation uses problems newly created by over 50 active physicists to ensure they reflect real demands and are verifiable by machine.

What carries the argument

The CritPt benchmark, a collection of 71 composite research challenges and 190 checkpoints designed to simulate entry-level full-scale physics research projects.

If this is right

LLMs equipped with coding tools show moderate improvement but remain unreliable on complete research challenges.
The benchmark spans condensed matter, quantum physics, astrophysics, high energy physics and other subfields to give a broad view of current capabilities.
Automated grading pipelines customized for advanced physics output formats enable consistent, scalable evaluation of research-style answers.
The results provide a standardized foundation for tracking whether future models can assist with realistic scientific projects.
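The text above does not describe how the paper's grading pipeline works internally, so the following is only a minimal sketch of what a machine-verifiable check for a numeric answer might look like. The function names, the 1 percent relative tolerance, and the tuple answer format are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import Sequence


def numeric_match(predicted: float, reference: float,
                  rel_tol: float = 1e-2, abs_tol: float = 1e-12) -> bool:
    """Return True when the prediction agrees with the reference within tolerance.

    The tolerances are hypothetical defaults, not values from the paper.
    """
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=abs_tol)


def grade_tuple(predicted: Sequence[float], reference: Sequence[float],
                rel_tol: float = 1e-2) -> bool:
    """Grade a multi-component answer: every component must match."""
    if len(predicted) != len(reference):
        return False
    return all(numeric_match(p, r, rel_tol) for p, r in zip(predicted, reference))


# Toy usage with made-up numbers.
print(grade_tuple((3.14, 2.0), (3.1416, 2.0)))  # True: both components within 1%
print(grade_tuple((3.14,), (3.1416, 2.0)))      # False: a component is missing
```

Requiring every component of a multi-part answer to match at a tight tolerance is one simple way an answer format can resist guessing, since a lucky guess would have to land on all components at once.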

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar benchmarks could be built for other scientific domains to test whether the same performance gap appears outside physics.
Training approaches that emphasize competition-style problems may need supplementation with research-like tasks to improve outcomes.
Integrating LLMs with domain-specific simulation or literature tools might help close the gap on challenges involving nonlinear dynamics or biophysics.

Load-bearing premise

The 71 composite challenges, newly created by active physicists and hand-curated to be guess-resistant and machine-verifiable, accurately represent the demands of frontier physics research.

What would settle it

If a future model achieves over 50 percent accuracy on the full set of 71 composite challenges without external tools, this would indicate the claimed large disconnect has been closed.

Figures

Figures reproduced from arXiv: 2509.26574 by Aaron W. Young, Allen Zang, Anjun Chu, Anqi Mu, Chang Liu, Chi Xue, Chris Akers, Christopher Wilson, Daniel Inafuku, Di Luo, Dmytro Bandak, Eli Chertkov, Eliu Huerta, Farshid Jafarpour, Hao Peng, Hao Tong, Indranil Das, Jarrod T. Reilly, Jessie Shelton, Jiabin Yu, Jinchao Zhao, Jinchen He, John Bartolotta, John Drew Wilson, Juntai Zhou, Junyi Cao, Kevin Zhou, Liang Yang, Lifan Yuan, Luyu Gao, Marvin Qi, Minhui Zhu, Minyang Tian, Nathan Brooks, Nicolas Chia, Ofir Press, Peixue Wu, Peizhi Mai, Penghao Zhu, Qingzhi Chen, Shengyan Liu, Tianci Zhou, Victor Colussi, Weizhen Jia, Wenbo Fu, Wenchao Xu, Xiaocheng Yang, Xinan Chen, Xuefei Guo, Xueying Wang, Yaïr Hein, Yang Lyu, Yifan Su, Yikun Jiang, Yonatan Kahn, Yong Zhao, Yubo Yang, Yue Sun, Yufeng Du, Yujie Zhang, Yunkai Wang, Ze-Min Huang, Ze Yang, Ziming Ji.

**Figure 1.** CritPt’s challenges (left) and checkpoints (right) cover three flavors of physics research – theoretical, experimental, and computational – encountered by physics researchers. As shown in…

**Figure 2.** A schematic overview of the two-step generation process and the grading system.

**Figure 3.** A comparison of 10 models’ performance on 70 test…

**Figure 4.** Schematic of the two experimental setups for evaluating sequential checkpoints within a multi-turn…

**Figure 5.** A comparison of 10 models’ performance on the 187 test…

**Figure 6.** Percentage of CritPt problems consistently solved by models. A problem is considered consistently solved if at least four out of five independent runs yield the correct final answer. (a) Percentage of challenges consistently solved. (b) Percentage of checkpoints consistently solved.
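The "consistently solved" criterion in the Figure 6 caption (at least four correct answers out of five independent runs) can be sketched alongside a plain average accuracy. This is a minimal illustration, assuming per-run correctness is stored as a list of booleans per problem and that average accuracy is the mean over all problems and runs; the data layout, function names, and toy values are hypothetical rather than taken from the paper.

```python
from typing import Dict, List

# Hypothetical data layout: problem id -> one boolean per independent run.
Runs = Dict[str, List[bool]]


def average_accuracy(runs: Runs) -> float:
    """Mean fraction of correct answers over all problems and runs."""
    total = sum(len(r) for r in runs.values())
    correct = sum(sum(r) for r in runs.values())
    return correct / total if total else 0.0


def consistently_solved_rate(runs: Runs, min_correct: int = 4) -> float:
    """Fraction of problems with at least `min_correct` correct runs."""
    if not runs:
        return 0.0
    solved = sum(1 for r in runs.values() if sum(r) >= min_correct)
    return solved / len(runs)


# Toy example with three made-up problems and five runs each.
toy = {
    "p1": [True, True, True, True, False],    # 4/5 correct: consistently solved
    "p2": [True, False, False, True, False],  # 2/5 correct: not consistent
    "p3": [False] * 5,                        # never solved
}
print(average_accuracy(toy))          # 6 correct out of 15 runs
print(consistently_solved_rate(toy))  # 1 of 3 problems
```

The two numbers can diverge: a model that occasionally gets many problems right scores a nonzero average accuracy while consistently solving few or none of them, which is why the caption's stricter criterion is reported separately.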

Original abstract

While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper introduces a benchmark of real unpublished physics research problems created by active researchers, but the low LLM scores can't be properly judged without seeing the actual tasks and grading details.


The main thing here is a new benchmark called CritPt that uses 71 composite challenges freshly made by over 50 working physicists from their own unpublished research. It reports that even top models like GPT-5 hit only 5.7% average accuracy on the full tasks, rising to about 10% with coding tools, while doing better on the 190 simpler checkpoints. The coverage spans condensed matter, quantum physics, astrophysics, high energy, and several other areas, which is broader than most existing tests that stick to contests or public problems.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the CritPt benchmark for testing LLMs on unpublished, research-level physics reasoning tasks across subfields including condensed matter, quantum physics, astrophysics, high energy physics, and others. It comprises 71 composite challenges created by 50+ active physicists from their own research, decomposed into 190 simpler checkpoint tasks. All problems are hand-curated for guess-resistant, machine-verifiable answers and evaluated via a customized automated grading pipeline for advanced physics output formats. The central empirical result is that current SOTA LLMs show limited performance on full challenges, with the best base-model average accuracy at 5.7% (GPT-5 high) and rising only to around 10% when equipped with coding tools, indicating a substantial gap between model capabilities and realistic frontier physics research demands.

Significance. If the 71 challenges prove representative of entry-level frontier research and the grading pipeline is robust, this benchmark would offer a valuable, contamination-resistant tool for measuring progress toward AI-assisted physics research. The involvement of active physicists in problem creation and the focus on composite, open-ended tasks rather than isolated math or coding problems represent a clear advance over existing benchmarks. However, the absence of the full manuscript prevents confirmation of these strengths or assessment of whether the reported accuracies genuinely reflect capability limits.

major comments (2)

[Abstract] The central claim that LLMs 'remain far from being able to reliably solve full research-scale challenges' rests on the reported 5.7% and ~10% accuracies. These figures depend entirely on the physics-specific automated grading pipeline and the hand-curation criteria for guess-resistance and machine-verifiability, none of which are described or exemplified in the provided text. Without these details it is impossible to determine whether the low scores arise from model limitations or from benchmark construction choices.
[Abstract] The assertion that the benchmark 'broadly covers modern physics research areas', together with the implicit premise that the 71 composite challenges accurately represent the demands of frontier physics research, is load-bearing for the benchmark's claimed utility. The text provides no information on curation criteria, subfield balance, difficulty calibration, or how the 190 checkpoints were derived, leaving open the possibility that performance gaps reflect selection artifacts rather than general research demands.

minor comments (1)

[Abstract] The parenthetical expansion of the acronym CritPt ('Complex Research using Integrated Thinking - Physics Test') should be verified for consistency with the title phrasing 'Probing the Critical Point (CritPt)'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments and for acknowledging the potential value of the CritPt benchmark in testing LLMs on realistic frontier physics tasks. We note that the review was performed on the abstract alone, as the full manuscript (which details the grading pipeline, curation process, and problem construction) was not provided to the referee. This explains the absence of methodological specifics in the reviewed text. We address the major comments point by point below and are open to revisions that improve clarity without altering the core findings.


Referee: [Abstract] The central claim that LLMs 'remain far from being able to reliably solve full research-scale challenges' rests on the reported 5.7% and ~10% accuracies. These figures depend entirely on the physics-specific automated grading pipeline and the hand-curation criteria for guess-resistance and machine-verifiability, none of which are described or exemplified in the provided text. Without these details it is impossible to determine whether the low scores arise from model limitations or from benchmark construction choices.

Authors: We agree the abstract lacks these details due to length constraints. The full manuscript includes a dedicated methods section describing the customized automated grading pipeline for advanced physics output formats (e.g., handling symbolic expressions, diagrams, and multi-step derivations) and the hand-curation criteria applied by active physicists to ensure guess-resistance and machine-verifiability. These design choices were deliberate to produce a contamination-resistant benchmark focused on verifiable research outputs rather than open-ended generation. We will revise the abstract to include a concise reference to the pipeline and curation approach. revision: partial
Referee: [Abstract] The assertion that the benchmark 'broadly covers modern physics research areas', together with the implicit premise that the 71 composite challenges accurately represent the demands of frontier physics research, is load-bearing for the benchmark's claimed utility. The text provides no information on curation criteria, subfield balance, difficulty calibration, or how the 190 checkpoints were derived, leaving open the possibility that performance gaps reflect selection artifacts rather than general research demands.

Authors: The full manuscript provides these details: the 71 challenges were created by 50+ active physicists directly from their unpublished research, decomposed into 190 checkpoints for granular evaluation, and selected to span entry-level frontier work across condensed matter, quantum physics, astrophysics, high energy physics, and the other listed subfields with attention to balance and realistic composite structure. Difficulty was calibrated by the contributing researchers to reflect actual research demands rather than artificial selection. We will add a brief summary of subfield distribution and curation criteria to the abstract or include a supporting table in revisions. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent problem creation and direct accuracy measurement


The paper introduces a new benchmark of 71 hand-curated research challenges created by 50+ physicists from their own work, decomposed into 190 checkpoints, and evaluates LLMs via an automated grading pipeline. The central result (5.7% average accuracy for the best base model) is a direct empirical measurement on this freshly constructed test set. No derivations, equations, fitted parameters, or predictions are presented that could reduce to inputs by construction. No self-citations or uniqueness theorems are invoked in the available text. The evaluation is self-contained against the stated benchmark; representativeness is an external validity question, not a circularity issue within the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; problem selection and curation choices function as implicit free parameters, while the assumption that the selected tasks represent frontier research acts as a domain assumption.

pith-pipeline@v0.9.0 · 6080 in / 1128 out tokens · 36055 ms · 2026-05-18T11:45:54.292365+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation between the paper passage and the cited Recognition theorem: unclear

CritPt consists of 71 composite research challenges... evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation between the paper passage and the cited Recognition theorem: unclear

All problems are newly created by 50+ active physics researchers... hand-curated to admit a guess-resistant and machine-verifiable answer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG · 2026-04 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
cs.CR · 2026-05 · conditional · novelty 6.0

Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
physics.comp-ph · 2026-03 · unverdicted · novelty 6.0

QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
cs.SE · 2026-04 · unverdicted · novelty 5.0

Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 4 Pith papers · 14 internal anchors

[1]

P. W. Anderson. More is different: broken symmetry and the nature of the hierarchical structure of science.Science, 177(4047):393–396, 1972

work page 1972
[2]

Sinatra, P

R. Sinatra, P. Deville, M. Szell, D. Wang, and A.-L. Barabási. A century of physics.Nature Physics, 11(10):791–796, 2015

work page 2015
[3]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017
[4]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. pages, 4171–4186, 2019

work page 2019
[5]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[6]

F. M. Delgado-Chaves, M. J. Jennings, A. Atalaia, J. Wolff, R. Horvath, Z. M. Mamdouh, J. Baumbach, and L. Baumbach. Transforming literature screening: The emerging role of large language models in systematic reviews.Proceedings of the National Academy of Sciences, 122(2): e2411962122, 2025

work page 2025
[7]

Scherbakov, N

D. Scherbakov, N. Hubig, V . Jansari, A. Bakumenko, and L. A. Lenert. The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review. Journal of the American Medical Informatics Association, 32(6):1071–1086, 2025

work page 2025
[8]

Pramanick, R

S. Pramanick, R. Chellappa, and S. Venugopalan. SPIQA: A dataset for multimodal question answering on scientific papers.Advances in Neural Information Processing Systems, 37:118807– 118833, 2024

work page 2024
[9]

T. Gao, H. Yen, J. Yu, and D. Chen. Enabling large language models to generate text with citations. InThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[10]

Y . Wang, Q. Guo, W. Yao, H. Zhang, X. Zhang, Z. Wu, M. Zhang, X. Dai, Q. Wen, W. Ye, et al. Autosurvey: Large Language Models can automatically write surveys.Advances in neural information processing systems, 37:115119–115145, 2024

work page 2024
[11]

A. Asai, J. He, R. Shao, W. Shi, A. Singh, J. C. Chang, K. Lo, L. Soldaini, S. Feldman, M. D’arcy, et al. Openscholar: Synthesizing scientific literature with retrieval-augmented LMs.arXiv preprint arXiv:2411.14199, 2024

work page arXiv 2024
[12]

M. D. Skarlinski, S. Cox, J. M. Laurent, J. D. Braza, M. Hinks, M. J. Hammerling, M. Ponnapati, S. G. Rodriques, and A. D. White. Language agents achieve superhuman synthesis of scientific knowledge.arXiv preprint arXiv:2409.13740, 2024

work page arXiv 2024
[13]

H. Cui, Z. Shamsi, G. Cheon, X. Ma, S. Li, M. Tikhanovskaya, P. C. Norgaard, N. Mudur, M. B. Plomecka, P. Raccuglia, et al. CURIE: evaluating LLMs on multitask scientific long- context understanding and reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[14]

Introducing GPT-5, 2025

OpenAI. Introducing GPT-5, 2025. https://openai.com/index/introducing-gpt-5/

work page 2025
[15]

Introducing OpenAI o3 and o4-mini, 2025

OpenAI. Introducing OpenAI o3 and o4-mini, 2025. https://openai.com/index/introducing-o3- and-o4-mini/

work page 2025
[16]

Gemini 2.5: Our most intelligent AI model, 2025

Google. Gemini 2.5: Our most intelligent AI model, 2025. https://blog.google/technology/google- deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

work page 2025
[17]

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081): 633–638, 2025

work page 2025
[18]

Introducing Claude 4, 2025

Anthropic. Introducing Claude 4, 2025. https://www.anthropic.com/news/claude-4. 24

work page 2025
[19]

GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025

Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

work page 2025
[21]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. OpenAI o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced rea- soning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[25]

Schick, J

T. Schick, J. Dwivedi-Yu, R. Dessí, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Can- cedda, and T. Scialom. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

work page 2023
[26]

X. Wang, Y . Chen, L. Yuan, Y . Zhang, Y . Li, H. Peng, and H. Ji. Executable code actions elicit better LLM agents. InProceedings of the International Conference on Machine Learning, 2024

work page 2024
[27]

L. Yuan, Y . Chen, X. Wang, Y . Fung, H. Peng, and H. Ji. CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[28]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InProceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020
[29]

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[30]

C. V . Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[31]

Hazra, G

R. Hazra, G. Venturato, P. Z. Dos Martires, and L. De Raedt. Have large language models learned to reason? a characterization via 3-sat. InSecond Conference on Language Modeling, 2025

work page 2025
[32]

Gandhi, A

K. Gandhi, A. K. Chakravarthy, A. Singh, N. Lile, and N. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STaRs. InSecond Conference on Language Modeling, 2025

work page 2025
[33]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InThe Twelfth International Conference on Learning Representations, 2024. 25

work page 2024
[36]

Advanced version of Gemini with deep think officially achieves gold-medal standard at the International Mathematical Olympiad, 2025

Google DeepMind. Advanced version of Gemini with deep think officially achieves gold-medal standard at the International Mathematical Olympiad, 2025

work page 2025
[37]

Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

A. El-Kishky, A. Wei, A. Saraiva, B. Minaiev, D. Selsam, D. Dohan, F. Song, H. Lightman, I. Clavera, J. Pachocki, et al. Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

work page arXiv 2025
[38]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

M. Balunovi´c, J. Dekoninck, I. Petrov, N. Jovanovi´c, and M. Vechev. MathArena: Evaluating llms on uncontaminated math competitions.arXiv preprint arXiv:2505.23281, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

C. He, R. Luo, Y . Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, Y . Huang, Y . Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. pages, 3828–3850, 2024

work page 2024
[40]

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[41]

S. Qiu, S. Guo, Z.-Y . Song, Y . Sun, Z. Cai, J. Wei, T. Luo, Y . Yin, H. Zhang, Y . Hu, et al. PHYBench: Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025

work page arXiv 2025
[42]

Hendrycks, C

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

work page 2021
[43]

Hendrycks, C

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Mea- suring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021
[44]

Y . Wang, X. Ma, G. Zhang, Y . Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

work page 2024
[45]

X. Wang, Z. Hu, P. Lu, Y . Zhu, J. Zhang, S. Subramaniam, A. R. Loomba, S. Zhang, Y . Sun, and W. Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. pages, 50622–50649. PMLR, 2024

work page 2024
[46]

X. Xu, Q. Xu, T. Xiao, T. Chen, Y . Yan, J. ZHANG, S. Diao, C. Yang, and Y . Wang. UGPhysics: A comprehensive benchmark for undergraduate physics reasoning with large language models. In Forty-second International Conference on Machine Learning, 2025

work page 2025
[47]

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024

[48]

M. Tian, L. Gao, S. Zhang, X. Chen, C. Fan, X. Guo, R. Haas, P. Ji, K. Krongchon, Y. Li, et al. SciCode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems, 37:30624–30650, 2024

[49]

E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gunning, C. F. Olsson, J.-S. Denain, A. Ho, E. d. O. Santos, et al. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872, 2024

[50]

D. J. Chung, Z. Gao, Y. Kvasiuk, T. Li, M. Münchmeyer, M. Rudolph, F. Sala, and S. C. Tadepalli. Theoretical physics benchmark (TPBench) – a dataset and study of AI reasoning capabilities in theoretical physics. arXiv preprint arXiv:2502.15815, 2025

[51]

L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

[52]

H. Wang, T. Fu, Y. Du, W. Gao, K. Huang, Z. Liu, P. Chandak, S. Liu, P. Van Katwyk, A. Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023

[53]

Z. Wu, L. Qiu, A. Ross, E. Akyürek, B. Chen, B. Wang, N. Kim, J. Andreas, and Y. Kim. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. pages 1819–1862, 2024

[54]

N. Balepur, A. Ravichander, and R. Rudinger. Artifacts or abduction: How do LLMs answer multiple-choice questions without the question? pages 10308–10330, 2024

[55]

C. Deng, Y. Zhao, X. Tang, M. Gerstein, and A. Cohan. Investigating data contamination in modern benchmarks for large language models. pages 8698–8711, 2024

[56]

S. Ott, A. Barbosa-Silva, K. Blagec, J. Brauner, and M. Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 13(1):6793, 2022

[57]

Y. Li, Y. Guo, F. Guerin, and C. Lin. An open-source data contamination report for large language models. pages 528–541, 2024

[58]

N. Balepur, R. Rudinger, and J. L. Boyd-Graber. Which of these best describes multiple choice evaluation with LLMs? A) forced B) flawed C) fixable D) all of the above. pages 3394–3418. Association for Computational Linguistics, 2025

[59]

J. Dodge, M. Sap, A. Marasović, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. pages 1286–1305, 2021

[60]

S. Golchin and M. Surdeanu. Time travel in LLMs: Tracing data contamination in large language models. In The Twelfth International Conference on Learning Representations, 2024

[61]

M. Roberts, H. Thakur, C. Herlihy, C. White, and S. Dooley. To the cutoff... and beyond? A longitudinal perspective on LLM data contamination. In The Twelfth International Conference on Learning Representations, 2023

[62]

P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, L. Kong, Q. Liu, T. Liu, et al. Large language models are not fair evaluators. pages 9440–9450, 2024

[63]

J. Ye, Y. Wang, Y. Huang, D. Chen, Q. Zhang, N. Moniz, T. Gao, W. Geyer, C. Huang, P.-Y. Chen, et al. Justice or prejudice? Quantifying biases in LLM-as-a-judge. In International Conference on Learning Representations, 2025

[64]

M. T. R. Laskar, S. Alqahtani, M. S. Bari, M. Rahman, M. A. M. Khan, H. Khan, I. Jahan, A. Bhuiyan, C. W. Tan, M. R. Parvez, E. Hoque, S. Joty, and J. Huang. A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations. pages 13785–13816. Association for Computational Linguistics, 2024

[65]

A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, et al. SymPy: symbolic computing in Python. PeerJ Computer Science, 3:e103, 2017

[66]

PhySH – Physics Subject Headings. https://physh.org/about. Accessed: August 18, 2025

[67]

A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

[68]

D. Gottesman. Quantum fault tolerance in small experiments. arXiv preprint arXiv:1610.03507, 2016

[69]

C. Vuillot. Is error detection helpful on IBM 5Q chips? Quantum Inf. Comput., 18(11):0949, 2017

[70]

N. M. Linke, M. Gutierrez, K. A. Landsman, C. Figgatt, S. Debnath, K. R. Brown, and C. Monroe. Fault-tolerant quantum error detection. Sci. Adv., 3(10):e1701074, 2017

[71]

R. Harper and S. T. Flammia. Fault-tolerant logical gates in the IBM quantum experience. Phys. Rev. Lett., 122:080504, 2019

[72]

P. Komar, E. M. Kessler, M. Bishof, L. Jiang, A. S. Sørensen, J. Ye, and M. D. Lukin. A quantum network of clocks. Nature Physics, 10(8):582–587, 2014

[73]

Z. Zhang and Q. Zhuang. Distributed quantum sensing. Quantum Science and Technology, 6(4):043001, 2021

[74]

A. Zang, A. Kolar, A. Gonzales, J. Chung, S. K. Gray, R. Kettimuthu, T. Zhong, and Z. H. Saleem. Quantum advantage in distributed sensing with noisy quantum networks. arXiv preprint arXiv:2409.17089, 2024

[75]

A. Zang, T.-X. Zheng, P. C. Maurer, F. T. Chong, M. Suchara, and T. Zhong. Enhancing noisy quantum sensing by GHZ state partitioning. arXiv preprint arXiv:2507.02829, 2025

[76]

A. H. Guth. Inflationary universe: A possible solution to the horizon and flatness problems. Phys. Rev. D, 23:347–356, 1981

[77]

A. Linde. A new inflationary universe scenario: A possible solution of the horizon, flatness, homogeneity, isotropy and primordial monopole problems. Physics Letters B, 108(6):389–393, 1982

[78]

A. Albrecht and P. J. Steinhardt. Cosmology for grand unified theories with radiatively induced symmetry breaking. Phys. Rev. Lett., 48:1220–1223, 1982

[79]

A. Linde. Chaotic inflation. Physics Letters B, 129(3):177–181, 1983

[80]

K. Freese, J. A. Frieman, and A. V. Olinto. Natural inflation with pseudo Nambu-Goldstone bosons. Phys. Rev. Lett., 65:3233–3236, 1990

