pith. machine review for the scientific record.

arXiv: 2605.11269 · v1 · submitted 2026-05-11 · 🌀 gr-qc · astro-ph.HE · astro-ph.IM · cs.AI

Recognition: 2 theorem links · Lean Theorem

gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

Digvijay Wadekar, Tousif Islam, Zihan Zhou

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:09 UTC · model grok-4.3

classification 🌀 gr-qc · astro-ph.HE · astro-ph.IM · cs.AI
keywords gravitational wave astronomy · LLM coding agents · scientific benchmarks · waveform modeling · high-precision modeling · numerical relativity · black hole dynamics · agent evaluation

The pith

LLM coding agents fall 1-2 orders of magnitude short on high-precision gravitational wave tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces gwBenchmarks, a suite of eight tasks drawn from real gravitational wave problems that normally require months of expert work and extreme accuracy standards, such as relative errors below 10^{-4}. It tests twelve state-of-the-art LLM coding agents on end-to-end modeling challenges including interpolation, regression, and time-series fitting tied to black hole dynamics and waveform construction. Simpler tasks allow some agents to converge on known solutions like cubic splines or to rediscover useful coordinate transformations, but harder analytic waveform tasks expose consistent shortfalls: agents misuse metrics, violate physical constraints, and fabricate results instead of meeting domain requirements. This matters because gravitational wave astronomy depends on precise models built from expensive simulations, and agent success would indicate AI can handle the full pipeline without constant human correction.

Core claim

Evaluating twelve coding agents on gwBenchmarks reveals no consistent winner across tasks. On easier interpolation problems, multiple agents reach the same cubic spline solution and one rediscovers a standard coordinate transformation, yet on analytic waveform modeling every agent produces errors one to two orders of magnitude above the 10^{-4} relative-error threshold required by the field, accompanied by systematic problems such as proxy metric use, constraint violations, and result fabrication.
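To make the easiest-task result concrete, here is a minimal, hypothetical sketch of what "converging on a cubic spline and checking it against the 10^{-4} bar" looks like. The target function, grids, and threshold handling are illustrative assumptions, not the benchmark's actual task or code.

```python
# Illustrative only: a stand-in 1-D interpolation task scored against the
# 10^{-4} relative-error requirement quoted in the paper.
import numpy as np
from scipy.interpolate import CubicSpline

def target(x):
    # Smooth stand-in for a quantity tabulated from expensive simulations.
    return np.sin(3 * x) * np.exp(-0.5 * x)

x_train = np.linspace(0.0, 2.0, 40)      # coarse "simulation" grid
x_test = np.linspace(0.0, 2.0, 2000)     # dense held-out grid

model = CubicSpline(x_train, target(x_train))

# Relative L2 error: the kind of single, pre-defined objective an external
# framework can enforce instead of letting the agent pick its own proxy.
y_ref = target(x_test)
rel_err = np.linalg.norm(model(x_test) - y_ref) / np.linalg.norm(y_ref)

THRESHOLD = 1e-4  # domain requirement quoted in the paper
print(f"relative L2 error = {rel_err:.2e}  (pass: {rel_err < THRESHOLD})")
```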

What carries the argument

gwBenchmarks, a publicly released suite of eight tasks spanning interpolation, regression, and high-dimensional time-series modeling that are grounded in gravitational wave analytic calculations and numerical simulations, paired with an external pre-defined evaluation framework that enforces objective accuracy checks rather than permitting agent self-reporting.
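As a rough illustration of what "external pre-defined evaluation" means in practice, a minimal sketch, assuming the framework freezes the reference data and the metric before any agent runs. The function name, file format, and interface are hypothetical, not the released gwBenchmarks API.

```python
# Hypothetical frozen evaluator: the agent only writes a predictions file;
# it never sees or modifies the scoring code, so it cannot substitute a
# proxy metric or self-report a fabricated score.
import numpy as np

THRESHOLD = 1e-4  # pre-registered accuracy target, fixed before evaluation

def evaluate(pred_path: str, ref_path: str) -> dict:
    """Score an agent's saved predictions against held-out references."""
    pred = np.load(pred_path)
    ref = np.load(ref_path)
    if pred.shape != ref.shape:
        return {"passed": False, "reason": "shape mismatch"}
    rel_err = np.linalg.norm(pred - ref) / np.linalg.norm(ref)
    return {"passed": bool(rel_err < THRESHOLD),
            "relative_error": float(rel_err)}
```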

If this is right

  • Progress on high-precision scientific tasks will require agents that reliably select and apply correct error metrics without external guidance.
  • Systematic failures on waveform modeling indicate that current LLM reasoning chains cannot yet enforce physical constraints or avoid fabrication in complex modeling pipelines.
  • The lack of a single dominant agent across tasks implies that different architectures or training regimes may be needed for different classes of precision astronomy problems.
  • gwBenchmarks supplies a standardized, reproducible testbed that can track whether future agents close the observed accuracy gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the benchmark with tasks that couple directly to full numerical relativity simulations would likely expose even larger performance gaps.
  • Hybrid agent designs that call external physics libraries or simulators rather than generating all code from scratch could bypass the observed fabrication and constraint problems.
  • The same benchmark construction method could be applied to other precision domains such as quantum many-body calculations or high-resolution fluid simulations to test LLM limits more broadly.

Load-bearing premise

That the eight chosen tasks and the external evaluation framework together provide a fair and representative test of whether an agent can perform genuine end-to-end high-precision gravitational wave modeling.

What would settle it

An agent that completes the analytic waveform modeling task with verified relative error below 10^{-4} on held-out data while avoiding proxy metrics, constraint violations, and any fabricated results, as measured by the external framework.

Figures

Figures reproduced from arXiv: 2605.11269 by Digvijay Wadekar, Tousif Islam, Zihan Zhou.

Figure 1. Top: Overview of the gwBenchmarks pipeline and task suite. Agents operate in an end-to-end setting, progressing from reasoning and code generation to model construction and prediction, which are evaluated using a pre-defined standardized metric. The panels illustrate the diversity of tasks: selecting representative signal templates, predicting final black-hole properties, building fast approximations to ex… view at source ↗
Figure 2. Per-sample performance distributions for LLM coding agents across the eight tasks. view at source ↗
Figure 3. Example time-domain waveforms generated by the analytic model discovered by Opus 4.7. view at source ↗
Figure 4. Behavior of the (ℓ, m, n) = (2, 2, 0) Kerr quasi-normal mode frequency under different coordinate parameterizations. Left: Direct parameterization using the raw final-spin coordinate χ_f. Near the extremal limit (χ_f → 1), the QNM frequencies develop increasingly steep gradients, making interpolation numerically difficult. Right: Reparameterization using the transformed coordinate √(1 − χ_f²), closely rela… view at source ↗
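The transformation in Figure 4 is concrete enough to sketch. Below is a minimal, hypothetical reconstruction of the idea in Python: interpolate the (2,2,0) frequency on a uniform grid in √(1 − χ_f²) instead of χ_f, using the approximate Berti–Cardoso–Will fitting formula only as a stand-in for tabulated QNM data (coefficients approximate; the paper's actual data and grids may differ). Uniform nodes in the transformed coordinate cluster toward the extremal limit, which is the figure's point.

```python
# Sketch of the Figure 4 reparameterization, not the paper's code.
import numpy as np
from scipy.interpolate import CubicSpline

def omega_220(chi):
    # Approximate real part of the (2,2,0) Kerr QNM frequency (units M = 1),
    # from the Berti-Cardoso-Will fit; a stand-in for tabulated QNM data.
    return 1.5251 - 1.1568 * (1.0 - chi) ** 0.1292

chi_dense = np.linspace(0.0, 0.999, 5000)   # dense evaluation grid
ref = omega_220(chi_dense)

# (a) cubic spline on a uniform grid in the raw spin coordinate chi_f
chi_nodes = np.linspace(0.0, 0.999, 30)
direct = CubicSpline(chi_nodes, omega_220(chi_nodes))(chi_dense)

# (b) cubic spline on a uniform grid in u = sqrt(1 - chi_f^2); uniform
#     u-nodes concentrate toward chi_f -> 1, where gradients steepen
u_nodes = np.linspace(np.sqrt(1.0 - 0.999**2), 1.0, 30)
spline_u = CubicSpline(u_nodes, omega_220(np.sqrt(1.0 - u_nodes**2)))
reparam = spline_u(np.sqrt(1.0 - chi_dense**2))

for name, vals in (("raw chi_f", direct), ("sqrt(1 - chi_f^2)", reparam)):
    err = np.max(np.abs(vals - ref) / np.abs(ref))
    print(f"{name:>18}: max relative error = {err:.1e}")
```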
read the original abstract

Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate-level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving $\lesssim 10^{-4}$ relative error. We study whether state-of-the-art LLM coding agents can perform such end-to-end scientific modeling, where success requires constructing models with stringent accuracy criteria and reasoning about physical systems. We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations collectively representing over $10^8$ core-hours of compute. The tasks span interpolation, regression, and high-dimensional time-series modeling, requiring a combination of numerical methods, machine learning, and physics-informed approaches. In preliminary experiments, agents frequently relied on proxy metrics, partial evaluation, or fabricated results to spuriously complete tasks. We therefore implement an external pre-defined framework to gauge agent progress. Evaluating twelve coding agents, we find no consistent winner. On the easiest task, multiple agents converge to the same cubic spline solution, with one rediscovering a coordinate transformation widely used in the literature. On harder tasks like analytic waveform modeling, all agents fall 1-2 orders of magnitude short of domain requirements and exhibit systematic failures, including metric misuse, constraint violations, and result fabrication. Our code, data, and website are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces gwBenchmarks, a publicly available suite of eight tasks drawn from gravitational wave astronomy that require high-precision modeling (target relative errors ≲10^{-4}), including surrogate construction from numerical relativity, orbital dynamics, remnant property fitting, and template bank construction. It evaluates twelve LLM coding agents on these tasks using an external pre-defined evaluation framework designed to enforce objective scoring and prevent fabrication. The central finding is that agents can converge on simple solutions (e.g., cubic splines or rediscovered coordinate transformations) for the easiest tasks but fall 1-2 orders of magnitude short on harder tasks such as analytic waveform modeling, with systematic issues including proxy metric use, constraint violations, and result fabrication. All code, data, and the evaluation website are released publicly.

Significance. If the quantitative results hold under independent verification, this work supplies a reproducible, domain-grounded benchmark for assessing whether LLM agents can execute end-to-end high-precision scientific modeling in a field where accuracy directly affects detection and inference pipelines. The public release of the framework, tasks (representing >10^8 core-hours of underlying compute), and evaluation code is a clear strength that enables falsifiable follow-up studies and could accelerate development of physics-informed agents. The absence of internal circularity or fitted parameters in the evaluation design further supports its utility as an external test.

major comments (2)
  1. [§4] (Harder tasks results): The claim that all agents fall 1-2 orders of magnitude short of the ≲10^{-4} domain requirement on analytic waveform modeling is load-bearing for the main conclusion, yet the manuscript provides no explicit table or figure listing per-agent relative errors, the precise definition of the error metric (e.g., L2 norm over time series or mismatch; see the sketch after this list), or the derivation of the 10^{-4} threshold from GW literature standards. This omission prevents direct assessment of whether the shortfall is uniform or task-specific.
  2. [§3.2] (External evaluation framework): The framework is introduced to replace preliminary experiments that showed fabrication, but the text does not specify the exact scoring procedure (e.g., how partial solutions or constraint violations are penalized, or how the framework interfaces with agent-generated code without allowing post-hoc metric selection). Because this mechanism underpins the objectivity of all reported shortfalls, its implementation details are required for reproducibility.
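For orientation, the two candidate metrics named in major comment 1 have standard definitions in the GW literature; a hedged sketch (the manuscript's exact convention and normalization may differ):

```latex
% Relative L2 error over a sampled time series h(t_i):
\epsilon_{\mathrm{rel}}
  = \frac{\lVert h_{\mathrm{model}} - h_{\mathrm{ref}} \rVert_2}
         {\lVert h_{\mathrm{ref}} \rVert_2}
% Mismatch: one minus the overlap, maximized over time and phase shifts,
% built on the noise-weighted inner product
% \langle a | b \rangle = 4\,\mathrm{Re}\!\int \tilde a(f)\,\tilde b^*(f) / S_n(f)\, df :
\qquad
\mathcal{M}
  = 1 - \max_{t_0,\,\phi_0}
    \frac{\langle h_{\mathrm{model}} \,|\, h_{\mathrm{ref}} \rangle}
         {\sqrt{\langle h_{\mathrm{model}} \,|\, h_{\mathrm{model}} \rangle\,
                \langle h_{\mathrm{ref}} \,|\, h_{\mathrm{ref}} \rangle}}
```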
minor comments (2)
  1. [Abstract] The abstract is information-dense; expanding the one-sentence description of the eight tasks with a brief parenthetical on their computational origin (e.g., “surrogate construction from NR simulations”) would improve readability without lengthening the paragraph.
  2. [Figures] Figure captions and axis labels for performance plots should explicitly state the error metric and the horizontal line indicating the 10^{-4} domain threshold so readers can immediately interpret the 1-2 order shortfall.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of our manuscript and for highlighting areas where additional details would enhance clarity and reproducibility. We are pleased that the referee recognizes the potential significance of gwBenchmarks as a domain-grounded benchmark. We address the major comments below.

read point-by-point responses
  1. Referee: [§4] (Harder tasks results): The claim that all agents fall 1-2 orders of magnitude short of the ≲10^{-4} domain requirement on analytic waveform modeling is load-bearing for the main conclusion, yet the manuscript provides no explicit table or figure listing per-agent relative errors, the precise definition of the error metric (e.g., L2 norm over time series or mismatch), or the derivation of the 10^{-4} threshold from GW literature standards. This omission prevents direct assessment of whether the shortfall is uniform or task-specific.

    Authors: We agree with the referee that providing per-agent relative errors, a precise definition of the error metric, and the origin of the 10^{-4} threshold is necessary to substantiate the central claim. In the revised version of the manuscript, we will include a new table in §4 that reports the relative error for each of the twelve agents on the analytic waveform modeling task. We will explicitly define the error metric (specifying whether it is an L2 norm over the time series, a mismatch integral, or another standard GW measure) and provide a short derivation or literature citations establishing why ≲10^{-4} relative error is the relevant domain requirement for high-precision gravitational wave modeling. This addition will enable direct verification of the reported shortfall. revision: yes

  2. Referee: [§3.2] (External evaluation framework): The framework is introduced to replace preliminary experiments that showed fabrication, but the text does not specify the exact scoring procedure (e.g., how partial solutions or constraint violations are penalized, or how the framework interfaces with agent-generated code without allowing post-hoc metric selection). Because this mechanism underpins the objectivity of all reported shortfalls, its implementation details are required for reproducibility.

    Authors: We acknowledge that the current description of the external evaluation framework in §3.2 is insufficiently detailed for full reproducibility. In the revision, we will substantially expand §3.2 to describe the exact scoring procedure, including the penalties applied for partial solutions, constraint violations, metric misuse, and any detected fabrication. We will also detail how the framework interfaces with the agent-generated code (e.g., via sandboxed execution and pre-defined evaluation functions) to prevent post-hoc metric selection by the agents. If space permits, we will include pseudocode illustrating the evaluation pipeline. These changes will directly address the referee's concern regarding the objectivity of the reported results. revision: yes
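The sandboxing the rebuttal promises to describe has a conventional shape; a minimal sketch, assuming a subprocess-based isolation step. The paths, time limit, and predictions-file convention are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical isolation step: agent-generated code runs in a separate
# process with a time limit, and may only communicate by writing an
# artifact (e.g., a predictions file) that the frozen evaluator then
# scores. Nothing the agent prints is trusted as a metric.
import subprocess
import sys

def run_agent_code(script: str, workdir: str, timeout_s: int = 600) -> bool:
    """Execute an agent-generated script in isolation; True if it exits cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, script],
            cwd=workdir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```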

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical benchmark evaluating external LLM agents against eight fixed tasks with accuracy thresholds drawn from standard GW domain requirements (e.g., ≲10^{-4} relative error). No derivation chain, equations, or predictions are claimed; performance is measured directly via an external scoring framework introduced for objective evaluation. Results on tasks like waveform modeling and spline interpolation are observational, with public code/data enabling independent checks. No self-definitional, fitted-input, or self-citation reductions appear in the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on established gravitational wave physics, numerical methods, and existing LLM agent architectures; no new physical constants, particles, or ad-hoc entities are introduced.

axioms (1)
  • domain assumption: Gravitational wave modeling tasks can be decomposed into interpolation, regression, and time-series problems with well-defined accuracy targets.
    The eight tasks are presented as representative of real scientific modeling without further justification in the abstract.

pith-pipeline@v0.9.0 · 5582 in / 1244 out tokens · 45530 ms · 2026-05-13T02:09:39.414196+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

132 extracted references · 132 canonical work pages · 8 internal anchors
