pith. machine review for the scientific record.

arxiv: 2605.08212 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.CL · gr-qc · hep-th

LLMs with in-context learning for Algorithmic Theoretical Physics

Anamaria Hell, Leander Thiele

Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · gr-qc · hep-th
keywords large language models · in-context learning · cosmological perturbations · modified gravity · computer algebra systems · algorithmic computations · theoretical physics

The pith

Frontier LLMs supplied with worked examples solve most cosmological perturbation problems in modified gravity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can carry out the detailed algorithmic computations that arise in theoretical physics when they have access to a computer algebra system and are given worked examples in their prompt context. The specific domain is the derivation of cosmological perturbations in modified theories of gravity, a task that is conceptually straightforward yet full of opportunities for algebraic error. If the approach succeeds, physicists could shift effort away from routine symbolic work toward interpreting results and exploring new models. The authors document the successes, the recurring failure modes, and practical ways to reduce those failures.
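For orientation, the starting point of such a calculation can be written down explicitly. The following is a minimal sketch, assuming a spatially flat FLRW background in conformal time and one common sign convention; the paper's own conventions and variable names are not given on this page, so the symbols below are assumptions. It shows the scalar sector of the perturbed metric:

```latex
ds^2 = a^2(\eta)\Big\{ -(1+2\phi)\,d\eta^2
       + 2\,\partial_i B\,dx^i\,d\eta
       + \big[(1-2\psi)\,\delta_{ij} + 2\,\partial_i\partial_j E\big]\,dx^i\,dx^j \Big\}
```

Here a(η) is the scale factor and φ, ψ, B, E are four scalar perturbations, two of which can be removed by a choice of gauge. Expanding a gravitational action to second order in variables of this kind and eliminating the constrained ones is typical of the routine but error-prone algebra described above.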

Core claim

Interfacing a frontier LLM with Maple and providing several worked examples enables the model to solve most of the chosen test problems involving cosmological perturbations in modified gravity theories. The paper records typical failure patterns and shows how additional context improves reliability.

What carries the argument

The LLM-CAS interface driven by in-context learning from worked examples, which steers the model to produce and execute correct symbolic steps.
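The interface described above can be pictured as a loop in which the model proposes CAS input, the CAS runs it, and the output is fed back as context. The sketch below is only an illustration under stated assumptions: `model` and `run_cas` are hypothetical callables standing in for the LLM and for Maple, the `FINAL:` convention is invented for this example, and nothing here reproduces the authors' actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Message = Tuple[str, str]  # (role, text); roles used here: "user", "assistant", "cas"


@dataclass
class Transcript:
    messages: List[Message] = field(default_factory=list)
    turns: int = 0  # number of model replies so far


def build_prompt(worked_examples: List[str], problem: str) -> str:
    """Place the worked examples before the new problem (in-context learning)."""
    parts = [f"Example {i}:\n{ex}" for i, ex in enumerate(worked_examples, 1)]
    parts.append(f"Problem:\n{problem}")
    return "\n\n".join(parts)


def solve(
    problem: str,
    worked_examples: List[str],
    model: Callable[[List[Message]], str],  # hypothetical LLM wrapper
    run_cas: Callable[[str], str],          # hypothetical CAS runner
    max_turns: int = 10,
) -> Tuple[Optional[str], Transcript]:
    """Alternate between model replies and CAS execution until a final answer."""
    transcript = Transcript([("user", build_prompt(worked_examples, problem))])
    while transcript.turns < max_turns:
        reply = model(transcript.messages)
        transcript.turns += 1
        transcript.messages.append(("assistant", reply))
        if reply.startswith("FINAL:"):  # assumed convention for a final answer
            return reply[len("FINAL:"):].strip(), transcript
        # Anything else is treated as CAS input; its output becomes new context.
        transcript.messages.append(("cas", run_cas(reply)))
    return None, transcript  # turn budget exhausted without a final answer


# Stand-ins so the sketch runs without an LLM or a CAS installed.
def stub_model(messages: List[Message]) -> str:
    role, text = messages[-1]
    return f"FINAL: {text}" if role == "cas" else "simplify((x^2 - 1)/(x - 1));"


def stub_cas(code: str) -> str:
    return "x + 1"


answer, transcript = solve(
    "Simplify (x^2 - 1)/(x - 1).", ["(worked example text)"], stub_model, stub_cas
)
```

Counting `transcript.turns` mirrors the turn counts reported in the figures; a restart counter could be added by wrapping `solve` in an outer retry loop.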

If this is right

  • Routine but error-prone perturbative algebra in cosmology can be assisted or automated for the majority of standard cases.
  • Researchers can test larger families of modified gravity models without proportional increases in manual calculation time.
  • The rate of overlooked algebraic subtleties drops when the model is guided by concrete prior solutions.
  • The same pattern of example-guided LLM use can be applied to other algorithmic tasks that combine symbolic manipulation with physical constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be checked on calculations drawn from other subfields such as black-hole perturbation theory or early-universe cosmology beyond the tested set.
  • Running the LLM output against existing published derivations might surface discrepancies that warrant re-examination by humans.
  • Extending the test suite to higher-order or non-linear perturbations would reveal where current model limits appear.

Load-bearing premise

The selected test problems capture the subtleties and edge cases that actually arise in real cosmological perturbation calculations for modified gravity theories.

What would settle it

Apply the same interface and prompting style to a fresh, previously unseen perturbation calculation in the same domain and compare the LLM output against an independent expert derivation; mismatch on edge cases would falsify reliable performance.
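One cheap first pass at the comparison described above is to evaluate both results at random parameter values before attempting a full symbolic check. The sketch below is a hedged illustration: the function name, the sampling range, and the two toy expressions are assumptions made for this example, and numerical agreement is evidence of equivalence rather than proof, so a symbolic check in the CAS remains the stronger test.

```python
import math
import random
from typing import Callable


def agree_numerically(
    candidate: Callable[..., float],
    reference: Callable[..., float],
    n_args: int,
    trials: int = 200,
    rel_tol: float = 1e-9,
    seed: int = 0,
) -> bool:
    """Return True if both callables agree at `trials` random sample points."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        # Sample away from zero to avoid spurious division-by-zero failures.
        args = [rng.uniform(0.5, 2.0) for _ in range(n_args)]
        if not math.isclose(
            candidate(*args), reference(*args), rel_tol=rel_tol, abs_tol=1e-12
        ):
            return False
    return True


# Toy expressions (hypothetical): two algebraically equal forms and one wrong sign.
def expert(k: float, a: float) -> float:
    return k / a + a / k


def llm_ok(k: float, a: float) -> float:
    return (k**2 + a**2) / (k * a)


def llm_bad(k: float, a: float) -> float:
    return k / a - a / k


same = agree_numerically(llm_ok, expert, n_args=2)
different = agree_numerically(llm_bad, expert, n_args=2)
```

A sign error of the kind in `llm_bad` is caught at the first sample point, which is why such a check is useful as a filter before a human reviews the edge cases.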

Figures

Figures reproduced from arXiv: 2605.08212 by Anamaria Hell, Leander Thiele.

Figure 1
Figure 1: Results for the problems depending on the provided context. Axis labels recovered from the figure: problems R2Fs, sRFs, sRFv, sRFt, sRMs, sRMt, sRi2Ms, sRi2Fs, sRi2Ft; contexts 10ex, 3broad, 3tailored, instruction.
Figure 2
Figure 2: The lower panel shows the solution length as turns|restarts depending on context (horizontal) and problem (vertical). The upper panel summarizes the table in terms of mean (solid) and median (dashed).
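The mean and median summary mentioned in the Figure 2 caption can be reproduced from cells written as turns|restarts. The sketch below assumes that cell format only; the sample cells are made up for illustration and are not values from the paper.

```python
from statistics import mean, median
from typing import Dict, List, Tuple


def parse_cell(cell: str) -> Tuple[int, int]:
    """Split a 'turns | restarts' cell into two integers."""
    turns, restarts = (int(part) for part in cell.split("|"))
    return turns, restarts


def summarize(cells: List[str]) -> Dict[str, float]:
    """Mean and median number of turns over the cells of one context."""
    turns = [parse_cell(cell)[0] for cell in cells]
    return {"mean_turns": mean(turns), "median_turns": median(turns)}


# Hypothetical cells for a single context (illustrative values only).
example_cells = ["12 | 1", "20 | 2", "40 | 1"]
summary = summarize(example_cells)
```

Reporting both statistics is a reasonable choice when a few long runs dominate, since the median is less sensitive to them than the mean.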
Original abstract

There is an increasing number of algorithmic computations in theoretical physics. These, while conceptually simple, can nevertheless be time-consuming and contain subtleties that should not be overlooked. Given the recent improvement of Large Language Models (LLM), it is natural to investigate whether LLMs equipped with a computer algebra system (CAS) runtime and sufficiently informative context can reliably carry out these algorithmic tasks. In this work, we interface Claude with Maple, and apply this framework to cosmological perturbations in modified theories of gravity. We demonstrate the current capabilities of this approach, the typical failures, and how the same can be improved. We find that a frontier LLM supplied with worked examples is able to solve most test problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes an interface between the frontier LLM Claude and the Maple computer algebra system, using in-context learning with worked examples to perform algorithmic computations in theoretical physics. The central focus is on cosmological perturbation theory in modified gravity models; the authors report that this setup enables the LLM to solve most of their test problems, while cataloguing typical failure modes and suggesting improvements to the prompting and runtime framework.

Significance. If the central claim holds under rigorous evaluation, the work would demonstrate a practical, reproducible method for automating conceptually simple yet subtlety-prone calculations in cosmology and modified gravity. This could accelerate research by reducing manual effort on perturbation expansions, constraint handling, and background consistency checks, while providing a template for LLM+CAS hybrids in other areas of theoretical physics. The explicit discussion of failure modes is a strength that could guide future system design.

major comments (2)
  1. [Abstract and results section] The claim that a frontier LLM 'is able to solve most test problems' is presented without any quantitative metrics (success rate, total number of test cases, breakdown by model or order, or comparison against baselines such as direct Maple scripting or other LLMs). This absence makes it impossible to assess whether the reported performance supports the central claim of reliability for algorithmic tasks.
  2. [Test-suite description (likely §3 or §4)] The manuscript provides no explicit enumeration or characterization of the test problems. It is therefore unclear whether the suite includes representative edge cases such as higher-order perturbations, gauge-invariant variable choices, constraint propagation, background consistency in f(R) or Horndeski models, or non-canonical kinetic terms. Without such coverage, success on the chosen tests does not establish robustness for real cosmological perturbation calculations.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two concrete examples of both a successfully solved problem and a typical failure mode to give readers an immediate sense of the capabilities and limitations.
  2. Notation for the Maple interface and the in-context prompt templates should be standardized early in the paper to improve readability for readers unfamiliar with the specific LLM-CAS setup.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for greater clarity and rigor.

point-by-point responses
  1. Referee: [Abstract and results section] The claim that a frontier LLM 'is able to solve most test problems' is presented without any quantitative metrics (success rate, total number of test cases, breakdown by model or order, or comparison against baselines such as direct Maple scripting or other LLMs). This absence makes it impossible to assess whether the reported performance supports the central claim of reliability for algorithmic tasks.

    Authors: We agree that the absence of quantitative metrics weakens the central claim. The current manuscript states only that the LLM solves 'most' test problems without providing success rates, total case counts, breakdowns by order or model, or baseline comparisons. In the revised manuscript we will add a dedicated results subsection with a summary table reporting the total number of test cases, overall and per-category success rates, breakdowns by perturbation order and modified-gravity model, and direct comparisons against manual Maple scripting on the same problems. This will allow readers to evaluate reliability quantitatively. revision: yes

  2. Referee: [Test-suite description (likely §3 or §4)] The manuscript provides no explicit enumeration or characterization of the test problems. It is therefore unclear whether the suite includes representative edge cases such as higher-order perturbations, gauge-invariant variable choices, constraint propagation, background consistency in f(R) or Horndeski models, or non-canonical kinetic terms. Without such coverage, success on the chosen tests does not establish robustness for real cosmological perturbation calculations.

    Authors: We acknowledge that an explicit characterization of the test suite is required to demonstrate coverage. While the manuscript applies the framework to cosmological perturbations in modified gravity and catalogues typical failure modes, it does not enumerate the individual test problems or map them to the edge cases listed. In the revision we will insert a new subsection that lists and characterizes every test problem, indicating the gravity model (including f(R) and Horndeski), perturbation order, and whether it exercises gauge-invariant variables, constraint propagation, background consistency, or non-canonical kinetics. We will also state the limitations of the current suite and any planned extensions. revision: yes
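The summary table promised in the first response amounts to a small aggregation over per-problem outcomes. The sketch below shows one way to compute overall and per-category success rates; the category names and records are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(
    results: Iterable[Tuple[str, bool]],
) -> Tuple[float, Dict[str, float]]:
    """Return (overall rate, per-category rates) from (category, solved) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    solved: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        solved[category] += int(bool(ok))
    if not totals:
        raise ValueError("no results to summarize")
    per_category = {c: solved[c] / totals[c] for c in totals}
    overall = sum(solved.values()) / sum(totals.values())
    return overall, per_category


# Hypothetical records: (category, solved?). Illustrative only.
records = [
    ("scalar", True),
    ("scalar", False),
    ("vector", True),
    ("tensor", True),
]
overall, per_category = success_rates(records)
```

The same function could be keyed on gravity model or perturbation order instead of sector, which would give the breakdowns the referee asks for.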

Circularity Check

0 steps flagged

No circularity: empirical performance report on explicit test cases

full rationale

The paper reports an empirical demonstration that a frontier LLM (Claude) interfaced with Maple solves most author-selected test problems in cosmological perturbation theory when supplied with worked examples. No mathematical derivation chain, fitted parameters, ansatzes, or uniqueness theorems appear. The central claim is a direct count of successes on the provided test set rather than a prediction that reduces to the inputs by construction. No self-citations are invoked as load-bearing justification for the result. The evaluation is therefore self-contained against its own stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are mentioned in the abstract.

pith-pipeline@v0.9.0 · 5414 in / 996 out tokens · 30137 ms · 2026-05-12T00:56:52.256471+00:00 · methodology
