pith. machine review for the scientific record.

arxiv: 2604.14696 · v1 · submitted 2026-04-16 · ⚛️ physics.data-an

Recognition: unknown

Development of an LLM-Based System for Automatic Code Generation from HEP Publications

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:50 UTC · model grok-4.3

classification ⚛️ physics.data-an
keywords analysis, results, selection, code, llms, event, generate, open-weight

The pith

A two-stage LLM system extracts structured analysis selections from HEP papers and their references, then generates and validates executable code, achieving partial event-level matches on an ATLAS Higgs-to-four-leptons benchmark while remaining limited by hallucination and stochasticity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This work tests whether current large language models can help solve a practical problem in particle physics: making published analyses easier to reproduce. The authors created a two-part system. In the first part, an LLM reads a target paper plus papers it cites, identifies things like which particles to keep, what cuts to apply on their energy or momentum, and organizes those rules into a clean list. In the second part, the list is fed to the model again to write computer code that runs the analysis on real data. The code is then executed, checked against expected outputs, and the process repeats if errors appear. They tested the system on a well-known ATLAS analysis of the Higgs boson decaying into four leptons, using publicly released 2015-2016 collision data. A human-written version of the same analysis served as the comparison baseline. The models often recovered most of the documented selection rules and, in some attempts, produced code whose event-by-event decisions matched the baseline. However, the models sometimes invented rules that were not in the papers, the generated code failed to run, and different runs of the same prompt gave different results. The authors conclude the tools are already useful when a physicist supervises and corrects them, but not yet ready to work completely on their own.
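The generate-execute-validate loop described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `llm_generate` callable, the cutflow line format parsed here, and the retry budget are all hypothetical stand-ins.

```python
import os
import subprocess
import sys
import tempfile

def parse_cutflow(stdout: str) -> list[int]:
    """Parse lines like 'cut <name> <count>' into a list of surviving-event counts.
    (Hypothetical output convention; the paper's validation format is not given here.)"""
    return [int(line.split()[-1]) for line in stdout.splitlines() if line.startswith("cut ")]

def generate_and_validate(llm_generate, selection_list, expected_counts, max_iters=5):
    """Iteratively request analysis code from an LLM, execute it, and feed
    execution errors or cutflow mismatches back until the baseline is matched."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        code = llm_generate(selection_list, feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True, timeout=60)
        finally:
            os.unlink(path)
        if result.returncode != 0:
            # Return the traceback to the model and retry.
            feedback = "execution failed:\n" + result.stderr
            continue
        counts = parse_cutflow(result.stdout)
        if counts == expected_counts:
            return code, attempt  # validated against the baseline cutflow
        feedback = f"cutflow mismatch: got {counts}, expected {expected_counts}"
    return None, max_iters  # budget exhausted without a match
```

The loop terminates either on a cutflow match or after a fixed number of attempts, which is one simple way to bound the stochasticity the authors flag.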

Core claim

Our initial results show that recent open-weight models can recover many documented selection criteria from papers and references, and that in some runs they can generate event selections fully matching a baseline implementation at the event level.

Load-bearing premise

That iterative prompting and execution feedback can reliably overcome LLM stochasticity and hallucination to produce code that matches a human baseline without substantial human correction, an assumption the abstract itself flags as still problematic.

Figures

Figures reproduced from arXiv: 2604.14696 by Junichi Tanaka, Masahiko Saito, Tomoe Kishimoto.

Figure 1
Figure 1. Overview of the proposed two-stage workflow. Step 1 iteratively extracts and merges selection criteria from the target paper and its references. Step 2 uses the extracted criteria to sequentially generate, execute, and validate analysis code until successful reproduction is achieved.
Figure 2
Figure 2. Comparison of the four-lepton invariant mass distribution between (a) the published result [9] and (b) our manually reproduced baseline.
Figure 3
Figure 3. Comparison of Bulk and Chunk settings in Step 1: (a) correctly extracted cuts (out of 27 ground-truth cuts) and (b) hallucinations. Points represent successful runs, with horizontal bars indicating medians. Failed runs are omitted.
Original abstract

Ensuring the reproducibility of physics results is one of the crucial challenges in high-energy physics (HEP). In this study, we develop a proof-of-concept system that uses large language models (LLMs) to extract analysis procedures from HEP publications and generate executable analysis code for reproducing published results. Our method consists of two stages. In the first stage, open-weight LLMs extract event selection criteria, object definitions, and other relevant analysis information from a target paper and, when necessary, from its referenced publications, and then produce a structured selection list. In the second stage, the structured selection list is used to generate analysis code, which is then executed and validated iteratively. As a benchmark, we use the ATLAS $H \to ZZ^{*} \to 4\ell$ analysis based on proton-proton collision data recorded in 2015 and 2016 and released as ATLAS Open Data. This benchmark allows direct comparison between the generated results and the published analysis, as well as comparison with a manually developed baseline implementation. We separately evaluate selection extraction and code generation in order to clarify the current capabilities and limitations of open-weight LLMs for HEP analysis reproduction. Our initial results show that recent open-weight models can recover many documented selection criteria from papers and references, and that in some runs they can generate event selections fully matching a baseline implementation at the event level. At the same time, stochasticity, hallucination, and execution failure remain significant challenges. These results suggest that LLMs are already promising as human-in-the-loop tools for reproducibility support, although they are not yet reliable as fully autonomous HEP analysis agents. In this paper, we report the design of the prototype system and its initial performance evaluation.
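The abstract's "structured selection list" is described only at a high level here. A plausible machine-readable form might look like the following sketch, where the field names and provenance strings are hypothetical and the example cuts are drawn from the benchmark's ground-truth selection table:

```python
import json

# Hypothetical Step-1 output schema: each cut records the object it applies to,
# the variable, the condition, and where in the literature it was found.
selection_list = [
    {"object": "electron",   "variable": "E_T",   "condition": "> 7 GeV",     "source": "target paper"},
    {"object": "electron",   "variable": "|eta|", "condition": "< 2.47",      "source": "target paper"},
    {"object": "muon",       "variable": "p_T",   "condition": "> 5 GeV",     "source": "target paper"},
    {"object": "quadruplet", "variable": "m_12",  "condition": "50-106 GeV",  "source": "referenced paper"},
]
print(json.dumps(selection_list, indent=2))
```

A flat, human-readable record like this is what makes the intermediate result verifiable before any code is generated, which is the property the abstract emphasizes.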

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a proof-of-concept two-stage system that employs open-weight LLMs to extract event selection criteria, object definitions, and related information from HEP publications (and referenced papers when needed) into a structured list, followed by iterative generation and validation of executable analysis code. The system is evaluated on the ATLAS H → ZZ* → 4ℓ analysis using 2015–2016 proton-proton open data, with direct comparison to published results and a manually developed baseline implementation. The authors separately assess selection extraction and code generation, reporting that recent models recover many documented criteria and that in some runs the generated code produces event selections fully matching the baseline at the event level, while acknowledging persistent issues with stochasticity, hallucination, and execution failures. The work concludes that LLMs are promising as human-in-the-loop reproducibility aids but not yet reliable as fully autonomous agents.

Significance. If the reported qualitative successes can be substantiated with quantitative metrics, this prototype would represent a concrete step toward reducing the manual effort required to reproduce complex HEP analyses from the literature. The use of a public benchmark with an explicit baseline implementation and the separation of extraction versus code-generation evaluation are appropriate design choices that facilitate assessment. The explicit acknowledgment of current limitations strengthens the manuscript by framing the system realistically as an assistive tool rather than a complete solution.

major comments (2)
  1. [Abstract and Results section] The central claim in the abstract and results that LLMs 'in some runs' generate event selections 'fully matching a baseline implementation at the event level' is load-bearing for the paper's assessment of current capabilities, yet no quantitative details are supplied: number of trials performed, success fraction, definition of an event-level match (e.g., identical cutflow tables versus per-event agreement on the Open Data sample), or precision/recall for the extraction stage. Without these, it is impossible to determine whether the iterative prompting and execution-feedback loop reliably mitigates the stochasticity and hallucination problems the authors themselves flag.
  2. [Method section (second stage)] The description of the second-stage iterative validation process (code generation, execution, and feedback) lacks concrete operational details such as the typical number of iterations required, the distribution of execution failure modes encountered, or the extent of human corrections needed per successful run. These metrics are necessary to evaluate whether the claimed partial successes can be achieved with acceptable human effort.
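The extraction-stage precision/recall requested in the first major comment could be computed by matching extracted cuts against the 27 ground-truth cuts. A minimal sketch, assuming cuts have already been normalized to comparable strings (the matching criterion itself is the hard part and is glossed over here):

```python
def extraction_metrics(extracted: set[str], ground_truth: set[str]) -> tuple[float, float]:
    """Precision = correct / extracted; recall = correct / ground truth.
    Hallucinated cuts lower precision; missed cuts lower recall."""
    correct = extracted & ground_truth
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Toy example (hypothetical normalized cut strings): three of four extracted
# cuts are documented, one is hallucinated, and one documented cut was missed.
truth = {"e_pt>7", "e_eta<2.47", "mu_pt>5", "m12 in 50-106"}
found = {"e_pt>7", "e_eta<2.47", "mu_pt>5", "mu_pt>20"}
p, r = extraction_metrics(found, truth)
```

Reporting these two numbers per run, rather than a single aggregate, would let readers separate the hallucination problem (precision) from the coverage problem (recall).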
minor comments (2)
  1. [Abstract] The abstract would be strengthened by the inclusion of at least one concrete quantitative indicator (e.g., fraction of selection criteria recovered or success rate across runs) to give readers an immediate sense of scale.
  2. [Benchmark and Evaluation] Clarify the exact criteria used to declare a 'match' between generated and baseline code outputs, including any tolerance for floating-point differences or ordering of cuts.
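One concrete way to define the 'match' the second minor comment asks about is per-event agreement of boolean selection decisions. A sketch under that assumption (the paper's actual criterion is not specified in this summary):

```python
def event_level_match(generated: list[bool], baseline: list[bool]) -> dict:
    """Compare per-event pass/fail decisions from generated code against the
    baseline; a 'full match' means every event decision agrees."""
    if len(generated) != len(baseline):
        raise ValueError("decision lists must cover the same events")
    disagreements = [i for i, (g, b) in enumerate(zip(generated, baseline)) if g != b]
    return {
        "full_match": not disagreements,
        "agreement": (1.0 - len(disagreements) / len(baseline)) if baseline else 1.0,
        "first_disagreement": disagreements[0] if disagreements else None,
    }
```

Because the decisions are booleans, no floating-point tolerance is needed at this level; tolerances only enter earlier, when each cut threshold is applied to reconstructed quantities.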

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive major comments. We agree that additional quantitative and operational details will strengthen the manuscript and clarify the current capabilities and limitations of the system. We respond to each comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Results section] The central claim in the abstract and results that LLMs 'in some runs' generate event selections 'fully matching a baseline implementation at the event level' is load-bearing for the paper's assessment of current capabilities, yet no quantitative details are supplied: number of trials performed, success fraction, definition of an event-level match (e.g., identical cutflow tables versus per-event agreement on the Open Data sample), or precision/recall for the extraction stage. Without these, it is impossible to determine whether the iterative prompting and execution-feedback loop reliably mitigates the stochasticity and hallucination problems the authors themselves flag.

    Authors: We agree that the manuscript would benefit from more quantitative information to support the claims. The current presentation is qualitative because the work is a proof-of-concept demonstration of the two-stage system. To address the referee's concern, we will revise the abstract and results section to include quantitative details from our evaluation runs, such as the number of trials performed for code generation, the fraction of runs achieving full event-level match, a clear definition of what constitutes an event-level match, and precision/recall metrics for the selection extraction stage. This will provide a better basis for assessing the effectiveness of the iterative feedback loop in handling stochasticity and hallucination. revision: yes

  2. Referee: [Method section (second stage)] The description of the second-stage iterative validation process (code generation, execution, and feedback) lacks concrete operational details such as the typical number of iterations required, the distribution of execution failure modes encountered, or the extent of human corrections needed per successful run. These metrics are necessary to evaluate whether the claimed partial successes can be achieved with acceptable human effort.

    Authors: We concur that more concrete details on the iterative process are needed to assess the practicality of the system. In the revised manuscript, we will expand the method section to describe the operational aspects of the second stage, including the typical number of iterations in the validation loop, the main categories of execution failures observed, and the nature and extent of any human corrections applied in successful cases. These additions will help readers understand the human effort required for the partial successes reported. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an applied systems prototype that relies on existing open-weight LLMs and public ATLAS Open Data rather than introducing new theoretical entities or fitted parameters.

axioms (1)
  • domain assumption: Open-weight LLMs can extract complex, domain-specific selection criteria from scientific text and references with usable accuracy when prompted appropriately.
    This underpins the first extraction stage and is implicitly required for the system to function.

pith-pipeline@v0.9.0 · 5615 in / 1389 out tokens · 63437 ms · 2026-05-10T08:50:49.026575+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1] W. Esmail, A. Hammad and M. Nojiri, CoLLM: AI engineering toolbox for end-to-end deep learning in collider analyses, 2026. arXiv:2602.06496 [hep-ph]

  2. [2] E. Gendreau-Distler, J. Ho, D. Kim, L.T.L. Pottier, H. Wang and C. Yang, Automating High Energy Physics Data Analysis with LLM-Powered Agents, 2025. arXiv:2512.07785 [physics.data-an]

  3. [3] E.A. Moreno, S. Bright-Thonney, A. Novak, D. Garcia and P. Harris, AI Agents Can Already Autonomously Perform Experimental High Energy Physics, 2026. arXiv:2603.20179 [hep-ex]

  4. [4] ATLAS Collaboration, Measurement of the Higgs boson mass in the H → ZZ* → 4ℓ and H → γγ channels with √s = 13 TeV pp collisions using the ATLAS detector, Physics Letters B 784 (2018) 345

  5. [5] ATLAS Collaboration, ATLAS DAOD_PHYSLITE format Run 2 2015 proton-proton collision data, 2024. http://doi.org/10.7483/OPENDATA.ATLAS.AOQL.8TT3

  6. [6] ATLAS Collaboration, ATLAS DAOD_PHYSLITE format Run 2 2016 proton-proton collision data, 2024. http://doi.org/10.7483/OPENDATA.ATLAS.4ZES.DJHA

  7. [7] ATLAS Collaboration, ATLAS DAOD_PHYSLITE format MC simulation Higgs nominal samples, 2024. http://doi.org/10.7483/OPENDATA.ATLAS.Z2J9.709J

  8. [8] ATLAS Collaboration, ATLAS DAOD_PHYSLITE format MC simulation electroweak boson nominal samples, 2024. http://doi.org/10.7483/OPENDATA.ATLAS.K5SU.X65Y

  9. [9] ATLAS Collaboration, Measurement of inclusive and differential cross sections in the H → ZZ* → 4ℓ decay channel in pp collisions at √s = 13 TeV with the ATLAS detector, Journal of High Energy Physics 2017 (2017) 132

  10. [10] "Marker." https://github.com/datalab-to/marker

  11. [11] H. Chase, LangChain, Oct. 2022. https://github.com/langchain-ai/langchain

  12. [12] "LangGraph." https://github.com/langchain-ai/langgraph

  13. [13] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C.H. Yu et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  14. [14] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng et al., Qwen3 Technical Report, 2025. arXiv:2505.09388 [cs.CL]

  15. [15] OpenAI, gpt-oss-120b & gpt-oss-20b Model Card, 2025. arXiv:2508.10925 [cs.CL]

  16. [16] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon et al., Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, 2025. arXiv:2507.06261 [cs.CL]

  17. [17] Pre-selection: 1.1 (*) Good Run List; 1.2 (*) Trigger requirements; 1.3 number of primary vertices > 0

  18. [18] Electron: 2.1 Loose criteria; 2.2 E_T > 7 GeV; 2.3 |η| < 2.47; 2.4 p_T^cone20/E_T < 0.15; 2.5 E_T^cone20/E_T < 0.20; 2.6 |z_0 sin θ| < 0.5 mm; 2.7 |d_0/σ(d_0)| < 5

  19. [19] Muon: 3.1 |η| < 2.7 (3.1.1 |η| < 0.1 for segment-tagged and calo-tagged muons; 3.1.2 0.1 < |η| < 2.5 for combined muons; 3.1.3 2.4 < |η| < 2.7 for muon-spectrometer standalone muons); 3.2 p_T > 5 GeV (3.2.1 p_T > 15 GeV for calo-tagged muons); 3.3 p_T^cone30/p_T < 0.15; 3.4 E_T^cone20/p_T < 0.30; 3.5 |z_0 sin θ| < 0.5 mm; 3.6 |d_0/σ(d_0)| < 3; 3.7 |d_0| < 1 mm

  20. [20] Measurement of inclusive and differential cross sections in the $H \rightarrow ZZ^* \rightarrow 4\ell$ decay channel in $pp$ collisions at $\sqrt{s}$ = 13 TeV with the ATLAS detector

      Quadruplet: 4.1 number of same-flavour opposite-sign lepton pairs ≥ 2; 4.2 50 < m_12 < 106 GeV; 4.3 12 < m_34 < 115 GeV; 4.4 p_T > 20, 15, 10 GeV for the 1st, 2nd, 3rd leptons; 4.5 ΔR(ℓ, ℓ) > 0.1 (0.2) for same-flavour (opposite-flavour) pairs; 4.6 m_ℓℓ > 5 GeV for same-flavour opposite-sign leptons; 4.7 (*) four-lepton vertex fit; 4.8 number of combined muons ≥ 3 for the 4μ channel; 4.9 (*) Z mass constrai…
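The object-level cuts listed in entries [18] and [19] translate almost directly into code. A toy pure-Python sketch on a hypothetical flat record format (the real analysis runs on ATLAS PHYSLITE data; the electron 'Loose' identification and the muon-type-dependent sub-cuts are omitted for brevity):

```python
def passes_electron_cuts(el: dict) -> bool:
    """Electron selection from the ground-truth list: E_T > 7 GeV, |eta| < 2.47,
    track and calorimeter isolation, and impact-parameter cuts.
    Field names of the `el` record are hypothetical."""
    return (
        el["et"] > 7.0                        # E_T in GeV
        and abs(el["eta"]) < 2.47
        and el["ptcone20"] / el["et"] < 0.15  # track isolation
        and el["etcone20"] / el["et"] < 0.20  # calorimeter isolation
        and abs(el["z0_sin_theta"]) < 0.5     # mm
        and abs(el["d0_sig"]) < 5.0           # |d0 / sigma(d0)|
    )

def passes_muon_kinematics(mu: dict) -> bool:
    """Baseline muon kinematic cuts: |eta| < 2.7 and p_T > 5 GeV
    (the eta ranges per muon type in 3.1.x and the calo-tagged
    p_T > 15 GeV requirement in 3.2.1 are not modeled here)."""
    return abs(mu["eta"]) < 2.7 and mu["pt"] > 5.0
```

Reproducing such cuts faithfully is exactly the extraction task the paper benchmarks: each boolean clause above corresponds to one line of the 27-cut ground-truth table, and a hallucinated clause would be an extra conjunct with no counterpart in that table.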