arxiv: 2604.22541 · v1 · submitted 2026-04-24 · ✦ hep-ex


Dr.Sai: An agentic AI for real-world physics analysis at BESIII

Beijiang Liu, Bolun Zhang, Changzheng Yuan, Dongbo Xiong, Fayu Jiang, Fazhi Qi, Hong Wang, Junkun Jiao, Ke Li, Mingfeng He, Mingrun Li, Tong Liu, WeiMin Song, Xiongfei Wang, Xuliang Zhu, Yipu Liao, Yue Sun, Zhengde Zhang, Zijie Shang


Pith reviewed 2026-05-08 09:17 UTC · model grok-4.3

classification ✦ hep-ex

keywords Dr.Sai · LLM agent · BESIII · J/psi decays · branching fractions · autonomous analysis · high energy physics · physics workflows


The pith

Dr.Sai, an LLM-powered multi-agent system, autonomously translates natural language into complete physics analysis workflows and reproduces established J/psi branching fraction measurements at BESIII.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dr.Sai as a multi-agent AI system that converts plain-language instructions into full high-energy physics analysis pipelines, including simulation, reconstruction, and statistical steps. It validates the system by directing it to re-measure ten J/psi decay branching fractions inside the actual BESIII computing environment, with no human-written code required at any stage. The resulting values agreed with existing expert measurements. A reader would care because petabyte-scale HEP datasets demand months or years of expert effort under current manual methods, creating a bottleneck as data volumes grow. If the approach holds, physicists could shift focus from implementing analysis code to designing measurements and interpreting outcomes.

Core claim

Dr.Sai is an LLM-powered multi-agent system that translates natural language into rigorous physics workflows. As validation, Dr.Sai performed large-scale re-measurements of ten J/psi decay branching fractions without manual coding. It successfully navigated the real BESIII computing environment and produced results matching established benchmarks. The article details Dr.Sai's architecture, the validation results, and performance evaluation.

What carries the argument

Dr.Sai, the LLM-powered multi-agent system that interprets natural language tasks, generates code for HEP tools such as ROOT and BOSS, and executes the full workflow in the target computing environment.

If this is right

Large-scale systematic scans of multiple decay channels become feasible without a proportional increase in human labor.
Analysis results can be reproduced and cross-checked more consistently by directing the same agent with identical instructions.
The interval between data taking and final physics results can be reduced by automating the workflow generation and execution steps.
The same architecture supplies a template for autonomous analysis pipelines in other data-heavy domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar agent systems could be deployed at other particle physics facilities that share comparable data processing chains.
Adding iterative feedback from partial results might allow the agent to refine selection criteria or fit procedures on its own.
The method could accelerate studies of low-rate processes that are currently impractical to analyze at scale by hand.
Integration with version-controlled analysis repositories could let the agent produce auditable, reusable workflows as a standard output.

Load-bearing premise

Large language models can accurately interpret complex scientific tasks, generate error-free code for specialized HEP tools, and manage real computing-environment quirks without undetected biases or failures.

What would settle it

Applying Dr.Sai to an independent set of J/psi decay channels or a different experiment and obtaining branching fraction values that systematically deviate from independent measurements beyond statistical uncertainties.
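One way to make "beyond statistical uncertainties" concrete is a pull: the difference between a measured and a reference value, divided by their uncertainties combined in quadrature. The sketch below assumes independent Gaussian uncertainties, which may not hold if a reference average already includes earlier measurements on the same data. The function names and all numbers are illustrative placeholders, not values from the paper or the PDG.

```python
import math


def pull(measured, sigma_measured, reference, sigma_reference):
    """Signed difference in units of the combined (quadrature) uncertainty."""
    combined = math.hypot(sigma_measured, sigma_reference)
    if combined == 0:
        raise ValueError("combined uncertainty must be positive")
    return (measured - reference) / combined


def chi2_over_channels(rows):
    """Sum of squared pulls, assuming the channels are uncorrelated."""
    return sum(pull(*row) ** 2 for row in rows)


# Placeholder inputs: (measured, sigma_measured, reference, sigma_reference).
# These are NOT values from the paper or from the PDG.
rows = [
    (1.00e-3, 0.03e-3, 1.04e-3, 0.04e-3),
    (2.10e-3, 0.10e-3, 2.00e-3, 0.10e-3),
]
pulls = [pull(*row) for row in rows]
chi2 = chi2_over_channels(rows)
print(pulls, chi2)
```

A set of pulls scattered around zero with unit width would be consistent with agreement; a common sign or a large summed chi-squared across channels would point to the systematic deviation described above.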

Original abstract

High Energy Physics (HEP) experiments like BESIII produce petabyte-scale data. Extracting physics results requires complex workflows (simulation, reconstruction, statistical analysis, etc.) that traditionally take experts months or years. Current manual methods are labor-intensive, prone to bias, and limit large-scale systematic scans. As data grows, this paradigm slows discovery. Large Language Models (LLMs) offer a solution. Their natural language understanding and code generation capabilities allow them to interpret scientific tasks and integrate with HEP tools (e.g., ROOT, BOSS) to act as an "AI partner" for autonomous analysis. We present Dr.Sai, an LLM-powered multi-agent system that translates natural language into rigorous physics workflows. As validation, Dr.Sai performed large-scale re-measurements of ten J/psi decay branching fractions - without manual coding. It successfully navigated the real BESIII computing environment and produced results matching established benchmarks. The article details Dr.Sai's architecture, the validation results, and performance evaluation. This work provides a blueprint for autonomous discovery, with relevance to other data-intensive fields like astronomy and genomics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

Dr.Sai builds a BESIII-specific multi-agent system that claims full autonomy on ten real branching-fraction measurements, but supplies none of the prompts, code, or logs needed to check it.


The paper's core contribution is a multi-agent LLM setup that translates plain-language tasks into complete BESIII workflows using BOSS, ROOT, and the live computing environment. It reports running ten J/psi decay analyses end-to-end and recovering branching fractions that line up with existing benchmarks, all without manual coding. That is a concrete engineering step beyond generic LLM science demos, and the architecture description shows how they wired the agents to actual HEP tools. Readers working on similar agent systems could pick up practical details on tool integration and task decomposition from that part alone.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Dr.Sai, an LLM-powered multi-agent system intended to automate complex high-energy physics analysis workflows at BESIII. The central claim is that Dr.Sai translates natural-language task descriptions into complete, executable pipelines (including BOSS/ROOT jobs for simulation, reconstruction, efficiency corrections, background subtraction, and fits) and successfully re-measured branching fractions for ten J/psi decay channels in the real BESIII computing environment, yielding results that match established benchmarks without any manual coding or human intervention.
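For orientation, the steps named in the summary (background subtraction, efficiency correction, normalization) are commonly combined into a branching fraction as B = (N_obs - N_bkg) / (N_J/psi · efficiency · product of intermediate branching fractions). The sketch below is a generic textbook form under simplifying assumptions, not the procedure Dr.Sai generates; the class name, field names, and input numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ChannelInputs:
    n_observed: float      # events passing selection in data
    n_background: float    # estimated background events
    efficiency: float      # selection efficiency from simulation, in (0, 1]
    n_jpsi: float          # total number of J/psi events in the sample
    sub_branching: float = 1.0  # product of intermediate branching fractions


def branching_fraction(c: ChannelInputs):
    """Return (value, statistical uncertainty) for one decay channel.

    Simplification: only the Poisson fluctuation of n_observed is propagated;
    background, efficiency, and n_jpsi are treated as exact.
    """
    if not 0 < c.efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    denominator = c.n_jpsi * c.efficiency * c.sub_branching
    value = (c.n_observed - c.n_background) / denominator
    stat = math.sqrt(c.n_observed) / denominator
    return value, stat


# Placeholder inputs for illustration only.
value, stat = branching_fraction(
    ChannelInputs(n_observed=12000, n_background=2000,
                  efficiency=0.5, n_jpsi=1e7)
)
print(f"B = {value:.3e} +/- {stat:.1e} (stat)")
```

A full analysis would also propagate systematic uncertainties on the efficiency, the background model, and the J/psi count, which is where the referee's request for quantitative detail applies.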

Significance. If the autonomy and correctness claims hold with full reproducibility, the work would be significant for HEP by showing a practical route to reduce the months-to-years effort required for standard analyses, enable broader systematic scans, and minimize workflow biases. The reported integration with live experimental infrastructure (rather than toy environments) is a concrete strength. The approach could also serve as a template for agentic systems in other data-intensive domains. However, the absence of released artifacts and quantitative validation metrics substantially limits the immediate scientific impact and verifiability of these contributions.

major comments (2)

[Validation results section] The manuscript states that Dr.Sai produced branching-fraction values matching established benchmarks for ten channels, but supplies no quantitative agreement metrics (e.g., differences from PDG values, combined uncertainties, or fit-quality measures), no statistics on attempt success/failure rates, and no description of how LLM hallucinations or code errors were detected and corrected. These details are load-bearing for assessing whether the system reliably navigated real-data complexities.
[Reproducibility and system description] The claim of fully autonomous execution without manual coding is central, yet the paper releases neither the natural-language prompts, the LLM-generated scripts and job files, the execution logs, nor the final output files for the ten channels. Without these artifacts, independent verification of the workflow autonomy and absence of hidden human guidance is impossible.

minor comments (1)

[Abstract and performance-evaluation section] The abstract and performance-evaluation section would benefit from explicit definitions of the success criteria used to declare a workflow 'successful' and from clearer notation distinguishing agent-generated code from any template or helper scripts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We agree that additional details on validation metrics and reproducibility are crucial for establishing the reliability and verifiability of Dr.Sai. Below we provide point-by-point responses and indicate the revisions we will make.


Referee: [Validation results section] The manuscript states that Dr.Sai produced branching-fraction values matching established benchmarks for ten channels, but supplies no quantitative agreement metrics (e.g., differences from PDG values, combined uncertainties, or fit-quality measures), no statistics on attempt success/failure rates, and no description of how LLM hallucinations or code errors were detected and corrected. These details are load-bearing for assessing whether the system reliably navigated real-data complexities.

Authors: We acknowledge that the current version of the manuscript lacks explicit quantitative metrics and detailed error-handling descriptions, which are important for a rigorous assessment. In the revised manuscript, we will add a dedicated subsection in the validation results with a table of measured branching fractions versus PDG values, including relative differences, uncertainties, and goodness-of-fit measures such as chi-squared per degree of freedom. We will also report success rates from multiple independent runs of the agent system and describe the built-in verification mechanisms, including cross-checks by specialized agents for code correctness and consistency with physics expectations. This will strengthen the evidence for autonomous navigation of real-data complexities without misrepresenting the original claims. (Revision: yes)
Referee: [Reproducibility and system description] The claim of fully autonomous execution without manual coding is central, yet the paper releases neither the natural-language prompts, the LLM-generated scripts and job files, the execution logs, nor the final output files for the ten channels. Without these artifacts, independent verification of the workflow autonomy and absence of hidden human guidance is impossible.

Authors: We agree that releasing the artifacts is essential for full reproducibility and independent verification. Although the initial submission emphasized the system description and results to highlight the conceptual advance, we will include the natural-language prompts, sample LLM-generated scripts, execution logs (anonymized where necessary), and output files as supplementary material or host them in a public GitHub repository linked in the revised manuscript. This will enable others to inspect the autonomy of the process. We note that the multi-agent architecture includes logging of all steps, which facilitates this release. (Revision: yes)
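If prompts, generated scripts, and logs are released as promised, a checksum manifest is one simple way for outside readers to confirm that the files they inspect are the ones the run produced. The sketch below is an assumption about how such a release could be packaged, not a description of Dr.Sai's logging; the artifact names and contents are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict[str, bytes]) -> str:
    """Serialize name, size, and digest for each artifact as JSON."""
    entries = [
        {"name": name, "bytes": len(blob), "sha256": sha256_hex(blob)}
        for name, blob in sorted(artifacts.items())
    ]
    return json.dumps({"artifacts": entries}, indent=2, sort_keys=True)


def mismatches(manifest_json: str, artifacts: dict[str, bytes]) -> list[str]:
    """Names recorded in the manifest that are missing or have changed."""
    recorded = {
        entry["name"]: entry["sha256"]
        for entry in json.loads(manifest_json)["artifacts"]
    }
    return sorted(
        name
        for name, digest in recorded.items()
        if name not in artifacts or sha256_hex(artifacts[name]) != digest
    )


# Hypothetical artifacts standing in for a prompt, a script, and a log.
run = {
    "prompt.txt": b"Measure the branching fraction of channel X.",
    "job.py": b"print('generated analysis script')",
    "run.log": b"step 1 ok\nstep 2 ok\n",
}
manifest = build_manifest(run)
tampered = dict(run, **{"prompt.txt": b"edited after the fact"})
print(mismatches(manifest, run), mismatches(manifest, tampered))
```

A manifest like this shows only that files are unchanged since it was written; it cannot by itself show that no human guidance entered before the run, which is the harder part of the referee's concern.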

Circularity Check

0 steps flagged

No circularity: empirical validation of agentic system


The paper describes an LLM-based multi-agent architecture (Dr.Sai) that translates natural-language tasks into BESIII workflows and validates it by re-measuring ten J/ψ branching fractions, reporting agreement with PDG benchmarks. No derivation chain, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. The central claim is an empirical demonstration of autonomous execution in a real computing environment; any self-citations (if present) support tool integration or prior LLM capabilities but are not load-bearing for the reported results. The validation is falsifiable against external benchmarks and does not rely on self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work introduces no new physical parameters or mathematical axioms. It relies on the domain assumption that current LLMs plus tool integration can perform reliable scientific code generation and execution.

axioms (1)

Domain assumption: LLM agents guided by multi-agent orchestration can reliably translate natural-language physics tasks into correct, executable code for domain tools such as ROOT and BOSS.
This assumption underpins the claim of autonomous analysis without manual coding.

invented entities (1)

Dr.Sai multi-agent system (no independent evidence)
purpose: To serve as an autonomous AI partner that executes full HEP analysis workflows
The system itself is the primary contribution and is validated empirically on real data.


