Recognition: unknown
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
Pith reviewed 2026-05-13 22:29 UTC · model grok-4.3
The pith
A multi-agent AI system with built-in verifiers turns unreliable LLM outputs into correct quantum many-body simulations on a new benchmark of 100 real research tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhysVEC seamlessly integrates programming and scientific verifiers to guarantee coding correctness and principle-based physical validity, yields interpretable evidence and error correction at each step, significantly outperforms existing LLM baselines across various scenarios in QMP-Bench, and exhibits favorable inference-time scaling that transforms unreliable AI generations into accurate physical reproductions.
What carries the argument
The PhysVEC multi-agent framework that couples programming verifiers for code checks with scientific verifiers for physical-principle checks.
Load-bearing premise
The verifiers can catch and fix both coding mistakes and physical-law violations on every task without missing real errors or introducing new systematic mistakes of their own.
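To make this premise concrete, below is a minimal sketch of a generate-verify-correct loop of the kind the paper describes. The interfaces (generate_solution, run_program_checks, run_physics_checks, Verdict) are hypothetical stand-ins for illustration, not PhysVEC's actual API.

```python
# Hypothetical sketch of a dual-verifier correction loop in the spirit of
# PhysVEC; all function names and interfaces are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    evidence: str  # interpretable evidence the verifier attaches to its ruling

def generate_solution(task: str, feedback: list[str]) -> str:
    """Stand-in for the LLM generator agent (returns simulation code)."""
    raise NotImplementedError

def run_program_checks(code: str) -> Verdict:
    """Stand-in for the programming verifier: does the code run, converge, etc."""
    raise NotImplementedError

def run_physics_checks(code: str) -> Verdict:
    """Stand-in for the scientific verifier: symmetries, conservation laws,
    agreement with known analytic limits."""
    raise NotImplementedError

def solve(task: str, max_rounds: int = 5) -> str | None:
    feedback: list[str] = []
    for _ in range(max_rounds):
        code = generate_solution(task, feedback)
        for verifier in (run_program_checks, run_physics_checks):
            verdict = verifier(code)
            if not verdict.passed:
                # Feed the verifier's evidence back into the next round.
                feedback.append(verdict.evidence)
                break
        else:
            return code  # both verifiers accepted
    return None  # unresolved within the inference-time budget
```

The premise is that every real error trips one of the two verifiers before the loop exits; if either verifier silently passes a flawed solution, the guarantee fails.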
What would settle it
A new set of quantum many-body tasks where the verifiers accept a simulation that violates a conservation law or produces results inconsistent with known analytic limits.
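As a concrete instance of the kind of check at stake, the sketch below (a construction for illustration, not taken from the paper) evolves a small Heisenberg chain by exact diagonalization and tests that total magnetization is conserved; a scientific verifier that accepted a simulation failing this test would falsify the load-bearing premise.

```python
# A conservation-law check of the kind a scientific verifier must not miss:
# the Heisenberg Hamiltonian commutes with total S_z, so <S_z_total> must be
# constant under time evolution. Minimal exact-diagonalization sketch.
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]]) / 2

def site_op(op, i, n):
    """Embed a single-site operator at site i of an n-site chain."""
    mats = [np.eye(2)] * n
    mats[i] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

n = 6  # small open Heisenberg chain
H = sum(site_op(s, i, n) @ site_op(s, i + 1, n)
        for i in range(n - 1) for s in (sx, sy, sz))
Sz_tot = sum(site_op(sz, i, n) for i in range(n))

rng = np.random.default_rng(0)
psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi /= np.linalg.norm(psi)

U = expm(-1j * H * 0.1)  # one time step, dt = 0.1
trace = []
for _ in range(100):
    psi = U @ psi
    trace.append((psi.conj() @ Sz_tot @ psi).real)

# Verifier-style assertion: conservation up to numerical precision.
assert np.ptp(trace) < 1e-10, "total S_z not conserved"
```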
read the original abstract
While large language models (LLMs) promise to revolutionize automated scientific discovery, their application in rigorous real-world physical research is stalled by two critical barriers: a lack of realistic evaluation benchmarks and systemic LLM hallucinations. Here, we address both problems. We introduce QMP-Bench, a pioneering end-to-end research-level benchmark in quantum many-body simulation consisting of $100$ tasks extracted from $21$ high-impact prestigious journals, presenting a challenge even for current frontier LLMs. To establish a paradigm for reliable and transparent AI physicists, we present PhysVEC, a multi-agent framework that enforces self-verifiable and error correction in AI research. PhysVEC seamlessly integrates programming and scientific verifiers to guarantee coding correctness and principle-based physical validity, yielding interpretable evidence and error correction at each step. PhysVEC significantly outperforms existing LLM baselines on various scenarios in QMP-Bench and presents a favorable inference-time scaling, successfully transforming unreliable AI generations into accurate physical reproductions, paving a robust and trustworthy path towards future automated scientific discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces QMP-Bench, an end-to-end benchmark of 100 quantum many-body simulation tasks extracted from 21 high-impact journals, and proposes PhysVEC, a multi-agent framework that integrates programming verifiers and scientific verifiers to enforce coding correctness and principle-based physical validity. It claims that PhysVEC significantly outperforms existing LLM baselines across scenarios in QMP-Bench, exhibits favorable inference-time scaling, and transforms unreliable generations into accurate physical reproductions.
Significance. If the performance claims and verifier robustness hold, the work supplies a concrete engineering contribution toward reliable AI-assisted discovery in physics by addressing hallucinations through explicit, interpretable verification steps. The benchmark itself is a useful addition for the field, as it moves beyond synthetic tasks to journal-derived problems; the multi-agent design with dual verifiers offers a reproducible template that could be extended to other domains.
major comments (2)
- [Abstract and evaluation section] The central performance claim (outperformance on QMP-Bench with favorable scaling) is load-bearing yet unsupported by any quantitative metrics, error bars, per-task breakdown, or ablation results in the abstract or evaluation description; without these, the assertion that PhysVEC 'significantly outperforms' baselines cannot be assessed.
- [Scientific verifiers description] The self-correction guarantee rests on the scientific verifiers catching violations of physical principles (e.g., broken symmetries, incorrect conservation laws, or invalid approximations in many-body Hamiltonians) across all 100 tasks; the manuscript provides no concrete specification of the principle list, no account of how LLM-driven verifiers avoid false negatives on subtle inconsistencies that appear only in long-time dynamics or in specific parameter regimes, and no validation against ground-truth solutions where available.
minor comments (2)
- [Framework architecture] Clarify the exact interaction protocol among agents and the decision thresholds used by the verifiers to trigger correction loops.
- [Evaluation] Add a table or figure summarizing baseline models, exact QMP-Bench task categories, and success criteria to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive assessment of the potential significance of QMP-Bench and PhysVEC. We address the major comments point by point below. We will incorporate revisions to strengthen the presentation of quantitative results and the specification of the scientific verifiers.
read point-by-point responses
- Referee: [Abstract and evaluation section] The central performance claim (outperformance on QMP-Bench with favorable scaling) is load-bearing yet unsupported by any quantitative metrics, error bars, per-task breakdown, or ablation results in the abstract or evaluation description; without these, the assertion that PhysVEC 'significantly outperforms' baselines cannot be assessed.
  Authors: We agree that the abstract would benefit from explicit quantitative support. The full evaluation section already contains the requested metrics, error bars, per-task breakdowns, and ablation studies showing PhysVEC's outperformance and scaling behavior. We will revise the abstract to include key numerical results (e.g., success rates and scaling trends) and add explicit cross-references in the evaluation description to these detailed results. revision: yes
- Referee: [Scientific verifiers description] The self-correction guarantee rests on the scientific verifiers catching violations of physical principles (e.g., broken symmetries, incorrect conservation laws, or invalid approximations in many-body Hamiltonians) across all 100 tasks; the manuscript provides no concrete specification of the principle list, no account of how LLM-driven verifiers avoid false negatives on subtle inconsistencies that appear only in long-time dynamics or in specific parameter regimes, and no validation against ground-truth solutions where available.
  Authors: We acknowledge the need for greater transparency here. We will expand the scientific verifiers section to provide an explicit enumerated list of enforced physical principles, describe the prompting and checking procedures used by the LLM-driven verifiers to detect violations (including checks for long-time dynamics and parameter-specific regimes), and report validation results against available ground-truth solutions for the subset of tasks where they exist. revision: yes
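For illustration of what such an enumerated principle list could look like in practice, here is a hypothetical sketch of a registry of principle checks that each return machine-checkable evidence; the principle names, signatures, and tolerances are assumptions made for this example, not the paper's actual specification.

```python
# Hypothetical sketch of an enumerated principle registry for a scientific
# verifier; the principles and signatures are illustrative, not PhysVEC's spec.
from typing import Callable
import numpy as np

# A check receives simulation observables and returns (passed, evidence).
Check = Callable[[dict], tuple[bool, str]]

def energy_conservation(obs: dict, tol: float = 1e-8) -> tuple[bool, str]:
    drift = float(np.ptp(obs["energy_trace"]))
    return drift < tol, f"energy drift over run: {drift:.2e} (tol {tol:.0e})"

def hermiticity(obs: dict, tol: float = 1e-12) -> tuple[bool, str]:
    H = obs["hamiltonian"]
    err = float(np.max(np.abs(H - H.conj().T)))
    return err < tol, f"max |H - H^dagger| = {err:.2e}"

PRINCIPLES: dict[str, Check] = {
    "energy_conservation": energy_conservation,
    "hermitian_hamiltonian": hermiticity,
    # e.g. "u1_symmetry", "analytic_limit_agreement", "detailed_balance", ...
}

def scientific_verifier(obs: dict) -> list[str]:
    """Run every registered principle; return evidence strings for failures."""
    return [f"{name}: {evidence}"
            for name, check in PRINCIPLES.items()
            for passed, evidence in [check(obs)]
            if not passed]
```

A programming verifier could be registered behind the same interface, which would make the reported per-step evidence uniform across both verifier families.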
Circularity Check
No circularity: empirical benchmark and framework evaluation
full rationale
The paper introduces QMP-Bench (100 tasks from 21 external journals) and PhysVEC (multi-agent verifiers for code and physics principles) as an engineering system. Claims of outperformance are direct empirical comparisons to LLM baselines on this independently defined benchmark, with no equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central result to its inputs by construction. The derivation chain consists of framework description plus external evaluation and is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Programming and scientific verifiers can be integrated to guarantee both coding correctness and principle-based physical validity.
invented entities (1)
- PhysVEC multi-agent framework (no independent evidence)
Forward citations
Cited by 1 Pith paper
- The Agentification of Scientific Research: A Physicist's Perspective. AI will evolve from a research tool into a collaborator, fundamentally reshaping scientific collaboration, discovery, publishing, and evaluation while requiring continuous learning and idea diversity for original contributions.
Reference graph
Works this paper leans on
- [1] Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., Ha, D.: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (2024). https://arxiv.org/abs/2408.06292
- [2] Tang, J., Xia, L., Li, Z., Huang, C.: AI-Researcher: Autonomous scientific innovation. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025). https://openreview.net/forum?id=kQWyOYUAC4
- [3] Li, Z., Li, Z., Guo, Z., Ren, X., Huang, C.: DeepCode: Open Agentic Coding (2025). https://arxiv.org/abs/2512.07921
- [4] Szymanski, N.J., Rendy, B., Fei, Y., Kumar, R.E., He, T., Milsted, D., McDermott, M.J., Gallant, M., Cubuk, E.D., Merchant, A., Kim, H., Jain, A., Bartel, C.J., Persson, K., Zeng, Y., Ceder, G.: An autonomous laboratory for the accelerated synthesis of inorganic materials. Nature 624(7990), 86–91 (2023). https://doi.org/10.1038/s41586-023-06734-w
- [5] Mandal, I., Soni, J., Zaki, M., Smedskjaer, M.M., Wondraczek, K., Wondraczek, L., Gosvami, N.N., Krishnan, N.M.A.: Evaluating large language model agents for automation of atomic force microscopy. Nature Communications 16(1), 9104 (2025). https://doi.org/10.1038/s41467-025-64105-7
- [6] Desai, S., Addamane, S., Tsao, J.Y., Brener, I., Dingreville, R., Iyer, P.P.: Self-driving lab discovers principles for steering spontaneous emission beyond conventional Fourier optics. Nature Communications 17(1), 204 (2025). https://doi.org/10.1038/s41467-025-66916-0
- [7] Cao, S., Zhang, Z., Alghadeer, M., Fasciati, S.D., Piscitelli, M., Bakr, M., Leek, P., Aspuru-Guzik, A.: Automating quantum computing laboratory experiments with an agent-based AI framework. Patterns 6(10), 101372 (2025). https://doi.org/10.1016/j.patter.2025.101372
- [8] Sha, R., Wang, B., Yang, J., Ma, X., Wu, C., Yan, L., Zhou, C., Liu, J., Wang, G., Yan, S., Zhu, L.: LLM-based Multi-Agent Copilot for Quantum Sensor (2025). https://arxiv.org/abs/2508.05421
- [9] Peng, Z.-Y., Yuan, H.-S., Lai, Q., Jiang, J.-Q., Ye, G., Zhang, J., Piao, Y.-S.: DeepInflation: an AI agent for research and model discovery of inflation (2026). https://arxiv.org/abs/2601.14288
- [10] Song, Z., Zhou, Q., Ren, C., Ling, C., Ju, M., Wang, J.: LLM-Feynman: Leveraging Large Language Models for Universal Scientific Formula and Theory Discovery (2025). https://arxiv.org/abs/2503.06512
- [11] Campbell, C., Chen, H.M., Luk, W., Fan, H.: Enhancing LLM-based Quantum Code Generation with Multi-Agent Optimization and Quantum Error Correction (2025). https://arxiv.org/abs/2504.14557
- [12] Yang, R., Wang, Z., Gu, Y., Liang, Y., Li, T.: QCircuitBench: A large-scale dataset for benchmarking quantum algorithm design. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025). https://openreview.net/forum?id=NkiLldW2bi
- [13] Gustin, I., Calderón, L.M., Pérez-Sánchez, J.B., Gonthier, J.F., Nakamura, Y., Panicker, K., Ramprasad, M., Zhang, Z., Zou, Y., Bernales, V., Aspuru-Guzik, A.: El Agente Cuántico: Automating quantum simulations (2026). https://arxiv.org/abs/2512.18847
- [14] Li, W., Ren, J., Cheng, L., Gong, C.: Autonomous Quantum Simulation through Large Language Model Agents (2026). https://arxiv.org/abs/2601.10194
- [15] Miao, T., Dai, J., Liu, J., Tan, J., Zhang, M., Jin, W., Du, Y., Jin, T., Pang, X., Liu, Z., Guo, T., Zhang, Z., Huang, Y., Chen, S., Ye, R., Zhang, Y., Zhang, L., Chen, K., Wang, W., E, W., Chen, S.: PhysMaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research (2025). https://arxiv.org/abs/2512.19799
- [16] Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022). https://openreview.net/forum?id=_VjQlMeSB_J
- [17] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of Thoughts: Deliberate problem solving with large language models. In: Advances in Neural Information Processing Systems, vol. 36, pp. 11809–11822 (2023). https://openreview.net/forum?id=1hflw0tjM8
- [18] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z.F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
- [19] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474 (2020). https://openreview.net/forum?id=KnVuuSvtIm1
- [20] Asai, A., Wu, Z., Wang, Y., Sil, A., Hajishirzi, H.: Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=hSyW5go0v8
- [21] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=WE_vluYUL-X
- [22] Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., Neubig, G.: PAL: Program-aided Language Models (2023). https://arxiv.org/abs/2211.10435
- [23] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative refinement with self-feedback. In: NeurIPS (2023). https://openreview.net/forum?id=43rnkOcpI1
- [24] Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O.: SWE-agent: Agent-computer interfaces enable automated software engineering. In: NeurIPS (2024). https://openreview.net/forum?id=30hggYAY0Z
- [25] Hou, X., Zhao, Y., Wang, S., Wang, H.: Model Context Protocol (MCP): Landscape, security threats, and future research directions. CoRR abs/2503.23278 (2025)
- [26] Li, Y., Huang, Y., Wang, T., Fan, C., Cai, X., Hu, S., Liu, X., Shi, C., Xu, M., Wang, Z., Wang, Y., Jin, X., Zhang, T., Zhang, L., Wang, L., Deng, Y., Zhang, P., Sun, W., Li, X., E, W., Zhang, L., Yao, Z., Chen, K.: Inverse knowledge search over verifiable reasoning: Synthesizing a scientific encyclopedia from a long chains-of-thought knowledge base. CoRR
- [27] Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A.R., Zhang, S., Sun, Y., Wang, W.: SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models (2024). https://arxiv.org/abs/2307.10635
- [28] Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: GPQA: A graduate-level Google-proof Q&A benchmark. In: First Conference on Language Modeling (2024). https://openreview.net/forum?id=Ti67584b98
- [29] Qiu, J., Shi, J., Juan, X., Zhao, Z., Geng, J., Liu, S., Wang, H., Wu, S., Wang, M.: Physics Supernova: AI agent matches elite gold medalists at IPhO 2025. CoRR abs/2509.01659 (2025)
- [30] Zhu, M., Tian, M., Yang, X., Zhou, T., Yuan, L., Zhu, P., Chertkov, E., Liu, S., Du, Y., Ji, Z., Das, I., Cao, J., Du, Y., Yu, J., Wu, P., He, J., Su, Y., Jiang, Y., Zhang, Y., Liu, C., Huang, Z.-M., Jia, W., Wang, Y., Jafarpour, F., Zhao, Y., Chen, X., Shelton, J., Young, A.W., Bartolotta, J., Xu, W., Sun, Y., Chu, A., Colussi, V., Akers, C., Brooks, N., et al.: Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark (2025)
- [31] Tian, M., Gao, L., Zhang, S.D., Chen, X., Fan, C., Guo, X., Haas, R., Ji, P., Krongchon, K., Li, Y., Liu, S., Luo, D., Ma, Y., Tong, H., Trinh, K., Tian, C., Wang, Z., Wu, B., Xiong, Y., Yin, S., Zhu, M., Lieret, K., Lu, Y., Liu, G., Du, Y., Tao, T., Press, O., Callan, J., Huerta, E.A., Peng, H.: SciCode: A research coding benchmark curated by scientists
- [32] Chung, D.J.H., Gao, Z., Kvasiuk, Y., Li, T., Münchmeyer, M., Rudolph, M., Sala, F., Tadepalli, S.C.: Theoretical Physics Benchmark (TPBench) – a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics (2025). https://arxiv.org/abs/2502.15815
- [33] Weng, Y., Zhu, M., Xie, Q., Sun, Q., Lin, Z., Liu, S., Zhang, Y.: DeepScientist: Advancing frontier-pushing scientific findings progressively. In: The Fourteenth International Conference on Learning Representations (2026). https://openreview.net/forum?id=cZFgsLq8Gs
- [34] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023). https://openreview.net/forum?id=uccHPGDlao
- [35] Starace, G., Jaffe, O., Sherburn, D., Aung, J., Chan, J.S., Maksin, L., Dias, R., Mays, E., Kinsella, B., Thompson, W., Heidecke, J., Glaese, A., Patwardhan, T.: PaperBench: Evaluating AI's Ability to Replicate AI Research (2025). https://arxiv.org/abs/2504.01848
- [36] Fishman, M., White, S.R., Stoudenmire, E.M.: The ITensor Software Library for Tensor Network Calculations. SciPost Phys. Codebases, 4 (2022). https://doi.org/10.21468/SciPostPhysCodeb.4
- [37] Carleo, G., Choo, K., Hofmann, D., Smith, J.E.T., Westerhout, T., Alet, F., Davis, E.J., Efthymiou, S., Glasser, I., Lin, S.-H., Mauri, M., Mazzola, G., Mendl, C.B., van Nieuwenburg, E., O'Reilly, O., Théveniaut, H., Torlai, G., Vicentini, F., Wietek, A.: NetKet: A machine learning toolkit for many-body quantum systems. SoftwareX 10, 100311 (2019). https://doi.org/10.1016/j.softx.2019.100311
- [38] Vicentini, F., Hofmann, D., Szabó, A., Wu, D., Roth, C., Giuliani, C., Pescia, G., Nys, J., Vargas-Calderón, V., Astrakhantsev, N., Carleo, G.: NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems. SciPost Phys. Codebases, 7 (2022). https://doi.org/10.21468/SciPostPhysCodeb.7
- [39] Aleksandrowicz, G., Alexander, T., Barkoutsos, P., Bello, L., Ben-Haim, Y., Bucher, D., Cabrera-Hernández, F.J., Carballo-Franquis, J., Chen, A., Chen, C.-F., Chow, J.M., Córcoles-Gonzales, A.D., Cross, A.J., Cross, A., Cruz-Benito, J., Culver, C., González, S.D.L.P., Torre, E.D.L., Ding, D., Dumitrescu, E., Duran, I., Eendebak, P., Everitt, M., et al.: Qiskit: An Open-source Framework for Quantum Computing (2019). https://doi.org/10.5281/zenodo.2562111
- [40] Neese, F.: The ORCA program system. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2(1), 73–78 (2012). https://doi.org/10.1002/wcms.81
- [41] Snell, C.V., Lee, J., Xu, K., Kumar, A.: Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=4FWAwZtd2n
- [42] Tezuka, M., Ueda, M.: Density-matrix renormalization group study of trapped imbalanced Fermi condensates. Phys. Rev. Lett. 100, 110403 (2008). https://doi.org/10.1103/PhysRevLett.100.110403