Recognition: unknown
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
Pith reviewed 2026-05-13 22:29 UTC · model grok-4.3
The pith
A multi-agent AI system with built-in verifiers turns unreliable LLM outputs into correct quantum many-body simulations on a new benchmark of 100 real research tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhysVEC seamlessly integrates programming and scientific verifiers to guarantee coding correctness and principle-based physical validity, yields interpretable evidence and error correction at each step, significantly outperforms existing LLM baselines across various scenarios in QMP-Bench, and exhibits favorable inference-time scaling that transforms unreliable AI generations into accurate physical reproductions.
What carries the argument
The PhysVEC multi-agent framework that couples programming verifiers for code checks with scientific verifiers for physical-principle checks.
Load-bearing premise
The verifiers can catch and fix both coding mistakes and physical-law violations on every task without missing real errors or introducing new systematic mistakes of their own.
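To make this premise concrete, below is a minimal sketch of a generate-verify-correct loop of the kind the paper describes. The interfaces (generate_solution, run_program_checks, run_physics_checks, Verdict) are hypothetical stand-ins for illustration, not PhysVEC's actual API.

```python
# Hypothetical sketch of a dual-verifier correction loop in the spirit of
# PhysVEC; all function names and interfaces are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    evidence: str  # interpretable evidence the verifier attaches to its ruling

def generate_solution(task: str, feedback: list[str]) -> str:
    """Stand-in for the LLM generator agent (returns simulation code)."""
    raise NotImplementedError

def run_program_checks(code: str) -> Verdict:
    """Stand-in for the programming verifier: does the code run, converge, etc."""
    raise NotImplementedError

def run_physics_checks(code: str) -> Verdict:
    """Stand-in for the scientific verifier: symmetries, conservation laws,
    agreement with known analytic limits."""
    raise NotImplementedError

def solve(task: str, max_rounds: int = 5) -> str | None:
    feedback: list[str] = []
    for _ in range(max_rounds):
        code = generate_solution(task, feedback)
        for verifier in (run_program_checks, run_physics_checks):
            verdict = verifier(code)
            if not verdict.passed:
                # Feed the verifier's evidence back into the next round.
                feedback.append(verdict.evidence)
                break
        else:
            return code  # both verifiers accepted
    return None  # unresolved within the inference-time budget
```

The premise is that every real error trips one of the two verifiers before the loop exits; if either verifier silently passes a flawed solution, the guarantee fails.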
What would settle it
A new set of quantum many-body tasks where the verifiers accept a simulation that violates a conservation law or produces results inconsistent with known analytic limits.
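As a concrete instance of the kind of check at stake, the sketch below (a construction for illustration, not taken from the paper) evolves a small Heisenberg chain by exact diagonalization and tests that total magnetization is conserved; a scientific verifier that accepted a simulation failing this test would falsify the load-bearing premise.

```python
# A conservation-law check of the kind a scientific verifier must not miss:
# the Heisenberg Hamiltonian commutes with total S_z, so <S_z_total> must be
# constant under time evolution. Minimal exact-diagonalization sketch.
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]]) / 2

def site_op(op, i, n):
    """Embed a single-site operator at site i of an n-site chain."""
    mats = [np.eye(2)] * n
    mats[i] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

n = 6  # small open Heisenberg chain
H = sum(site_op(s, i, n) @ site_op(s, i + 1, n)
        for i in range(n - 1) for s in (sx, sy, sz))
Sz_tot = sum(site_op(sz, i, n) for i in range(n))

rng = np.random.default_rng(0)
psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi /= np.linalg.norm(psi)

U = expm(-1j * H * 0.1)  # one time step, dt = 0.1
trace = []
for _ in range(100):
    psi = U @ psi
    trace.append((psi.conj() @ Sz_tot @ psi).real)

# Verifier-style assertion: conservation up to numerical precision.
assert np.ptp(trace) < 1e-10, "total S_z not conserved"
```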
read the original abstract
While large language models (LLMs) promise to revolutionize automated scientific discovery, their application in rigorous real-world physical research is stalled by two critical barriers: a lack of realistic evaluation benchmarks and systemic LLM hallucinations. Here, we address both problems. We introduce QMP-Bench, a pioneering end-to-end research-level benchmark in quantum many-body simulation consisting of $100$ tasks extracted from $21$ high-impact prestigious journals, presenting a challenge even for current frontier LLMs. To establish a paradigm for reliable and transparent AI physicists, we present PhysVEC, a multi-agent framework that enforces self-verifiable and error correction in AI research. PhysVEC seamlessly integrates programming and scientific verifiers to guarantee coding correctness and principle-based physical validity, yielding interpretable evidence and error correction at each step. PhysVEC significantly outperforms existing LLM baselines on various scenarios in QMP-Bench and presents a favorable inference-time scaling, successfully transforming unreliable AI generations into accurate physical reproductions, paving a robust and trustworthy path towards future automated scientific discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces QMP-Bench, an end-to-end benchmark of 100 quantum many-body simulation tasks extracted from 21 high-impact journals, and proposes PhysVEC, a multi-agent framework that integrates programming verifiers and scientific verifiers to enforce coding correctness and principle-based physical validity. It claims that PhysVEC significantly outperforms existing LLM baselines across scenarios in QMP-Bench, exhibits favorable inference-time scaling, and transforms unreliable generations into accurate physical reproductions.
Significance. If the performance claims and verifier robustness hold, the work supplies a concrete engineering contribution toward reliable AI-assisted discovery in physics by addressing hallucinations through explicit, interpretable verification steps. The benchmark itself is a useful addition for the field, as it moves beyond synthetic tasks to journal-derived problems; the multi-agent design with dual verifiers offers a reproducible template that could be extended to other domains.
major comments (2)
- [Abstract and evaluation section] The central performance claim (outperformance on QMP-Bench with favorable scaling) is load-bearing yet unsupported by any quantitative metrics, error bars, per-task breakdown, or ablation results in the abstract or evaluation description; without these, the assertion that PhysVEC 'significantly outperforms' baselines cannot be assessed.
- [Scientific verifiers description] The self-correction guarantee rests on the scientific verifiers catching violations of physical principles (e.g., broken symmetries, incorrect conservation laws, or invalid approximations in many-body Hamiltonians) across all 100 tasks; the manuscript provides no concrete specification of the principle list, no account of how LLM-driven verifiers avoid false negatives on subtle inconsistencies that appear only in long-time dynamics or in specific parameter regimes, and no validation against ground-truth solutions where available.
minor comments (2)
- [Framework architecture] Clarify the exact interaction protocol among agents and the decision thresholds used by the verifiers to trigger correction loops.
- [Evaluation] Add a table or figure summarizing baseline models, exact QMP-Bench task categories, and success criteria to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and positive assessment of the potential significance of QMP-Bench and PhysVEC. We address the major comments point by point below. We will incorporate revisions to strengthen the presentation of quantitative results and the specification of the scientific verifiers.
read point-by-point responses
- Referee: [Abstract and evaluation section] The central performance claim (outperformance on QMP-Bench with favorable scaling) is load-bearing yet unsupported by any quantitative metrics, error bars, per-task breakdown, or ablation results in the abstract or evaluation description; without these, the assertion that PhysVEC 'significantly outperforms' baselines cannot be assessed.
  Authors: We agree that the abstract would benefit from explicit quantitative support. The full evaluation section already contains the requested metrics, error bars, per-task breakdowns, and ablation studies showing PhysVEC's outperformance and scaling behavior. We will revise the abstract to include key numerical results (e.g., success rates and scaling trends) and add explicit cross-references in the evaluation description to these detailed results. revision: yes
- Referee: [Scientific verifiers description] The self-correction guarantee rests on the scientific verifiers catching violations of physical principles (e.g., broken symmetries, incorrect conservation laws, or invalid approximations in many-body Hamiltonians) across all 100 tasks; the manuscript provides no concrete specification of the principle list, no account of how LLM-driven verifiers avoid false negatives on subtle inconsistencies that appear only in long-time dynamics or in specific parameter regimes, and no validation against ground-truth solutions where available.
  Authors: We acknowledge the need for greater transparency here. We will expand the scientific verifiers section to provide an explicit enumerated list of enforced physical principles, describe the prompting and checking procedures used by the LLM-driven verifiers to detect violations (including checks for long-time dynamics and parameter-specific regimes), and report validation results against available ground-truth solutions for the subset of tasks where they exist. revision: yes
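For illustration of what such an enumerated principle list could look like in practice, here is a hypothetical sketch of a registry of principle checks that each return machine-checkable evidence; the principle names, signatures, and tolerances are assumptions made for this example, not the paper's actual specification.

```python
# Hypothetical sketch of an enumerated principle registry for a scientific
# verifier; the principles and signatures are illustrative, not PhysVEC's spec.
from typing import Callable
import numpy as np

# A check receives simulation observables and returns (passed, evidence).
Check = Callable[[dict], tuple[bool, str]]

def energy_conservation(obs: dict, tol: float = 1e-8) -> tuple[bool, str]:
    drift = float(np.ptp(obs["energy_trace"]))
    return drift < tol, f"energy drift over run: {drift:.2e} (tol {tol:.0e})"

def hermiticity(obs: dict, tol: float = 1e-12) -> tuple[bool, str]:
    H = obs["hamiltonian"]
    err = float(np.max(np.abs(H - H.conj().T)))
    return err < tol, f"max |H - H^dagger| = {err:.2e}"

PRINCIPLES: dict[str, Check] = {
    "energy_conservation": energy_conservation,
    "hermitian_hamiltonian": hermiticity,
    # e.g. "u1_symmetry", "analytic_limit_agreement", "detailed_balance", ...
}

def scientific_verifier(obs: dict) -> list[str]:
    """Run every registered principle; return evidence strings for failures."""
    return [f"{name}: {evidence}"
            for name, check in PRINCIPLES.items()
            for passed, evidence in [check(obs)]
            if not passed]
```

A programming verifier could be registered behind the same interface, which would make the reported per-step evidence uniform across both verifier families.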
Circularity Check
No circularity: empirical benchmark and framework evaluation
full rationale
The paper introduces QMP-Bench (100 tasks from 21 external journals) and PhysVEC (multi-agent verifiers for code and physics principles) as an engineering system. Claims of outperformance are direct empirical comparisons to LLM baselines on this independently defined benchmark, with no equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central result to its inputs by construction. The derivation chain consists of framework description plus external evaluation and is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Programming and scientific verifiers can be integrated to guarantee both coding correctness and principle-based physical validity.
invented entities (1)
- PhysVEC multi-agent framework (no independent evidence)
Forward citations
Cited by 1 Pith paper
- The Agentification of Scientific Research: A Physicist's Perspective. AI will evolve from a research tool into a collaborator, fundamentally reshaping scientific collaboration, discovery, publishing, and evaluation while requiring continuous learning and idea diversity for original contributions.
Reference graph
Works this paper leans on
- [1] Lu, C., Lu, C., Lange, R.T., Foerster, J., Clune, J., Ha, D.: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (2024). https://arxiv.org/abs/2408.06292
- [2] Tang, J., Xia, L., Li, Z., Huang, C.: AI-Researcher: Autonomous scientific innovation. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025). https://openreview.net/forum?id=kQWyOYUAC4
- [3] Li, Z., Li, Z., Guo, Z., Ren, X., Huang, C.: DeepCode: Open Agentic Coding (2025). https://arxiv.org/abs/2512.07921
- [4] Szymanski, N.J., Rendy, B., Fei, Y., Kumar, R.E., He, T., Milsted, D., McDermott, M.J., Gallant, M., Cubuk, E.D., Merchant, A., Kim, H., Jain, A., Bartel, C.J., Persson, K., Zeng, Y., Ceder, G.: An autonomous laboratory for the accelerated synthesis of inorganic materials. Nature 624(7990), 86–91 (2023). https://doi.org/10.1038/s41586-023-06734-w
- [5] Mandal, I., Soni, J., Zaki, M., Smedskjaer, M.M., Wondraczek, K., Wondraczek, L., Gosvami, N.N., Krishnan, N.M.A.: Evaluating large language model agents for automation of atomic force microscopy. Nature Communications 16(1), 9104 (2025). https://doi.org/10.1038/s41467-025-64105-7
- [6] Desai, S., Addamane, S., Tsao, J.Y., Brener, I., Dingreville, R., Iyer, P.P.: Self-driving lab discovers principles for steering spontaneous emission beyond conventional Fourier optics. Nature Communications 17(1), 204 (2025). https://doi.org/10.1038/s41467-025-66916-0
- [7] Cao, S., Zhang, Z., Alghadeer, M., Fasciati, S.D., Piscitelli, M., Bakr, M., Leek, P., Aspuru-Guzik, A.: Automating quantum computing laboratory experiments with an agent-based AI framework. Patterns 6(10), 101372 (2025). https://doi.org/10.1016/j.patter.2025.101372
- [8] Sha, R., Wang, B., Yang, J., Ma, X., Wu, C., Yan, L., Zhou, C., Liu, J., Wang, G., Yan, S., Zhu, L.: LLM-based Multi-Agent Copilot for Quantum Sensor (2025). https://arxiv.org/abs/2508.05421
- [9] Peng, Z.-Y., Yuan, H.-S., Lai, Q., Jiang, J.-Q., Ye, G., Zhang, J., Piao, Y.-S.: DeepInflation: an AI agent for research and model discovery of inflation (2026). https://arxiv.org/abs/2601.14288
- [10] Song, Z., Zhou, Q., Ren, C., Ling, C., Ju, M., Wang, J.: LLM-Feynman: Leveraging Large Language Models for Universal Scientific Formula and Theory Discovery (2025). https://arxiv.org/abs/2503.06512
- [11] Campbell, C., Chen, H.M., Luk, W., Fan, H.: Enhancing LLM-based Quantum Code Generation with Multi-Agent Optimization and Quantum Error Correction (2025). https://arxiv.org/abs/2504.14557
- [12] Yang, R., Wang, Z., Gu, Y., Liang, Y., Li, T.: QCircuitBench: A large-scale dataset for benchmarking quantum algorithm design. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025). https://openreview.net/forum?id=NkiLldW2bi
- [13] Gustin, I., Calderón, L.M., Pérez-Sánchez, J.B., Gonthier, J.F., Nakamura, Y., Panicker, K., Ramprasad, M., Zhang, Z., Zou, Y., Bernales, V., Aspuru-Guzik, A.: El Agente Cuántico: Automating quantum simulations (2026). https://arxiv.org/abs/2512.18847
- [14] Li, W., Ren, J., Cheng, L., Gong, C.: Autonomous Quantum Simulation through Large Language Model Agents (2026). https://arxiv.org/abs/2601.10194
- [15] Miao, T., Dai, J., Liu, J., Tan, J., Zhang, M., Jin, W., Du, Y., Jin, T., Pang, X., Liu, Z., Guo, T., Zhang, Z., Huang, Y., Chen, S., Ye, R., Zhang, Y., Zhang, L., Chen, K., Wang, W., E, W., Chen, S.: PhysMaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research (2025). https://arxiv.org/abs/2512.19799
- [16] Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022). https://openreview.net/forum?id=_VjQlMeSB_J
- [17] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of Thoughts: Deliberate problem solving with large language models. In: Advances in Neural Information Processing Systems, vol. 36, pp. 11809–11822 (2023). https://openreview.net/forum?id=1hflw0tjM8
- [18] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z.F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
- [19] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474 (2020). https://openreview.net/forum?id=KnVuuSvtIm1
- [20] Asai, A., Wu, Z., Wang, Y., Sil, A., Hajishirzi, H.: Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In: The Twelfth International Conference on Learning Representations (2024). https://openreview.net/forum?id=hSyW5go0v8
- [21] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: The Eleventh International Conference on Learning Representations (2023). https://openreview.net/forum?id=WE_vluYUL-X
- [22] Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., Neubig, G.: PAL: Program-aided Language Models (2023). https://arxiv.org/abs/2211.10435
- [23] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-Refine: Iterative refinement with self-feedback. In: NeurIPS (2023). https://openreview.net/forum?id=43rnkOcpI1
- [24] Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O.: SWE-agent: Agent-computer interfaces enable automated software engineering. In: NeurIPS (2024). https://openreview.net/forum?id=30hggYAY0Z
- [25] Hou, X., Zhao, Y., Wang, S., Wang, H.: Model Context Protocol (MCP): Landscape, security threats, and future research directions. CoRR abs/2503.23278 (2025)
- [26] Li, Y., Huang, Y., Wang, T., Fan, C., Cai, X., Hu, S., Liu, X., Shi, C., Xu, M., Wang, Z., Wang, Y., Jin, X., Zhang, T., Zhang, L., Wang, L., Deng, Y., Zhang, P., Sun, W., Li, X., E, W., Zhang, L., Yao, Z., Chen, K.: Inverse knowledge search over verifiable reasoning: Synthesizing a scientific encyclopedia from a long chains-of-thought knowledge base. CoRR
- [27] Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A.R., Zhang, S., Sun, Y., Wang, W.: SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models (2024). https://arxiv.org/abs/2307.10635
- [28] Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: GPQA: A graduate-level Google-proof Q&A benchmark. In: First Conference on Language Modeling (2024). https://openreview.net/forum?id=Ti67584b98
- [29] Qiu, J., Shi, J., Juan, X., Zhao, Z., Geng, J., Liu, S., Wang, H., Wu, S., Wang, M.: Physics Supernova: AI agent matches elite gold medalists at IPhO 2025. CoRR abs/2509.01659 (2025)
- [30] Zhu, M., Tian, M., Yang, X., Zhou, T., Yuan, L., Zhu, P., Chertkov, E., Liu, S., Du, Y., Ji, Z., Das, I., Cao, J., Du, Y., Yu, J., Wu, P., He, J., Su, Y., Jiang, Y., Zhang, Y., Liu, C., Huang, Z.-M., Jia, W., Wang, Y., Jafarpour, F., Zhao, Y., Chen, X., Shelton, J., Young, A.W., Bartolotta, J., Xu, W., Sun, Y., Chu, A., Colussi, V., Akers, C., Brooks, N., et al.: Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark (2025)
- [31] Tian, M., Gao, L., Zhang, S.D., Chen, X., Fan, C., Guo, X., Haas, R., Ji, P., Krongchon, K., Li, Y., Liu, S., Luo, D., Ma, Y., Tong, H., Trinh, K., Tian, C., Wang, Z., Wu, B., Xiong, Y., Yin, S., Zhu, M., Lieret, K., Lu, Y., Liu, G., Du, Y., Tao, T., Press, O., Callan, J., Huerta, E.A., Peng, H.: SciCode: A research coding benchmark curated by scientists
- [32] Chung, D.J.H., Gao, Z., Kvasiuk, Y., Li, T., Münchmeyer, M., Rudolph, M., Sala, F., Tadepalli, S.C.: Theoretical Physics Benchmark (TPBench) – a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics (2025). https://arxiv.org/abs/2502.15815
- [33] Weng, Y., Zhu, M., Xie, Q., Sun, Q., Lin, Z., Liu, S., Zhang, Y.: DeepScientist: Advancing frontier-pushing scientific findings progressively. In: The Fourteenth International Conference on Learning Representations (2026). https://openreview.net/forum?id=cZFgsLq8Gs
- [34] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023). https://openreview.net/forum?id=uccHPGDlao
- [35] Starace, G., Jaffe, O., Sherburn, D., Aung, J., Chan, J.S., Maksin, L., Dias, R., Mays, E., Kinsella, B., Thompson, W., Heidecke, J., Glaese, A., Patwardhan, T.: PaperBench: Evaluating AI's Ability to Replicate AI Research (2025). https://arxiv.org/abs/2504.01848
- [36] Fishman, M., White, S.R., Stoudenmire, E.M.: The ITensor Software Library for Tensor Network Calculations. SciPost Phys. Codebases, 4 (2022). https://doi.org/10.21468/SciPostPhysCodeb.4
- [37] Carleo, G., Choo, K., Hofmann, D., Smith, J.E.T., Westerhout, T., Alet, F., Davis, E.J., Efthymiou, S., Glasser, I., Lin, S.-H., Mauri, M., Mazzola, G., Mendl, C.B., van Nieuwenburg, E., O'Reilly, O., Théveniaut, H., Torlai, G., Vicentini, F., Wietek, A.: NetKet: A machine learning toolkit for many-body quantum systems. SoftwareX 10, 100311 (2019). https://doi.org/10.1016/j.softx.2019.100311
- [38] Vicentini, F., Hofmann, D., Szabó, A., Wu, D., Roth, C., Giuliani, C., Pescia, G., Nys, J., Vargas-Calderón, V., Astrakhantsev, N., Carleo, G.: NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems. SciPost Phys. Codebases, 7 (2022). https://doi.org/10.21468/SciPostPhysCodeb.7
- [39] Aleksandrowicz, G., Alexander, T., Barkoutsos, P., Bello, L., Ben-Haim, Y., Bucher, D., Cabrera-Hernández, F.J., Carballo-Franquis, J., Chen, A., Chen, C.-F., Chow, J.M., Córcoles-Gonzales, A.D., Cross, A.J., Cross, A., Cruz-Benito, J., Culver, C., González, S.D.L.P., Torre, E.D.L., Ding, D., Dumitrescu, E., Duran, I., Eendebak, P., Everitt, M., et al.: Qiskit: An Open-source Framework for Quantum Computing (2019). https://doi.org/10.5281/zenodo.2562111
- [40] Neese, F.: The ORCA program system. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2(1), 73–78 (2012). https://doi.org/10.1002/wcms.81
- [41] Snell, C.V., Lee, J., Xu, K., Kumar, A.: Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=4FWAwZtd2n
- [42] Tezuka, M., Ueda, M.: Density-matrix renormalization group study of trapped imbalanced Fermi condensates. Phys. Rev. Lett. 100, 110403 (2008). https://doi.org/10.1103/PhysRevLett.100.110403