Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark
Pith reviewed 2026-06-26 19:15 UTC · model grok-4.3
The pith
DelveAgent improves accuracy on physical-science benchmarks by up to 7.5 percentage points at roughly one-third the inference cost of the strongest baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism, improves accuracy by up to 7.5 percentage points across four scientific benchmarks while cutting inference cost to approximately one-third of the strongest baseline's. It also claims that PhySciBench exposes a capability gap: the strongest baseline, Gemini Deep Research, reaches only 33.5% accuracy on its 200 expert-curated physical-science questions.
What carries the argument
DelveAgent, a modular multi-agent framework with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism.
If this is right
- PhySciBench functions as a dedicated benchmark for AI systems in physical sciences.
- The three identified deficiencies (fragile reasoning chains, limited knowledge transfer, absent physics-grounded self-verification) explain why current agents underperform.
- Adaptive planning, dual-granularity memory, and physics-grounded reflection together address those deficiencies.
- Architectural specialization can raise both accuracy and efficiency of autonomous scientific research agents.
Where Pith is reading between the lines
- If PhySciBench questions track actual lab workflows, then DelveAgent-style agents could shorten iteration cycles in experimental physics and chemistry.
- The modular structure suggests the same components could be ported to adjacent domains such as materials discovery or quantum information.
- Lower inference cost makes repeated use of such agents feasible inside resource-limited research groups.
Load-bearing premise
The 200 expert-curated questions accurately represent real-world physical science research challenges and workflows, and the measured gains stem from the proposed architecture rather than prompt or implementation details.
What would settle it
Run the 200 PhySciBench questions with an ablation that removes the hierarchical physics-grounded reflection module and check whether the accuracy and cost gains disappear.
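The leave-one-out design behind such an ablation can be sketched as follows. This is a minimal illustration, not the authors' code: the `AgentConfig` class and its three boolean switches are hypothetical names standing in for however DelveAgent actually toggles its components.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    # Hypothetical switches; the released DelveAgent code may expose these differently.
    adaptive_planning: bool = True
    dual_granularity_memory: bool = True
    physics_reflection: bool = True


COMPONENTS = ("adaptive_planning", "dual_granularity_memory", "physics_reflection")


def leave_one_out(full: AgentConfig) -> dict[str, AgentConfig]:
    """Return the full configuration plus one variant per disabled component."""
    variants = {"full": full}
    for name in COMPONENTS:
        # Disable exactly one component and keep everything else unchanged.
        variants[f"no_{name}"] = replace(full, **{name: False})
    return variants


variants = leave_one_out(AgentConfig())
for label, cfg in variants.items():
    print(label, cfg)
```

Each variant would then be run on the same 200 questions with the same underlying LLM, token budget, and tools, so that any change in accuracy or cost can be traced to the one component that was switched off.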
Original abstract
Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research. Our data and code are publicly available at https://github.com/yigengjiang/physci-deepresearch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhySciBench, a benchmark of 200 expert-curated questions balanced between physics and chemistry across six task categories reflecting scientific workflows. It evaluates state-of-the-art models and agents and finds limited performance: the strongest baseline, Gemini Deep Research, reaches 33.5% accuracy. It identifies three failure modes (fragile extended reasoning, limited knowledge transfer, and a lack of physics-grounded verification) and proposes DelveAgent, a multi-agent framework with an adaptive planning loop, dual-granularity memory, and hierarchical physics-grounded reflection. On four scientific benchmarks, DelveAgent achieves accuracy gains of up to 7.5 percentage points and reduces inference costs to roughly one-third of the strongest baseline's. Code and data are released publicly.
Significance. If the empirical results hold after addressing attribution and statistical concerns, the work supplies a new, domain-relevant benchmark for physical-science AI agents and shows that targeted architectural specialization can improve both accuracy and efficiency over general baselines. The public release of code and data is a clear strength supporting reproducibility and follow-on work.
major comments (2)
- [Experimental results (across the four benchmarks)] The central performance claims (up to 7.5 pp accuracy gain and ~3× cost reduction) are presented without error bars, statistical significance tests, or explicit controls for prompt-engineering differences and baseline implementation details. This information is required to establish that the observed deltas are attributable to the three proposed components rather than implementation or prompting variations.
- [DelveAgent framework description and evaluation] No ablation studies are reported that remove or disable each of the three components (adaptive planning loop, dual-granularity memory, hierarchical physics-grounded reflection) while keeping the underlying LLM, total prompt tokens, and tool access fixed. Such ablations are load-bearing for the claim that the architectural features, rather than other factors, produce the reported gains.
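To give a sense of the sampling noise behind the request for error bars, the sketch below computes a 95% Wilson score interval for a single accuracy figure. It assumes the 200 PhySciBench questions behave as independent trials, so that 33.5% corresponds to 67 correct answers; the function name `wilson_interval` is illustrative and does not come from the paper's code.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# 33.5% of 200 questions is 67 correct answers.
low, high = wilson_interval(67, 200)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval runs from roughly 27% to 40%, so a difference of a few percentage points on a benchmark of this size is hard to separate from noise without repeated runs or paired tests. The paper reports its 7.5-point gain across four benchmarks, not necessarily on PhySciBench alone, so this is only an order-of-magnitude check.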
minor comments (1)
- [PhySciBench introduction] The abstract and benchmark description would benefit from an explicit statement of how the 200 questions were selected and validated (e.g., inter-annotator agreement or coverage of typical research workflows) to strengthen the claim of representativeness.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each major comment below and commit to revisions that strengthen the empirical validation of our claims.
Point-by-point responses
- Referee: [Experimental results (across the four benchmarks)] The central performance claims (up to 7.5 pp accuracy gain and ~3× cost reduction) are presented without error bars, statistical significance tests, or explicit controls for prompt-engineering differences and baseline implementation details. This information is required to establish that the observed deltas are attributable to the three proposed components rather than implementation or prompting variations.
  Authors: We agree that providing error bars, statistical tests, and clearer controls for baselines would strengthen the claims. In the revised version, we will report results with standard deviations from multiple independent runs (e.g., 5 seeds), include p-values from appropriate statistical tests, and detail the prompt templates and implementation choices for all baselines to isolate the effect of our architectural components.
  Revision: yes
- Referee: [DelveAgent framework description and evaluation] No ablation studies are reported that remove or disable each of the three components (adaptive planning loop, dual-granularity memory, hierarchical physics-grounded reflection) while keeping the underlying LLM, total prompt tokens, and tool access fixed. Such ablations are load-bearing for the claim that the architectural features, rather than other factors, produce the reported gains.
  Authors: We acknowledge the importance of ablations for attributing performance gains to specific components. We will add a dedicated ablation study section in the revised manuscript, where we systematically disable each component (adaptive planning loop, dual-granularity memory, and hierarchical physics-grounded reflection) one at a time, while controlling for LLM, token budget, and tools, and report the resulting performance drops on the benchmarks.
  Revision: yes
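The rebuttal does not say which statistical test the authors intend to use. One common choice for comparing two systems on the same set of questions is an exact McNemar test on per-question correctness, sketched below with the standard library only. The function name and the toy data are assumptions made for illustration, not results from the paper.

```python
from math import comb


def mcnemar_exact(baseline: list[bool], candidate: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-question correctness."""
    if len(baseline) != len(candidate):
        raise ValueError("both systems must be scored on the same questions")
    # Discordant pairs: questions that exactly one of the two systems answers correctly.
    only_baseline = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    only_candidate = sum(1 for b, c in zip(baseline, candidate) if c and not b)
    n = only_baseline + only_candidate
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence either way
    k = min(only_baseline, only_candidate)
    # Under the null hypothesis, each discordant pair favours either system with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up example: the candidate fixes 7 of the baseline's errors and breaks nothing.
toy_baseline = [True] * 3 + [False] * 7
toy_candidate = [True] * 10
p_value = mcnemar_exact(toy_baseline, toy_candidate)
print(p_value)
```

Because the test uses only the questions on which the two systems disagree, it remains valid when both are scored on the same 200 items, which is the setting the promised ablations would produce.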
Circularity Check
No circularity: empirical measurements on new benchmark
full rationale
The paper introduces PhySciBench as a new 200-question benchmark and reports measured accuracy and cost improvements for DelveAgent versus baselines across four benchmarks. No equations, first-principles derivations, or predictions are claimed; the three architectural components are motivated by observed failure modes but the reported deltas are direct experimental outcomes rather than quantities forced by construction from fitted parameters or self-referential definitions. No load-bearing self-citations or uniqueness theorems appear in the text. The central claims therefore remain independent empirical results and receive the default non-circularity finding.