PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
Pith reviewed 2026-05-12 02:32 UTC · model grok-4.3
The pith
PDEAgent-Bench is the first benchmark to show that AI-generated PDE solver code often runs but rarely meets accuracy and efficiency standards across multiple FEM libraries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that no prior public benchmark exists for PDE-to-solver code generation and presents PDEAgent-Bench as the first multi-metric, multi-library testbed. It contains 645 instances across 6 mathematical categories and 11 PDE families, each providing an agent-facing specification, a reference solution on a prescribed grid, and case-specific accuracy and runtime targets. The benchmark applies a staged evaluation that requires generated solvers to pass executability, numerical accuracy, and computational efficiency checks in sequence using DOLFINx, Firedrake, and deal.II. Experiments show that representative LLMs and agents frequently produce runnable code, yet pass rates drop substantially once accuracy and efficiency requirements are enforced.
What carries the argument
PDEAgent-Bench itself: a collection of 645 PDE instances with staged checks (executability first, then accuracy against reference solutions on prescribed grids, then runtime limits), implemented across three professional FEM libraries.
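To make the staged gating concrete, here is a minimal sketch of how a three-gate harness of this shape could be wired up. It is illustrative, not the paper's evaluator: the Case fields, the relative L2 error norm, and the names tau_acc and tau_time are assumptions standing in for the case-specific targets the benchmark prescribes.

```python
import subprocess
import time
from dataclasses import dataclass

import numpy as np

@dataclass
class Case:
    solver_cmd: list[str]     # command that runs the generated solver
    output_path: str          # where the solver writes values on the grid
    ref_values: np.ndarray    # reference solution on the prescribed grid
    tau_acc: float            # case-specific accuracy threshold (assumed name)
    tau_time: float           # case-specific runtime threshold (assumed name)
    wall_limit: float         # hard wall-clock cap for the executability gate

def evaluate(case: Case) -> dict:
    """Run one generated solver through the three gates in order:
    executability -> numerical accuracy -> computational efficiency."""
    # Gate 1: executability. The solver must terminate cleanly.
    t0 = time.perf_counter()
    try:
        proc = subprocess.run(case.solver_cmd, capture_output=True,
                              timeout=case.wall_limit)
    except subprocess.TimeoutExpired:
        return {"stage": "executability", "passed": False}
    runtime = time.perf_counter() - t0
    if proc.returncode != 0:
        return {"stage": "executability", "passed": False}

    # Gate 2: accuracy. Compare the solver's output, sampled on the
    # prescribed evaluation grid, against the reference solution.
    u_hat = np.loadtxt(case.output_path)
    err = np.linalg.norm(u_hat - case.ref_values) / np.linalg.norm(case.ref_values)
    if not np.isfinite(err) or err > case.tau_acc:
        return {"stage": "accuracy", "passed": False, "rel_error": float(err)}

    # Gate 3: efficiency. Wall-clock time must meet the runtime target.
    if runtime > case.tau_time:
        return {"stage": "efficiency", "passed": False, "runtime": runtime}

    return {"stage": "all", "passed": True, "rel_error": float(err),
            "runtime": runtime}
```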
If this is right
- Current LLMs and code agents remain limited in generating numerically reliable and efficient PDE solvers.
- PDE solver generation now has a reproducible testbed that measures progress against concrete accuracy and efficiency requirements.
- Future agent designs must improve handling of discretization schemes, solver configuration, and library-specific implementations to succeed on this benchmark.
- The staged framework separates syntactic success from numerical correctness, allowing targeted diagnosis of model weaknesses.
Where Pith is reading between the lines
- The benchmark could be adapted to measure progress in generating solvers for other simulation domains that also demand both precision and speed.
- Models might improve if trained on larger corpora of verified FEM code paired with the corresponding PDE specifications.
- The sharp drop in pass rates suggests that purely language-based agents may need integration with symbolic or verification tools to reach professional standards.
Load-bearing premise
That the chosen staged checks, accuracy and runtime targets, reference grids, and 645 instances across the six categories and eleven PDE families sufficiently represent the real demands of numerical PDE work in professional libraries.
What would settle it
A model that consistently passes all three stages on a majority of the 645 instances while still producing solvers that match independent real-world FEM applications outside the benchmark.
Original abstract
PDE-to-solver code generation aims to automatically synthesize executable numerical solvers from partial differential equation (PDE) specifications. This task requires not only understanding the mathematical structure of PDEs, but also selecting appropriate discretization schemes and solver configurations, and correctly implementing the resulting formulations in finite-element method (FEM) libraries. Existing code generation benchmarks mainly evaluate syntactic correctness, or success on predefined test cases. To our knowledge, there is currently no publicly available benchmark specifically for PDE-to-solver code generation, and general-purpose code benchmarks do not fully capture the unique challenges of numerical PDE solution, such as ensuring solver accuracy, efficiency, and compatibility with professional FEM libraries. We introduce PDEAgent-Bench, to the best of our knowledge, the first multi-metric, multi-library benchmark for PDE-to-solver code generation. PDEAgent-Bench contains 645 instances across 6 mathematical categories and 11 PDE families, with common FEM libraries for DOLFINx, Firedrake, and deal.II. Each instance provides an agent-facing problem specification, a reference solution on a prescribed evaluation grid, and case-specific accuracy and runtime targets. PDEAgent-Bench adopts a staged evaluation framework in which generated solvers must sequentially pass executability, numerical accuracy, and computational efficiency checks. Experiments with representative LLMs and code agents show that models can often produce runnable code, but their pass rate drops substantially once accuracy and efficiency requirements are enforced. These results indicate that current agents remain limited in producing numerically reliable and efficient PDE solvers, and that PDEAgent-Bench provides a reproducible testbed grounded in the practical requirements of numerical PDE solving.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PDEAgent-Bench, the first multi-metric, multi-library benchmark for PDE-to-solver code generation. It consists of 645 instances across 6 mathematical categories and 11 PDE families, supporting DOLFINx, Firedrake, and deal.II. Each instance supplies an agent-facing specification, a reference solution on a prescribed grid, and case-specific accuracy/runtime targets. Evaluation proceeds in stages: executability, then numerical accuracy, then computational efficiency. Experiments with LLMs and code agents show that runnable code is frequently generated, but pass rates drop sharply once accuracy and efficiency gates are enforced, indicating current agents remain limited for producing numerically reliable PDE solvers.
Significance. If the benchmark construction and targets are representative, the work fills a clear gap by moving code-generation evaluation beyond syntax or toy test cases into the practical requirements of FEM-based numerical PDE solving. The staged, multi-library design and reproducible reference grids provide a concrete testbed that can drive progress on agent capabilities for scientific computing tasks. The reported collapse in pass rates under accuracy/efficiency constraints is a falsifiable, actionable finding.
major comments (2)
- [§3] §3 (Benchmark Construction): The selection criteria and validation process for the 645 instances (6 categories, 11 families) are described at a high level only. Without explicit documentation of how instances were chosen to reflect realistic discretization and solver-configuration challenges, it is difficult to assess whether the accuracy and runtime targets are load-bearing for the central claim that agents are limited in producing reliable solvers.
- [§4.2] §4.2 (Reference Solutions and Targets): The generation of reference solutions on prescribed grids and the exact definitions of accuracy thresholds (error norms, tolerances) and runtime targets are not specified in sufficient detail to allow independent reproduction or verification of the staged pass-rate results. These elements are central to the experimental claim that pass rates drop substantially once accuracy and efficiency are enforced.
minor comments (3)
- [Abstract] The repeated phrase 'to the best of our knowledge' in the abstract could be consolidated; a single statement suffices.
- [Figures] Figure captions should explicitly list the three FEM libraries and the number of instances per category to improve immediate readability.
- [Related Work] The related-work section should include a brief comparison table against existing code-generation benchmarks (e.g., HumanEval, APPS) to highlight the unique PDE-specific metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the significance of PDEAgent-Bench in addressing a gap in evaluating AI-generated PDE solvers. We address each major comment below and will revise the manuscript to improve clarity and reproducibility.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction): The selection criteria and validation process for the 645 instances (6 categories, 11 families) are described at a high level only. Without explicit documentation of how instances were chosen to reflect realistic discretization and solver-configuration challenges, it is difficult to assess whether the accuracy and runtime targets are load-bearing for the central claim that agents are limited in producing reliable solvers.
Authors: We agree that more explicit documentation of the instance selection and validation process is required to substantiate the benchmark's design and the load-bearing nature of the targets. In the revised manuscript we will expand §3 with a dedicated subsection detailing the selection criteria (including coverage of discretization schemes, solver configurations, and problem complexities across the 6 categories and 11 families), the validation steps performed to confirm realism and diversity, and the rationale for the resulting distribution of the 645 instances. revision: yes
Referee: [§4.2] §4.2 (Reference Solutions and Targets): The generation of reference solutions on prescribed grids and the exact definitions of accuracy thresholds (error norms, tolerances) and runtime targets are not specified in sufficient detail to allow independent reproduction or verification of the staged pass-rate results. These elements are central to the experimental claim that pass rates drop substantially once accuracy and efficiency are enforced.
Authors: We acknowledge that the current description of reference-solution generation and the precise definitions of accuracy and runtime thresholds lack the granularity needed for full reproducibility. In the revision we will augment §4.2 with explicit descriptions of the reference-solution procedure (including the numerical methods, grid specifications, and libraries employed to obtain ground-truth values), the exact error norms and tolerance values used for the accuracy gate, and the case-specific runtime targets. These additions will directly support verification of the staged evaluation results. revision: yes
Circularity Check
No significant circularity
Rationale
The paper constructs a new benchmark (PDEAgent-Bench) with 645 instances, staged evaluation gates, reference grids, and accuracy/runtime targets, then runs existing LLMs/agents against it. All load-bearing claims (first-of-kind status, pass-rate collapse under accuracy/efficiency) rest on external model evaluations and the explicit benchmark definition rather than any self-definition, fitted-parameter renaming, or self-citation chain that reduces the result to its inputs. The derivation is a standard empirical benchmark paper and remains self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (relevance: unclear). Matched passage: "PDEAgent-Bench adopts a staged evaluation framework in which generated solvers must sequentially pass executability, numerical accuracy, and computational efficiency checks."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (relevance: unclear). Matched passage: "We introduce PDEAgent-Bench, to the best of our knowledge, the first multi-metric, multi-library benchmark for PDE-to-solver code generation."
Reference graph
Works this paper leans on
- [1] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732.
- [3] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?i...
- [4] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- [5] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [6] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. In The Fourteenth International Conference on Learning Representations, 2026.
- [7] Igor A. Baratta, Joseph P. Dean, Jørgen S. Dokken, Michal Habera, Jack S. Hale, Chris N. Richardson, Marie E. Rognes, Matthew W. Scroggs, Nathan Sime, and Garth N. Wells. DOLFINx: the next generation FEniCS problem solving environment, 2023. URL https://doi.org/10.5281/zenodo.10447666.
- [8] David A. Ham, Paul H. J. Kelly, Lawrence Mitchell, Colin J. Cotter, Robert C. Kirby, Koki Sagiyama, Nacime Bouziani, Sophia Vorderwuelbecke, Thomas J. Gregory, Jack Betteridge, Daniel R. Shapero, Reuben W. Nixon-Hill, Connor J. Ward, Patrick E. Farrell, Pablo D. Brubeck, India Marsden, Thomas H. Gibson, Miklós Homolya, Tianjiao Sun, Andrew T. T. McRae, et al. Firedrake User Manual, 2023.
- [9] Wolfgang Bangerth, Ralf Hartmann, and Guido Kanschat. deal.II - a general-purpose object-oriented finite element library. ACM Transactions on Mathematical Software, 33(4):24–es, 2007. doi: 10.1145/1268776.1268779.
- [11] Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, and Ameet Talwalkar. CodePDE: An inference framework for LLM-driven PDE solver generation. arXiv preprint arXiv:2505.08783, 2025.
- [12] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- [13] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.15793.
- [14] Zhen Huang, Hong Wang, Wenkai Yang, Muxi Tang, Depeng Xie, Ting-Jung Lin, Yu Zhang, Wei W. Xing, and Lei He. Self-attention to operator learning-based 3D-IC thermal simulation. In 2025 62nd ACM/IEEE Design Automation Conference (DAC), pages 1–7. IEEE, 2025.
- [15] Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Daniel MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. PDEBench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35:1596–1611, 2022.
- [16] Jayesh K. Gupta and Johannes Brandstetter. Towards multi-spatiotemporal-scale generalized PDE modeling. arXiv preprint arXiv:2209.15616, 2022.
- [17] Yining Luo, Yingfa Chen, and Zhen Zhang. CFDBench: A large-scale benchmark for machine learning methods in fluid dynamics. arXiv preprint arXiv:2310.05963, 2023.
- [18] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.
- [19] Minyang Tian, Luyu Gao, Shizhuo D. Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, et al. SciCode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems, 37:30624–30650, 2024.
- [20] Peiyan Hu, Haodong Feng, Hongyuan Liu, Tongtong Yan, Wenhao Deng, Tianrun Gao, Rong Zheng, Haoren Zheng, Chenglei Yu, Chuanrui Wang, et al. RealPDEBench: A benchmark for complex physical systems with real-world data. arXiv preprint arXiv:2601.01829, 2026.
- [21] Haoyang Wu, Xinxin Zhang, and Lailai Zhu. Automated code development for PDE solvers using large language models, 2025. URL https://arxiv.org/abs/2509.25194.
- [22] Ao Cheng, Lei Zhang, and Guowei He. Re4: Scientific computing agent with rewriting, resolution, review and revision, 2026. URL https://arxiv.org/abs/2508.20729.
- [23] Mauricio Soroco, Jialin Song, Mengzhou Xia, Kye Emond, Weiran Sun, and Wuyang Chen. PDE-Controller: LLMs for autoformalization and reasoning of PDEs. arXiv preprint arXiv:2502.00963, 2025.
- [24] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint, 2025.
- [25] Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Zhenheng Tang, Sen Hu, Bo Li, Chen Hu, Binxing Jiao, et al. GitTaskBench: A benchmark for code agents solving real-world tasks through code repository leveraging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32564–32572, 2026.
- [26] Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, et al. DA-Code: Agent data science code generation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13487–13521, 2024.
- [27] Shuhan Liu, Zhiyi Zhao, Xing Hu, Kui Liu, Xiaohu Yang, and Xin Xia. A benchmark for evaluating repository-level code agents with intermediate reasoning on feature addition task.
- [29] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024.
- [30] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=oKn9c6ytLx.
- [31] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In The Thirty-eighth Conference on Neural Information Processing Systems, 2024.
- [32] Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, and Eugene Siow. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=fe8mzHwMxN.
- [33] Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Chen Sheng, Haozhe Zhao, Liang Chen, Tianyu Liu, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Zhiwei Jiang, Baobao Chang, Arman Cohan, and Mark Gerstein. ML-Bench: Evaluating large language models and agents for machine learning..., 2025.
- [34] Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber. ResearchCodeBench: Benchmarking LLMs on implementing novel machine learning research code. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum?id...
- [35] Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, and Bryan Hooi. MLR-Bench: Evaluating AI agents on open-ended machine learning research. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum?id=JX9DE6colf.
- [36] Franck Ndzomga. Efficient benchmarking of AI agents, 2026. URL https://arxiv.org/abs/2603.23749.
- [37] Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, et al. PHYBench: Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025.
- [38] Zhongkai Hao, Jiachen Yao, Chang Su, Hang Su, Ziao Wang, Fanzhi Lu, Zeyu Xia, Yichi Zhang, Songming Liu, Lu Lu, et al. PINNacle: A comprehensive benchmark of physics-informed neural networks for solving PDEs. Advances in Neural Information Processing Systems, 37:76721–76774, 2024.
- [39] Shuo Ren, Can Xie, Pu Jian, Zhenjiang Ren, Chunlin Leng, and Jiajun Zhang. Towards scientific intelligence: A survey of LLM-based scientific agents, 2026. URL https://arxiv.org/abs/2503.24047.
- [40] Jurgis Ruža and Rafael Gomez-Bombarelli. Reasoning-to-simulation: An agentic framework for discovery of electrolyte materials. In AI for Accelerated Materials Design - ICLR 2026. URL https://openreview.net/forum?id=MEXl18VhBL.
- [43] Cooper Lorsung and Amir Barati Farimani. Explain like I'm five: Using LLMs to improve PDE surrogate models with text, 2024. URL https://openreview.net/forum?id=D3iJmVAmT7.
- [44] Qingpo Wuwu, Chonghan Gao, Tianyu Chen, Yihang Huang, Yuekai Zhang, Jianing Wang, Jianxin Li, Haoyi Zhou, and Shanghang Zhang. PINNsAgent: Automated PDE surrogation with large language models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=RO5OGOzs6M.
- [45] Xiao Luo, Changhu Wang, Yizhou Sun, and Wei Wang. How do large language models perform on PDE discovery: A coarse-to-fine perspective. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2684–2697, 2025.
- [46] Zhuoyuan Wang, Hanjiang Hu, Xiyu Deng, Saviz Mowlavi, and Yorie Nakahira. OpInf-LLM: Parametric PDE solving with LLMs via operator inference. In AI&PDE: ICLR 2026 Workshop on AI and Partial Differential Equations, 2026. URL https://openreview.net/forum?id=6FocHf1tCE.
- [47] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets, 2021. URL https://arxiv.org/abs/1803.09010.
Appendix excerpts: benchmark construction pipeline
Design matrix. For each PDE family, we define a small set of variation axes that correspond to interpretable numerical challenges, such as coefficient contrast, source-term regularity, solution smoothness, boundary-condition type, geometry, time horizon, Reynolds or Péclet regime, nonlinearity strength, and stiffness. Candidate cases are sampled from the...
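As a rough illustration of sampling cases from such a design matrix, consider the sketch below. The axis names echo the paragraph above, while the concrete values and the uniform random sampling are assumptions.

```python
# Illustrative sampling of a case design matrix. The axis names echo the
# excerpt above; the concrete values and the sampling scheme are assumptions.
import itertools
import random

AXES = {
    "coefficient_contrast": [1, 10, 1000],
    "boundary_condition": ["dirichlet", "neumann", "mixed"],
    "solution_smoothness": ["smooth", "low_regularity"],
    "nonlinearity_strength": [0.0, 0.5, 1.0],
}

def sample_cases(n: int, seed: int = 0) -> list[dict]:
    """Draw n distinct axis combinations from the full factorial grid."""
    rng = random.Random(seed)
    grid = list(itertools.product(*AXES.values()))
    return [dict(zip(AXES, combo)) for combo in rng.sample(grid, n)]

# e.g. sample_cases(5) yields five candidate case definitions for one family.
```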
Problem specification. The contributor writes the agent-facing case_spec, including the PDE family, coefficients, forcing, domain, boundary conditions, initial conditions when applicable, temporal parameters, output field, and prescribed evaluation grid. This record defines the information available to the agent at generation time.
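A case_spec of the kind described might look roughly as follows. Every field name and value here is a hypothetical paraphrase of the paragraph above, not the released schema.

```python
# A hypothetical case_spec record; field names and values paraphrase the
# excerpt above and are NOT the benchmark's released schema.
case_spec = {
    "case_id": "poisson_smooth_0001",
    "equation_family": "poisson",
    "pde": {"type": "elliptic", "coefficients": {"kappa": 1.0}},
    "forcing": "2*pi**2*sin(pi*x)*sin(pi*y)",   # symbolic expression string
    "domain": {"shape": "unit_square"},
    "boundary_conditions": [{"tag": "all", "type": "dirichlet", "value": "0.0"}],
    "initial_conditions": None,                  # stationary problem
    "temporal": None,
    "output_field": "u",
    "evaluation_grid": {"type": "uniform", "nx": 64, "ny": 64},
    "supported_libraries": ["dolfinx", "firedrake", "dealii"],
}
```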
Reference construction path. The contributor specifies how the reference solution is obtained. When possible, the case uses a manufactured solution, from which source terms, boundary data, and initial conditions are derived analytically. When an analytic manufactured solution is unavailable, the contributor provides a high-fidelity reference_config that ge...
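The manufactured-solution path can be illustrated with a short symbolic computation: choose an analytic u, then derive the forcing that makes u solve the PDE exactly. The sketch below does this for a Poisson problem -Δu = f; it illustrates the general technique, not the paper's generation script.

```python
# Method of manufactured solutions for -Δu = f on the unit square.
import sympy as sp

x, y = sp.symbols("x y")

# Choose an analytic solution that vanishes on the boundary of [0, 1]^2.
u = sp.sin(sp.pi * x) * sp.sin(sp.pi * y)

# Derive the source term that makes u an exact solution of -Δu = f.
f = sp.simplify(-(sp.diff(u, x, 2) + sp.diff(u, y, 2)))

print(f)  # 2*pi**2*sin(pi*x)*sin(pi*y)
# Boundary data follows by restricting u to the boundary; here u = 0 there,
# so the case gets homogeneous Dirichlet conditions.
```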
Calibration baseline. The contributor provides a separate calibration_config. The calibration run is not used as the scoring reference; instead, it is compared with the reference solution to compute e_base and to derive the case-specific accuracy threshold. Its runtime is used to derive the runtime threshold. This separation prevents the benchmark from requ...
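The paragraph pins down what the calibration run feeds into but not the exact formulas, so the sketch below uses assumed multiplicative slack factors to derive the two thresholds from e_base and the calibration runtime.

```python
# Hypothetical threshold derivation. The benchmark derives tau_acc from the
# calibration error e_base and tau_time from the calibration runtime; the
# slack factors below are illustrative assumptions, since the excerpt does
# not give the exact formulas.
ACC_SLACK = 1.5    # assumed headroom over the calibration error
TIME_SLACK = 2.0   # assumed headroom over the calibration runtime

def derive_thresholds(e_base: float, t_calib: float) -> tuple[float, float]:
    tau_acc = ACC_SLACK * e_base      # accuracy gate: rel. error <= tau_acc
    tau_time = TIME_SLACK * t_calib   # efficiency gate: runtime <= tau_time
    return tau_acc, tau_time
```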
Agent-facing numerical-decision fields. The case record may list high-level numerical decisions that an agent is expected to make, such as mesh resolution, element degree, time-step size, nonlinear iteration strategy, stabilization choice, or solver/preconditioner configuration. These fields describe the role of each decision but do not include recommended n...
Static validation. The schema validator checks identifier uniqueness, required-field presence, consistency between pde.type and equation_family, domain and boundary-tag validity, output-grid validity, supported-library declarations, and symbolic parsability of expression fields. It also verifies that cases without manufactured solutions provide a complete reference_config.
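A minimal sketch of static checks in this spirit, assuming the hypothetical case_spec layout shown earlier; the real validator enforces a fuller schema (boundary-tag validity, grid validity, and so on).

```python
# Minimal static checks echoing the excerpt above, against the hypothetical
# case_spec layout sketched earlier.
import sympy as sp

REQUIRED_FIELDS = {"case_id", "equation_family", "pde", "domain",
                   "boundary_conditions", "output_field", "evaluation_grid"}

def validate_case(spec: dict, seen_ids: set[str]) -> list[str]:
    errors = []
    missing = REQUIRED_FIELDS - spec.keys()
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    if spec.get("case_id") in seen_ids:
        errors.append("duplicate case_id")
    # Symbolic parsability of expression fields.
    try:
        sp.sympify(spec.get("forcing", "0"))
    except sp.SympifyError:
        errors.append("forcing expression does not parse")
    # Cases without a manufactured solution must ship a reference_config.
    if "manufactured_solution" not in spec and "reference_config" not in spec:
        errors.append("no manufactured solution and no reference_config")
    return errors
```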
Trial execution and calibration. The build script executes the reference-generation and calibration procedures in the evaluation harness. It verifies that the reference solution is produced successfully, that output-grid sampling is valid, that the calibration error is finite, that runtime measurement succeeds, and that the derived thresholds (τ_acc, τ_time)...
Packaging. Validated cases are exported into the released JSONL files. Agent-visible fields are separated from evaluator-only metadata, reference artifacts are stored under evaluator-controlled paths, and materialized thresholds are recorded for reproducible scoring. The release package also includes schema files, evaluator scripts, container recipes, and ...