arxiv: 2509.20374 · v3 · submitted 2025-09-19 · 💻 cs.CL · cs.AI

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Pith reviewed 2026-05-18 15:01 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models, computational fluid dynamics, benchmark, numerical reasoning, code generation, scientific computing, evaluation framework

The pith

A new benchmark suite with three components evaluates how well large language models perform on graduate-level computational fluid dynamics knowledge, reasoning, and code implementation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models perform well on many language tasks yet their capacity to automate labor-intensive numerical experiments on complex physical systems like fluid flow has received little direct testing. The paper creates CFDLLMBench to fill this gap with three linked parts that together check knowledge recall, physical and numerical reasoning, and the ability to write working CFD code in realistic settings. Evaluation rests on measurable outcomes such as whether generated code runs, produces accurate solutions, and reaches numerical convergence. A reader would care because CFD remains the main tool for simulating flows in engineering and science, so reliable LLM assistance could cut the time spent on repetitive setup and debugging. The benchmark draws its tasks from actual CFD practice to make the scores meaningful for future automation efforts.

Core claim

The paper presents CFDLLMBench as a benchmark suite that contains CFDQuery for graduate-level CFD knowledge, CFDCodeBench for numerical and physical reasoning, and FoamBench for context-dependent CFD workflow implementation, all supported by a task taxonomy and an evaluation framework that tracks code executability, solution accuracy, and numerical convergence behavior.
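
The three tracked quantities (code executability, solution accuracy, and numerical convergence behavior) can be made concrete with a small sketch. This is not the paper's evaluation harness: the function names, the RMS error measure, the tolerance, and the toy forward Euler problem standing in for LLM-generated code are all assumptions chosen for illustration.

```python
import math

def rms_error(approx, exact):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(approx, exact)) / len(exact))

def evaluate_solver(solver, exact, resolutions, tol):
    """Hypothetical check: does the code run, is it accurate, how fast does it converge?"""
    errors = []
    for n in resolutions:
        try:
            xs, us = solver(n)  # executability: any exception counts as a failure
        except Exception:
            return {"executable": False, "accurate": False, "order": None}
        if len(xs) != len(us) or not all(math.isfinite(u) for u in us):
            # ran to completion but produced unusable (mismatched or non-finite) output
            return {"executable": True, "accurate": False, "order": None}
        errors.append(rms_error(us, [exact(x) for x in xs]))
    order = None
    if len(errors) >= 2 and errors[-1] > 0 and errors[-2] > 0:
        # observed order of convergence from the two finest resolutions
        order = math.log(errors[-2] / errors[-1]) / math.log(resolutions[-1] / resolutions[-2])
    return {"executable": True, "accurate": errors[-1] <= tol, "order": order}

def forward_euler_decay(n):
    """Toy stand-in for generated code: forward Euler for du/dt = -u, u(0) = 1, on [0, 1]."""
    h = 1.0 / n
    ts, us = [0.0], [1.0]
    for i in range(n):
        us.append(us[-1] * (1.0 - h))
        ts.append((i + 1) * h)
    return ts, us

def broken_solver(n):
    """Toy stand-in for generated code that crashes."""
    raise RuntimeError("simulated crash")

report = evaluate_solver(forward_euler_decay, lambda t: math.exp(-t), [50, 100, 200], tol=1e-2)
crash = evaluate_solver(broken_solver, lambda t: math.exp(-t), [50, 100], tol=1e-2)
print(report)
print(crash)
```

Forward Euler is first-order accurate, so the observed order for the toy problem should come out close to 1; a real harness would compare generated CFD code against a reference solution in the same way, with problem-specific tolerances.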

What carries the argument

The CFDLLMBench suite, whose three complementary components together test distinct LLM competencies using tasks drawn from real-world CFD practices and a consistent set of reproducibility-focused metrics.

If this is right

  • LLM performance in automating CFD numerical experiments can now be measured systematically across knowledge, reasoning, and implementation stages.
  • Developers gain concrete scores on code executability, solution accuracy, and convergence that can guide model improvement.
  • A reusable foundation exists for building and checking LLM tools that assist with complex physical-system simulations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-competency structure could be adapted to create benchmarks for other simulation-heavy fields such as heat transfer or structural analysis.
  • Strong benchmark results might indicate which models are ready for iterative, feedback-driven CFD workflows that combine LLM suggestions with live solver output.
  • Extending the benchmark with time-dependent or multi-physics problems would test whether current LLM reasoning scales to more demanding CFD scenarios.

Load-bearing premise

The chosen tasks and metrics in CFDQuery, CFDCodeBench, and FoamBench faithfully represent the main difficulties and everyday practices of computational fluid dynamics without major biases or missing areas.

What would settle it

If models that score well on the benchmark produce code that fails to run, yields wrong answers, or diverges when applied to standard CFD test cases drawn from outside the benchmark, the claim that the suite measures genuine CFD capability would be weakened.
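
One way to picture this check is to compare a model's success rate on benchmark cases with its success rate on held-out cases, using the Figure 2 notion of Success Rate as the fraction of cases that produce physically accurate results. The sketch below is only an illustration: the record layout, the outcome labels, the tolerance, and every number in the two case lists are made-up placeholders, not results from the paper.

```python
from collections import Counter

def classify(case):
    """Map one case record to an outcome label (hypothetical record layout)."""
    if not case["ran"]:
        return "failed_to_run"
    if case["diverged"]:
        return "diverged"
    if case["error"] > case["tol"]:
        return "wrong_answer"
    return "success"

def success_rate(cases):
    """Fraction of cases labelled 'success', plus the full outcome counts."""
    counts = Counter(classify(c) for c in cases)
    return counts["success"] / len(cases), counts

def case(ran, diverged=False, error=None, tol=0.05):
    """Build one illustrative case record; all values are placeholders."""
    return {"ran": ran, "diverged": diverged, "error": error, "tol": tol}

# Placeholder data for illustration only, not measurements.
in_benchmark = [case(True, error=0.01), case(True, error=0.02),
                case(True, error=0.03), case(True, error=0.20)]
held_out = [case(True, error=0.01), case(False),
            case(True, error=0.30), case(True, diverged=True)]

in_rate, in_counts = success_rate(in_benchmark)
out_rate, out_counts = success_rate(held_out)
gap = in_rate - out_rate  # a large positive gap would weaken the benchmark's claim
print(in_rate, out_rate, gap, dict(out_counts))
```

Keeping the three failure modes separate (did not run, wrong answer, diverged) mirrors the wording of the paragraph above and shows which kind of failure drives any gap.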

Figures

Figures reproduced from arXiv: 2509.20374 by Anurag Acharya, Ling Yue, Nithin Somasekharan, Patrick Emami, Pochinapeddi Sai Bhargav, Shaowu Pan, Weichao Li, Xingyu Xie, Yadi Cao.

Figure 1: Overview of CFDLLMBench: As the first-ever LLM benchmark designed to holistically evaluate LLM capabilities for CFD, it consists of three different tasks and datasets. (1) CFDQuery: Graduate-level CFD QA. (2) CFDCodeBench: Coding questions about solving common linear/nonlinear PDEs encountered in CFD. (3) FoamBench: Configuring OpenFOAM case files for simulating realistic engineering scenarios such as in…
Figure 2: Success Rate comparison of different models across the three tasks. Success Rate is the fraction of cases in the benchmark that produce physically accurate results (higher is better). The detailed definition of Success Rate for each benchmark task can be found in section 3.3. The results for FoamBench are produced using the Foam-Agent framework with RAG, Reviewer, and Sonnet 3.5. There is a steep drop in p…
Figure 3: Average metric score and Success Rate for CFDCodeBench. The Success Rate of even the best-performing models is around 14%, suggesting the challenging nature of the problems in this benchmark.
Figure 4: Average metric score and Success Rate for different models on FoamBench using the Foam-Agent framework with RAG and Reviewer. The Success Rate of even the best-performing model (Sonnet 3.5) is 34% on the basic dataset and 25% on the advanced dataset.
Figure 5: Comparison of the geometry and mesh generated by the Foam-Agent [54] (RAG and Reviewer) with Sonnet 3.5 for the doubleSquare case against human expert.
  Spatial reasoning: The CFD simulation workflows in FoamBench have preprocessing steps where a correct geometry and mesh file must be generated by the LLM. To handle real-world workflows, LLMs should be able to extrapolate to novel geometries. We highlight a …
Figure 6: OpenFOAM reference case files defining the initial and boundary conditions.
  Finally, the prompt that is input to the frameworks is shown below. Since they are tutorial problems, we do not describe the geometry at great length and assume the RAG should be able to pick it out based on the description.
  Prompt: Do a laminar, compressible flow over a forward-facing step using the rhoCentralFoam solver. Boundary …
Figure 7: OpenFOAM reference case files defining the physical properties and turbulence models.
Figure 8: OpenFOAM reference case files defining the solver configurations, geometry and mesh.
Figure 9: Common reasons for execution failure found in MetaOpenFoam and Foam-Agent with RAG and Reviewer, using Sonnet 3.5 as the prompt model.
  B.3 Token Usage: The token usage statistics of the two frameworks in combination with the different models is shown in
Figure 10: Solution comparison at the final time step for 1D Burgers equation.
Figure 11: X direction velocity (u) comparison at the final time step for 2D Convection equation.
Figure 12: Comparison of velocity magnitude at the final timestep for 2D Cavity case.
Figure 13: Comparison of velocity magnitude at the final timestep for 2D forwardStep case.
Original abstract

Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CFDLLMBench, a benchmark suite comprising three complementary components—CFDQuery, CFDCodeBench, and FoamBench—designed to holistically evaluate LLM performance across graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, the benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results quantifying LLM performance on code executability, solution accuracy, and numerical convergence behavior.

Significance. If the tasks and metrics accurately capture core CFD challenges, this benchmark could provide a valuable standardized foundation for assessing and advancing LLM capabilities in automating numerical experiments for complex physical systems. The open release of code and data at the provided GitHub repository is a clear strength that supports reproducibility and community use.

major comments (1)
  1. FoamBench component: the OpenFOAM-centric design risks measuring familiarity with one specific package's syntax and case-file structure rather than transferable context-dependent implementation skills across CFD methods (e.g., finite-element or spectral approaches). This directly affects the central claim that the three components together provide a holistic evaluation of 'context-dependent implementation of CFD workflows' grounded in general real-world practices.
minor comments (1)
  1. Abstract: the scale of the benchmark (number of tasks or queries per component) is not quantified, which would help readers assess coverage and effort required for evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing CFDLLMBench. We address the major comment below and have revised the manuscript to clarify the design and scope of the benchmark suite.

Point-by-point responses
  1. Referee: FoamBench component: the OpenFOAM-centric design risks measuring familiarity with one specific package's syntax and case-file structure rather than transferable context-dependent implementation skills across CFD methods (e.g., finite-element or spectral approaches). This directly affects the central claim that the three components together provide a holistic evaluation of 'context-dependent implementation of CFD workflows' grounded in general real-world practices.

    Authors: We appreciate the referee's point regarding the specificity of FoamBench. OpenFOAM was selected because it is a widely adopted, open-source finite-volume CFD platform used extensively in both academic research and industrial applications for simulating complex flows. The tasks in FoamBench focus on practical skills such as case configuration, boundary condition specification, solver parameter tuning, and ensuring numerical convergence, which reflect core elements of real-world CFD workflows. We acknowledge, however, that this design emphasizes implementation within one particular software ecosystem and does not directly assess transferable skills for alternative discretizations such as finite-element or spectral methods. To address this, we have revised the manuscript to explicitly state the rationale for the OpenFOAM focus, to temper the claim of full holism across all CFD paradigms, and to add a limitations paragraph noting that extensions to other frameworks would broaden coverage of context-dependent implementation skills.

    Revision: yes

Circularity Check

0 steps flagged

Benchmark construction with no derivation chain or circular reductions

The paper introduces CFDLLMBench as a new benchmark suite with three components (CFDQuery, CFDCodeBench, FoamBench) to evaluate LLM competencies in CFD knowledge, reasoning, and implementation. No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The design is presented as grounded in real-world practices with a task taxonomy and evaluation framework; claims about holistic evaluation are supported directly by the described components rather than self-citations, ansatzes, or uniqueness theorems. This is a standard benchmark paper whose central contribution is self-contained construction, warranting no circularity flags.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that the chosen tasks represent authentic CFD practice and that the three metrics sufficiently quantify LLM utility for numerical experiments.

axioms (1)
  • domain assumption: The tasks in CFDQuery, CFDCodeBench, and FoamBench accurately reflect real-world CFD practices and challenges.
    Explicitly stated in the abstract as 'Grounded in real-world CFD practices'.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

    cs.AI 2026-05 unverdicted novelty 7.0

    SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

  2. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

    cs.AI 2026-04 unverdicted novelty 7.0

    LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

  3. ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

    cs.CR 2026-05 conditional novelty 6.0

    Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 3 Pith papers · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Claude 3.5 sonnet model card addendum

    Anthropic. Claude 3.5 sonnet model card addendum. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf. Accessed: 2025-05-03

  4. [4]

    Program Synthesis with Large Language Models

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  5. [5]

    Barba, L. A. and Forsyth, G. F. Cfd python: the 12 steps to navier-stokes equations. Journal of Open Source Education, 2(16):21, 2018

  6. [6]

    SciBERT: A pretrained language model for scientific text

    Beltagy, I., Lo, K., and Cohan, A. Scibert: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019

  7. [7]

    Blocken, B. Computational fluid dynamics for urban physics: Importance, scales, possibilities, limitations and ten tips and tricks towards accurate and reliable simulations. Building and Environment, 91:219–245, 2015

  8. [8]

    Blocken, B., Stathopoulos, T., Carmeliet, J., and Hensen, J. L. Application of computational fluid dynamics in building performance simulation for the outdoor environment: an overview. Journal of building performance simulation, 4(2):157–184, 2011

  9. [9]

    Super: Evaluating agents on setting up and executing tasks from research repositories

    Bogin, B., Yang, K., Gupta, S., Richardson, K., Bransom, E., Clark, P., Sabharwal, A., and Khot, T. Super: Evaluating agents on setting up and executing tasks from research repositories. arXiv preprint arXiv:2409.07440, 2024

  10. [10]

    Autonomous chemical research with large language models

    Boiko, D. A., MacKnight, R., Kline, B., and Gomes, G. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023

  11. [11]

    ChemCrow: Augmenting large-language models with chemistry tools

    Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., and Schwaller, P. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023

  12. [12]

    Dedalus: A flexible framework for numerical simulations with spectral methods

    Burns, K. J., Vasil, G. M., Oishi, J. S., Lecoanet, D., and Brown, B. P. Dedalus: A flexible framework for numerical simulations with spectral methods. Physical Review Research, 2(2):023068, 2020

  13. [13]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., C...

  14. [14]

    Metaopenfoam: an llm-based multi-agent framework for cfd

    Chen, Y., Zhu, X., Zhou, H., and Ren, Z. Metaopenfoam: an llm-based multi-agent framework for cfd. arXiv preprint arXiv:2407.21320, 2024.

  15. [15]

    Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery

    Chen, Z., Chen, S., Ning, Y., Zhang, Q., Wang, B., Yu, B., Li, Y., Liao, Z., Wei, C., Lu, Z., et al. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024

  16. [16]

    LLMPhy: Parameter-Identifiable Physical Reasoning Combining Large Language Models and Physics Engines

    Cherian, A., Corcodel, R., Jain, S., and Romeres, D. Llmphy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027, 2024

  17. [17]

    Curie: Evaluating llms on multitask scientific long context understanding and reasoning

    Cui, H., Shamsi, Z., Cheon, G., Ma, X., Li, S., Tikhanovskaya, M., Norgaard, P., Mudur, N., Plomecka, M., Raccuglia, P., et al. Curie: Evaluating llms on multitask scientific long context understanding and reasoning. arXiv preprint arXiv:2503.13517, 2025

  18. [18]

    Start building with gemini 2.5 flash

    DeepMind, G. Start building with gemini 2.5 flash. https://developers.googleblog.com/en/start-building-with-gemini-25-flash/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=, 2025. Accessed: 2025-05-03

  19. [19]

    FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

    Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., Santos, E. d. O., et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai. arXiv preprint arXiv:2411.04872, 2024

  20. [20]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  21. [21]

    Jacobs, P. F. and Pollice, R. Developing large language models for quantum chemistry simulation input generation. Digital Discovery, 2025

  22. [22]

    Large language model agent as a mechanical designer

    Jadhav, Y. and Farimani, A. B. Large language model agent as a mechanical designer. arXiv preprint arXiv:2404.17525, 2024

  23. [23]

    Openfoam: A c++ library for complex physics simulations

    Jasak, H., Jemcov, A., Tukovic, Z., et al. Openfoam: A c++ library for complex physics simulations. In International workshop on coupled methods in numerical dynamics, volume 1000, pp. 1–20. IUC Dubrovnik Croatia, 2007

  24. [24]

    Eplus-llm: A large language model-based computing platform for automated building energy modeling

    Jiang, G., Ma, Z., Zhang, L., and Chen, J. Eplus-llm: A large language model-based computing platform for automated building energy modeling. Applied Energy, 367:123431, 2024

  25. [25]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  26. [26]

    Kumar, V., Gleyzer, L., Kahana, A., Shukla, K., and Karniadakis, G. E. Mycrunchgpt: A llm assisted framework for scientific machine learning. Journal of Machine Learning for Modeling and Computing, 4(4), 2023

  27. [27]

    Engr 491: Computational fluid dynamics

    Lab, O. Engr 491: Computational fluid dynamics. https://github.com/okcfdlab/engr491. Accessed: 2025-05-16

  29. [29]

    Ds-1000: A natural and reliable benchmark for data science code generation

    Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., Yih, W.-t., Fried, D., Wang, S., and Yu, T. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp. 18319–18345. PMLR, 2023

  30. [30]

    Aquarium: A fully differentiable fluid-structure interaction solver for robotics applications

    Lee, J. H., Michelis, M. Y., Katzschmann, R., and Manchester, Z. Aquarium: A fully differentiable fluid-structure interaction solver for robotics applications. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11272–11279. IEEE, 2023

  31. [31]

    Qasa: advanced question answering on scientific articles

    Lee, Y., Lee, K., Park, S., Hwang, D., Kim, J., Lee, H.-i., and Lee, M. Qasa: advanced question answering on scientific articles. In International Conference on Machine Learning, pp. 19036–19052. PMLR, 2023

  32. [32]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  33. [33]

    Fea-bench: A benchmark for evaluating repository-level code generation for feature implementation

    Li, W., Zhang, X., Guo, Z., Mao, S., Luo, W., Peng, G., Huang, Y., Wang, H., and Li, S. Fea-bench: A benchmark for evaluating repository-level code generation for feature implementation. arXiv preprint arXiv:2503.06680, 2025.

  34. [34]

    Rouge: A package for automatic evaluation of summaries

    Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. pp. 10, 01 2004

  35. [35]

    Biogpt: generative pre-trained transformer for biomedical text generation and mining

    Luo, R., Sun, L., Xia, Y., Qin, T., Zhang, S., Poon, H., and Liu, T.-Y. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings in bioinformatics, 23(6):bbac409, 2022

  36. [36]

    Discoverybench: Towards data-driven discovery with large language models

    Majumder, B. P., Surana, H., Agarwal, D., Mishra, B. D., Meena, A., Prakhar, A., Vora, T., Khot, T., Sabharwal, A., and Clark, P. Discoverybench: Towards data-driven discovery with large language models. arXiv preprint arXiv:2407.01725, 2024

  37. [37]

    Bixbench: a comprehensive benchmark for llm-based agents in computational biology

    Mitchener, L., Laurent, J. M., Tenmann, B., Narayanan, S., Wellawatte, G. P., White, A., Sani, L., and Rodriques, S. G. Bixbench: a comprehensive benchmark for llm-based agents in computational biology. arXiv preprint arXiv:2503.00096, 2025

  38. [38]

    Aviary: training language agents on challenging scientific tasks

    Narayanan, S., Braza, J. D., Griffiths, R.-R., Ponnapati, M., Bou, A., Laurent, J., Kabeli, O., Wellawatte, G., Cox, S., Rodriques, S. G., et al. Aviary: training language agents on challenging scientific tasks. arXiv preprint arXiv:2412.21154, 2024

  39. [39]

    Hello gpt-4o

    OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2025-05-03

  40. [40]

    Openai o3-mini

    OpenAI. Openai o3-mini. https://openai.com/index/openai-o3-mini/, 2024. Accessed: 2025-05-03

  41. [41]

    Openfoamgpt: A retrieval-augmented large language model (llm) agent for openfoam-based computational fluid dynamics

    Pandey, S., Xu, R., Wang, W., and Chu, X. Openfoamgpt: A retrieval-augmented large language model (llm) agent for openfoam-based computational fluid dynamics. Physics of Fluids, 37(3), 2025

  42. [42]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

  43. [43]

    Gpqa: A graduate-level google-proof q&a benchmark

    Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  44. [44]

    A review of computational fluid dynamics application to investigate tropical cyclone wind speeds

    Shah, M., Norris, S. E., Turner, R., and Flay, R. G. A review of computational fluid dynamics application to investigate tropical cyclone wind speeds. Natural Hazards, 117(1):897–915, 2023

  45. [45]

    Neural lander: Stable drone landing control using learned dynamics

    Shi, G., Shi, X., O’Connell, M., Yu, R., Azizzadenesheli, K., Anandkumar, A., Yue, Y., and Chung, S.-J. Neural lander: Stable drone landing control using learned dynamics. In 2019 international conference on robotics and automation (icra), pp. 9784–9790. IEEE, 2019

  46. [46]

    Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark

    Siegel, Z. S., Kapoor, S., Nagdir, N., Stroebl, B., and Narayanan, A. Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. arXiv preprint arXiv:2409.11363, 2024

  47. [47]

    Toward expert-level medical question answering with large language models

    Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Amin, M., Hou, L., Clark, K., Pfohl, S. R., Cole-Lewis, H., et al. Toward expert-level medical question answering with large language models. Nature Medicine, pp. 1–8, 2025

  48. [48]

    Cfd vision 2030 study: a path to revolutionary computational aerosciences

    Slotnick, J. P., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D. J. Cfd vision 2030 study: a path to revolutionary computational aerosciences. Technical report, 2014

  49. [49]

    PaperBench: Evaluating AI's Ability to Replicate AI Research

    Starace, G., Jaffe, O., Sherburn, D., Aung, J., Chan, J. S., Maksin, L., Dias, R., Mays, E., Kinsella, B., Thompson, W., et al. Paperbench: Evaluating ai’s ability to replicate ai research. arXiv preprint arXiv:2504.01848, 2025

  50. [50]

    Galactica: A Large Language Model for Science

    Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.

  51. [51]

    Team, G. Gemma. 2024. doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301

  52. [52]

    Scicode: A research coding benchmark curated by scientists

    Tian, M., Gao, L., Zhang, S., Chen, X., Fan, C., Guo, X., Haas, R., Ji, P., Krongchon, K., Li, Y., et al. Scicode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems, 37:30624–30650, 2024

  53. [54]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A. R., Zhang, S., Sun, Y., and Wang, W. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023

  54. [55]

    A tensorial approach to computational continuum mechanics using object-oriented techniques

    Weller, H. G., Tabor, G., Jasak, H., and Fureby, C. A tensorial approach to computational continuum mechanics using object-oriented techniques. Computers in physics, 12(6):620–631, 1998

  55. [56]

    Foam-agent: Towards automated intelligent cfd workflows

    Yue, L., Somasekharan, N., Cao, Y., and Pan, S. Foam-agent: Towards automated intelligent cfd workflows. arXiv preprint arXiv:2505.04997, 2025

  56. [57]

    Physreason: A comprehensive benchmark towards physics-based reasoning

    Zhang, X., Dong, Y., Wu, Y., Huang, J., Jia, C., Fernando, B., Shou, M. Z., Zhang, L., and Liu, J. Physreason: A comprehensive benchmark towards physics-based reasoning. arXiv preprint arXiv:2502.12054, 2025.

A Dataset Curation

A.1 CFDQuery

This Question and Answer dataset spans a broad spectrum of PDEs, numerical methods and error-analysis topics. I...

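  Option 1: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = \frac{a\Delta x^2}{6}\frac{\partial^3 u}{\partial x^3} + O(\Delta x^3)$

  Option 2: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = \frac{a\Delta x^2}{2}\frac{\partial^2 u}{\partial x^2} + O(\Delta x^3)$

  Option 3: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = -\frac{a\Delta x^2}{6}\frac{\partial^3 u}{\partial x^3} + \frac{a\Delta t^2}{6}\frac{\partial^3 u}{\partial t^3} + O(\Delta x^3)$

  Option 4: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = \frac{a\Delta x^2}{6}\frac{\partial^3 u}{\partial x^3} - \frac{a^3\Delta t^2}{6}\frac{\partial^3 u}{\partial x^3} + O(\Delta x^3)$

  Correct Answer: Option 4

  Model Responses:
  • Sonnet 3.5: Option 4 ✓
  • o3-mini: Option 3 ✗
  • Gemini 2.5 Flash: Option 4 ✓
  • Haiku 3.5: Option 1 ✗
  • GPT-4o: Option 3 ✗
  • Gemma-2-9B-IT: Option 1 ✗

C.2 CFDCodeBench

The visual comparison of the model produced results and the ground truth solution at the final timestep for...

Models such as o3-mini, Haiku 3.5 and Gemini 2.5 Flash are able to closely match the ground truth solution for the 1D Burgers equation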