CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Anurag Acharya; Ling Yue; Nithin Somasekharan; Patrick Emami; Pochinapeddi Sai Bhargav; Shaowu Pan; Weichao Li; Xingyu Xie; Yadi Cao

arxiv: 2509.20374 · v3 · submitted 2025-09-19 · 💻 cs.CL · cs.AI

Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan

Pith reviewed 2026-05-18 15:01 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language models · computational fluid dynamics · benchmark · numerical reasoning · code generation · scientific computing · evaluation framework

The pith

A new benchmark suite with three components evaluates how well large language models perform on graduate-level computational fluid dynamics knowledge, reasoning, and code implementation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models perform well on many language tasks, yet their capacity to automate labor-intensive numerical experiments on complex physical systems like fluid flow has received little direct testing. The paper creates CFDLLMBench to fill this gap with three linked parts that together check knowledge recall, physical and numerical reasoning, and the ability to write working CFD code in realistic settings. Evaluation rests on measurable outcomes: whether generated code runs, whether it produces accurate solutions, and whether it reaches numerical convergence. A reader would care because CFD remains the main tool for simulating flows in engineering and science, so reliable LLM assistance could cut the time spent on repetitive setup and debugging. The benchmark draws its tasks from actual CFD practice to make the scores meaningful for future automation efforts.

Core claim

The paper presents CFDLLMBench as a benchmark suite that contains CFDQuery for graduate-level CFD knowledge, CFDCodeBench for numerical and physical reasoning, and FoamBench for context-dependent CFD workflow implementation, all supported by a task taxonomy and an evaluation framework that tracks code executability, solution accuracy, and numerical convergence behavior.
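To make the three tracked quantities concrete, here is a minimal sketch of how each might be computed for a single generated solution. The function names, the relative L2 error, and the observed-order estimate are illustrative assumptions, not the paper's definitions (the paper gives its own in Section 3.3).

```python
import math


def runs_cleanly(source: str) -> bool:
    """Executability (assumed definition): the code compiles and runs without raising.

    exec() is used only to keep the sketch short; untrusted generated code
    should be run in a sandboxed subprocess instead.
    """
    try:
        exec(compile(source, "<generated>", "exec"), {"__name__": "__generated__"})
    except Exception:
        return False
    return True


def relative_l2_error(predicted, reference):
    """Solution accuracy (assumed metric): ||pred - ref||_2 / ||ref||_2."""
    if len(predicted) != len(reference):
        raise ValueError("fields must have the same number of points")
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    # Fall back to the absolute error when the reference field is identically zero.
    return num / den if den > 0 else num


def observed_order(err_coarse: float, err_fine: float, refinement: float = 2.0) -> float:
    """Convergence (assumed metric): log(e_coarse / e_fine) / log(refinement ratio)."""
    return math.log(err_coarse / err_fine) / math.log(refinement)
```

For example, an error that falls from 0.04 to 0.01 when the grid spacing is halved gives an observed order of about 2, which is what a second-order scheme should show.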

What carries the argument

The CFDLLMBench suite, whose three complementary components together test distinct LLM competencies using tasks drawn from real-world CFD practices and a consistent set of reproducibility-focused metrics.

If this is right

LLM performance in automating CFD numerical experiments can now be measured systematically across knowledge, reasoning, and implementation stages.
Developers gain concrete scores on code executability, solution accuracy, and convergence that can guide model improvement.
A reusable foundation exists for building and checking LLM tools that assist with complex physical-system simulations.
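A per-model score of this kind can be aggregated as in the sketch below. It follows the caption of Figure 2, which describes Success Rate as the fraction of benchmark cases that produce physically accurate results. The `CaseResult` record and the 5% error tolerance are hypothetical stand-ins; the paper's per-task criteria are in its Section 3.3.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one benchmark case (hypothetical record layout)."""
    case_id: str
    executed: bool
    rel_error: Optional[float]  # None when no solution was produced


def success_rate(results: Sequence[CaseResult], tolerance: float = 0.05) -> float:
    """Fraction of cases that ran and met the (assumed) accuracy tolerance."""
    if not results:
        return 0.0
    passed = sum(
        1
        for r in results
        if r.executed and r.rel_error is not None and r.rel_error <= tolerance
    )
    # Failed executions stay in the denominator, so they count against the model.
    return passed / len(results)
```

Keeping failed executions in the denominator is a deliberate choice in this sketch: a model that rarely produces runnable code should not score well just because its few runnable outputs happen to be accurate.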

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-competency structure could be adapted to create benchmarks for other simulation-heavy fields such as heat transfer or structural analysis.
Strong benchmark results might indicate which models are ready for iterative, feedback-driven CFD workflows that combine LLM suggestions with live solver output.
Extending the benchmark with time-dependent or multi-physics problems would test whether current LLM reasoning scales to more demanding CFD scenarios.

Load-bearing premise

The chosen tasks and metrics in CFDQuery, CFDCodeBench, and FoamBench faithfully represent the main difficulties and everyday practices of computational fluid dynamics without major biases or missing areas.

What would settle it

If models that score well on the benchmark produce code that fails to run, yields wrong answers, or diverges when applied to standard CFD test cases drawn from outside the benchmark, the claim that the suite measures genuine CFD capability would be weakened.

Figures

Figures reproduced from arXiv: 2509.20374 by Anurag Acharya, Ling Yue, Nithin Somasekharan, Patrick Emami, Pochinapeddi Sai Bhargav, Shaowu Pan, Weichao Li, Xingyu Xie, Yadi Cao.

**Figure 1.** Overview of CFDLLMBench: As the first ever LLM benchmark designed to holistically evaluate LLMs' capabilities for CFD, it consists of three different tasks and datasets. (1) CFDQuery: Graduate-level CFD QA. (2) CFDCodeBench: Coding questions about solving common linear/nonlinear PDEs encountered in CFD. (3) FoamBench: Configuring OpenFOAM case files for simulating realistic engineering scenarios such as in…

**Figure 2.** Success Rate comparison of different models across the three tasks. Success Rate is the fraction of cases in the benchmark that produce physically accurate results (higher is better). The detailed definition of Success Rate for each benchmark task can be found in section 3.3. The results for FoamBench are produced using the Foam-Agent framework with RAG, Reviewer, and Sonnet 3.5. There is a steep drop in p…

**Figure 3.** Average metric score and Success Rate for CFDCodeBench. The Success Rate for even the best-performing models is around 14%, suggesting the challenging nature of the problems in this benchmark.

**Figure 4.** Average metric score and Success Rate for different models on FoamBench using the Foam-Agent framework with RAG and Reviewer. The Success Rate for even the best-performing model (Sonnet 3.5) is 34% on the basic dataset and 25% on the advanced dataset.

**Figure 5.** Comparison of the geometry and mesh generated by Foam-Agent [54] (RAG and Reviewer) with Sonnet 3.5 for the doubleSquare case against a human expert. Spatial reasoning: The CFD simulation workflows in FoamBench have preprocessing steps where a correct geometry and mesh file must be generated by the LLM. To handle real-world workflows, LLMs should be able to extrapolate to novel geometries. We highlight a …

**Figure 6.** OpenFOAM reference case files defining the initial and boundary conditions. Finally, the prompt that is input to the frameworks is shown below. Since they are tutorial problems, we do not describe the geometry at great length and assume the RAG should be able to pick it out based on the description. Prompt: Do a laminar, compressible flow over a forward-facing step using the rhoCentralFoam solver. Boundary …

**Figure 7.** OpenFOAM reference case files defining the physical properties and turbulence models.

**Figure 8.** OpenFOAM reference case files defining the solver configurations, geometry and mesh.

**Figure 9.** Common reasons for execution failure found in MetaOpenFoam and Foam-Agent with RAG and Reviewer, using Sonnet 3.5 as the prompt model.

**Figure 10.** Solution comparison at the final time step for the 1D Burgers equation.

**Figure 11.** X-direction velocity (u) comparison at the final time step for the 2D Convection equation.

**Figure 12.** Comparison of velocity magnitude at the final time step for the 2D Cavity case.

**Figure 13.** Comparison of velocity magnitude at the final time step for the 2D forwardStep case.
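The 1D Burgers equation in Figure 10 is the kind of PDE that CFDCodeBench asks models to solve. The sketch below is an illustration only and is not taken from the benchmark: it assumes the viscous form u_t + u u_x = nu u_xx on a periodic domain of length 2π, a hypothetical initial condition u = 1 + 0.5 sin(x), and a simple explicit scheme (forward Euler in time, upwind convection, central diffusion).

```python
import math


def burgers_step(u, dt, dx, nu):
    """Advance one forward-Euler step on a periodic grid.

    Convection uses a backward (upwind) difference, valid while u > 0;
    diffusion uses a central second difference.
    """
    n = len(u)
    new = [0.0] * n
    for i in range(n):
        left = u[i - 1]          # index -1 wraps to the last point (periodic)
        right = u[(i + 1) % n]
        convection = u[i] * (u[i] - left) / dx
        diffusion = nu * (right - 2.0 * u[i] + left) / dx ** 2
        new[i] = u[i] + dt * (diffusion - convection)
    return new


def solve_burgers(nx=64, nu=0.07, t_end=0.5):
    """Integrate from t = 0 to t_end and return the final field."""
    dx = 2.0 * math.pi / nx
    u = [1.0 + 0.5 * math.sin(i * dx) for i in range(nx)]
    # Pick dt so that u_max*dt/dx + 2*nu*dt/dx**2 <= 0.5, which keeps
    # this explicit scheme stable and free of new extrema.
    dt = 0.5 / (1.5 / dx + 2.0 * nu / dx ** 2)
    steps = math.ceil(t_end / dt)
    dt = t_end / steps           # shrink dt slightly to land exactly on t_end
    for _ in range(steps):
        u = burgers_step(u, dt, dx, nu)
    return u
```

With this time-step restriction each update is a convex combination of neighboring values, so the solution stays within its initial bounds of 0.5 and 1.5. A boundedness check like that is cheap to run on generated code before comparing against a reference solution.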

Original abstract

Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper builds a three-part benchmark for LLMs on CFD tasks, with the implementation side tied closely to OpenFOAM, which narrows its reach.

The main contribution is a new benchmark suite, CFDLLMBench, that splits evaluation into CFDQuery for graduate-level knowledge, CFDCodeBench for numerical and physical reasoning, and FoamBench for implementing workflows. The authors release code and data, which makes the work immediately usable by others who want to test models on actual simulation steps rather than abstract questions. That release, together with the focus on metrics such as executability, solution accuracy, and convergence, gives the effort a practical edge over purely theoretical proposals. The task taxonomy also tries to map onto real CFD steps, which is a reasonable way to move beyond general NLP benchmarks.

The stress-test concern about the OpenFOAM-centric design holds up against the paper's own description: if FoamBench mainly involves editing case files, boundary conditions, and solver settings for that one package, then high scores may reflect pattern matching to OpenFOAM syntax more than reasoning that transfers across finite-volume, finite-element, or other methods. This limits coverage for the claim of holistic evaluation of context-dependent implementation. The paper does not appear to include broad comparisons to other CFD codes, so the generality of the implementation component is narrower than the abstract suggests. No load-bearing math or fitted parameters are involved, so there is little circularity risk.

The work shows coherent structure and honest engagement with the goal of testing scientific capabilities in a domain-specific setting. It is aimed at groups working on AI-assisted engineering simulations and at researchers who need concrete testbeds for domain LLMs. A reader looking for ready-to-use evaluation resources in scientific computing would find the released materials and framework worth examining. It deserves peer review because the benchmark construction itself is a concrete, shareable output that can be tested and extended, even if revisions should address the scope of the implementation tasks.
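To show how package-specific the case-file structure is, here is a rough sketch of a structural check on a generated OpenFOAM case. It assumes the generated output arrives as a mapping from relative file paths to file contents (a hypothetical interface, not the benchmark's) and checks only for the conventional `0/`, `constant/`, and `system/` layout with the three standard `system` dictionaries. The start-time directory is commonly `0` but need not be, so a real checker would have to read it from `controlDict`.

```python
# Conventional OpenFOAM dictionaries that a runnable case needs in system/.
REQUIRED_FILES = ("system/controlDict", "system/fvSchemes", "system/fvSolution")
# Directories that should hold at least one file each (start time assumed to be 0).
REQUIRED_DIRS = ("0", "constant")


def missing_case_entries(generated_files):
    """Return the required files and directories absent from a generated case.

    generated_files: mapping of relative POSIX path -> file content
    (an assumed interface for LLM output, used here for illustration).
    """
    paths = set(generated_files)
    missing = [f for f in REQUIRED_FILES if f not in paths]
    for directory in REQUIRED_DIRS:
        # A directory counts as present if any generated file lives under it.
        if not any(p.startswith(directory + "/") for p in paths):
            missing.append(directory + "/")
    return missing
```

Passing this check says nothing about physical correctness; it only confirms the layout a solver expects. That is the concern in a nutshell: a model can learn this layout by rote without it transferring to any other CFD code.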

Referee Report

1 major / 1 minor

Summary. The paper introduces CFDLLMBench, a benchmark suite comprising three complementary components—CFDQuery, CFDCodeBench, and FoamBench—designed to holistically evaluate LLM performance across graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, the benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results quantifying LLM performance on code executability, solution accuracy, and numerical convergence behavior.

Significance. If the tasks and metrics accurately capture core CFD challenges, this benchmark could provide a valuable standardized foundation for assessing and advancing LLM capabilities in automating numerical experiments for complex physical systems. The open release of code and data at the provided GitHub repository is a clear strength that supports reproducibility and community use.

major comments (1)

FoamBench component: the OpenFOAM-centric design risks measuring familiarity with one specific package's syntax and case-file structure rather than transferable context-dependent implementation skills across CFD methods (e.g., finite-element or spectral approaches). This directly affects the central claim that the three components together provide a holistic evaluation of 'context-dependent implementation of CFD workflows' grounded in general real-world practices.

minor comments (1)

Abstract: the scale of the benchmark (number of tasks or queries per component) is not quantified, which would help readers assess coverage and effort required for evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing CFDLLMBench. We address the major comment below and have revised the manuscript to clarify the design and scope of the benchmark suite.

Referee: FoamBench component: the OpenFOAM-centric design risks measuring familiarity with one specific package's syntax and case-file structure rather than transferable context-dependent implementation skills across CFD methods (e.g., finite-element or spectral approaches). This directly affects the central claim that the three components together provide a holistic evaluation of 'context-dependent implementation of CFD workflows' grounded in general real-world practices.

Authors: We appreciate the referee's point regarding the specificity of FoamBench. OpenFOAM was selected because it is a widely adopted, open-source finite-volume CFD platform used extensively in both academic research and industrial applications for simulating complex flows. The tasks in FoamBench focus on practical skills such as case configuration, boundary condition specification, solver parameter tuning, and ensuring numerical convergence, which reflect core elements of real-world CFD workflows. We acknowledge, however, that this design emphasizes implementation within one particular software ecosystem and does not directly assess transferable skills for alternative discretizations such as finite-element or spectral methods. To address this, we have revised the manuscript to explicitly state the rationale for the OpenFOAM focus, to temper the claim of full holism across all CFD paradigms, and to add a limitations paragraph noting that extensions to other frameworks would broaden coverage of context-dependent implementation skills.

Revision: yes

Circularity Check

0 steps flagged

Benchmark construction with no derivation chain or circular reductions

The paper introduces CFDLLMBench as a new benchmark suite with three components (CFDQuery, CFDCodeBench, FoamBench) to evaluate LLM competencies in CFD knowledge, reasoning, and implementation. No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The design is presented as grounded in real-world practices with a task taxonomy and evaluation framework; claims about holistic evaluation are supported directly by the described components rather than self-citations, ansatzes, or uniqueness theorems. This is a standard benchmark paper whose central contribution is self-contained construction, warranting no circularity flags.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that the chosen tasks represent authentic CFD practice and that the three metrics sufficiently quantify LLM utility for numerical experiments.

axioms (1)

domain assumption: The tasks in CFDQuery, CFDCodeBench, and FoamBench accurately reflect real-world CFD practices and challenges. Explicitly stated in the abstract as 'Grounded in real-world CFD practices'.

pith-pipeline@v0.9.0 · 5763 in / 1227 out tokens · 44932 ms · 2026-05-18T15:01:27.758013+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: FoamBench: Configuring OpenFOAM case files for simulating realistic engineering scenarios such as incompressible flow over obstacles, supersonic flow with shockwaves, Rayleigh-Bénard convection

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
cs.AI · 2026-05 · unverdicted · novelty 7.0
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
cs.AI · 2026-04 · unverdicted · novelty 7.0
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
cs.CR · 2026-05 · conditional · novelty 6.0
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 3 Pith papers · 12 internal anchors

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Anthropic. Claude 3.5 Sonnet model card addendum. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf. Accessed: 2025-05-03.
[4] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[5] Barba, L. A. and Forsyth, G. F. CFD Python: the 12 steps to Navier-Stokes equations. Journal of Open Source Education, 2(16):21, 2018.
[6] Beltagy, I., Lo, K., and Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019.
[7] Blocken, B. Computational fluid dynamics for urban physics: Importance, scales, possibilities, limitations and ten tips and tricks towards accurate and reliable simulations. Building and Environment, 91:219–245, 2015.
[8] Blocken, B., Stathopoulos, T., Carmeliet, J., and Hensen, J. L. Application of computational fluid dynamics in building performance simulation for the outdoor environment: an overview. Journal of Building Performance Simulation, 4(2):157–184, 2011.
[9] Bogin, B., Yang, K., Gupta, S., Richardson, K., Bransom, E., Clark, P., Sabharwal, A., and Khot, T. SUPER: Evaluating agents on setting up and executing tasks from research repositories. arXiv preprint arXiv:2409.07440, 2024.
[10] Boiko, D. A., MacKnight, R., Kline, B., and Gomes, G. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
[11] Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., and Schwaller, P. ChemCrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.
[12] Burns, K. J., Vasil, G. M., Oishi, J. S., Lecoanet, D., and Brown, B. P. Dedalus: A flexible framework for numerical simulations with spectral methods. Physical Review Research, 2(2):023068, 2020.
[13] Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., C...
[14] Chen, Y., Zhu, X., Zhou, H., and Ren, Z. MetaOpenFOAM: an LLM-based multi-agent framework for CFD. arXiv preprint arXiv:2407.21320, 2024.
[15] Chen, Z., Chen, S., Ning, Y., Zhang, Q., Wang, B., Yu, B., Li, Y., Liao, Z., Wei, C., Lu, Z., et al. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024.
[16] Cherian, A., Corcodel, R., Jain, S., and Romeres, D. LLMPhy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027, 2024.
[17] Cui, H., Shamsi, Z., Cheon, G., Ma, X., Li, S., Tikhanovskaya, M., Norgaard, P., Mudur, N., Plomecka, M., Raccuglia, P., et al. CURIE: Evaluating LLMs on multitask scientific long context understanding and reasoning. arXiv preprint arXiv:2503.13517, 2025.
[18] DeepMind, G. Start building with Gemini 2.5 Flash. https://developers.googleblog.com/en/start-building-with-gemini-25-flash/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=, 2025. Accessed: 2025-05-03.
[19] Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., Santos, E. d. O., et al. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv preprint arXiv:2411.04872, 2024.
[20] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[21] Jacobs, P. F. and Pollice, R. Developing large language models for quantum chemistry simulation input generation. Digital Discovery, 2025.
[22] Jadhav, Y. and Farimani, A. B. Large language model agent as a mechanical designer. arXiv preprint arXiv:2404.17525, 2024.
[23] Jasak, H., Jemcov, A., Tukovic, Z., et al. OpenFOAM: A C++ library for complex physics simulations. In International workshop on coupled methods in numerical dynamics, volume 1000, pp. 1–20. IUC Dubrovnik Croatia, 2007.
[24] Jiang, G., Ma, Z., Zhang, L., and Chen, J. Eplus-LLM: A large language model-based computing platform for automated building energy modeling. Applied Energy, 367:123431, 2024.
[25] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
[26] Kumar, V., Gleyzer, L., Kahana, A., Shukla, K., and Karniadakis, G. E. MyCrunchGPT: A LLM assisted framework for scientific machine learning. Journal of Machine Learning for Modeling and Computing, 4(4), 2023.
[27] Lab, O. ENGR 491: Computational fluid dynamics. https://github.com/okcfdlab/engr491. Accessed: 2025-05-16.
[29] Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., Yih, W.-t., Fried, D., Wang, S., and Yu, T. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp. 18319–18345. PMLR, 2023.
[30] Lee, J. H., Michelis, M. Y., Katzschmann, R., and Manchester, Z. Aquarium: A fully differentiable fluid-structure interaction solver for robotics applications. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11272–11279. IEEE, 2023.
[31] Lee, Y., Lee, K., Park, S., Hwang, D., Kim, J., Lee, H.-i., and Lee, M. QASA: advanced question answering on scientific articles. In International Conference on Machine Learning, pp. 19036–19052. PMLR, 2023.
[32] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[33] Li, W., Zhang, X., Guo, Z., Mao, S., Luo, W., Peng, G., Huang, Y., Wang, H., and Li, S. FEA-Bench: A benchmark for evaluating repository-level code generation for feature implementation. arXiv preprint arXiv:2503.06680, 2025.
[34]

Rouge: A package for automatic evaluation of summaries

Lin, C.-Y . Rouge: A package for automatic evaluation of summaries. pp. 10, 01 2004

work page 2004
[35]

Biogpt: generative pre-trained transformer for biomedical text generation and mining.Briefings in bioinformatics, 23(6):bbac409, 2022

Luo, R., Sun, L., Xia, Y ., Qin, T., Zhang, S., Poon, H., and Liu, T.-Y . Biogpt: generative pre-trained transformer for biomedical text generation and mining.Briefings in bioinformatics, 23(6):bbac409, 2022

work page 2022
[36]

10 Preprint

Majumder, B. P., Surana, H., Agarwal, D., Mishra, B. D., Meena, A., Prakhar, A., V ora, T., Khot, T., Sabharwal, A., and Clark, P. Discoverybench: Towards data-driven discovery with large language models.arXiv preprint arXiv:2407.01725, 2024

work page arXiv 2024
[37]

Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P

Mitchener, L., Laurent, J. M., Tenmann, B., Narayanan, S., Wellawatte, G. P., White, A., Sani, L., and Rodriques, S. G. Bixbench: a comprehensive benchmark for llm-based agents in computational biology.arXiv preprint arXiv:2503.00096, 2025

work page arXiv 2025
[38]

Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G

Narayanan, S., Braza, J. D., Griffiths, R.-R., Ponnapati, M., Bou, A., Laurent, J., Kabeli, O., Wellawatte, G., Cox, S., Rodriques, S. G., et al. Aviary: training language agents on challenging scientific tasks.arXiv preprint arXiv:2412.21154, 2024

work page arXiv 2024
[39]

Hello gpt-4o

OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2025-05-03

work page 2024
[40]

Openai o3-mini

OpenAI. Openai o3-mini. https://openai.com/index/openai-o3-mini/, 2024. Accessed: 2025-05-03

work page 2024
[41]

Openfoamgpt: A retrieval-augmented large language model (llm) agent for openfoam-based computational fluid dynamics.Physics of Fluids, 37(3), 2025

Pandey, S., Xu, R., Wang, W., and Chu, X. Openfoamgpt: A retrieval-augmented large language model (llm) agent for openfoam-based computational fluid dynamics.Physics of Fluids, 37(3), 2025

work page 2025
[42]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Qin, Y ., Liang, S., Ye, Y ., Zhu, K., Yan, L., Lu, Y ., Lin, Y ., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

L., Stickland, A

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y ., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024

work page 2024
[44]

E., Turner, R., and Flay, R

Shah, M., Norris, S. E., Turner, R., and Flay, R. G. A review of computational fluid dynamics application to investigate tropical cyclone wind speeds.Natural Hazards, 117(1):897–915, 2023

[45]

Shi, G., Shi, X., O’Connell, M., Yu, R., Azizzadenesheli, K., Anandkumar, A., Yue, Y., and Chung, S.-J. Neural lander: Stable drone landing control using learned dynamics. In 2019 international conference on robotics and automation (icra), pp. 9784–9790. IEEE, 2019

[46]

Siegel, Z. S., Kapoor, S., Nagdir, N., Stroebl, B., and Narayanan, A. Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. arXiv preprint arXiv:2409.11363, 2024

[47]

Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Amin, M., Hou, L., Clark, K., Pfohl, S. R., Cole-Lewis, H., et al. Toward expert-level medical question answering with large language models. Nature Medicine, pp. 1–8, 2025

[48]

Slotnick, J. P., Khodadoust, A., Alonso, J., Darmofal, D., Gropp, W., Lurie, E., and Mavriplis, D. J. Cfd vision 2030 study: a path to revolutionary computational aerosciences. Technical report, 2014

[49]

Starace, G., Jaffe, O., Sherburn, D., Aung, J., Chan, J. S., Maksin, L., Dias, R., Mays, E., Kinsella, B., Thompson, W., et al. Paperbench: Evaluating ai’s ability to replicate ai research. arXiv preprint arXiv:2504.01848, 2025

[50]

Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022

[51]

Team, G. Gemma. 2024. doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301

[52]

Tian, M., Gao, L., Zhang, S., Chen, X., Fan, C., Guo, X., Haas, R., Ji, P., Krongchon, K., Li, Y., et al. Scicode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems, 37:30624–30650, 2024

[54]

Wang, X., Hu, Z., Lu, P., Zhu, Y., Zhang, J., Subramaniam, S., Loomba, A. R., Zhang, S., Sun, Y., and Wang, W. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023

[55]

Weller, H. G., Tabor, G., Jasak, H., and Fureby, C. A tensorial approach to computational continuum mechanics using object-oriented techniques. Computers in physics, 12(6):620–631, 1998

[56]

Yue, L., Somasekharan, N., Cao, Y., and Pan, S. Foam-agent: Towards automated intelligent cfd workflows. arXiv preprint arXiv:2505.04997, 2025

[57]

Zhang, X., Dong, Y., Wu, Y., Huang, J., Jia, C., Fernando, B., Shou, M. Z., Zhang, L., and Liu, J. Physreason: A comprehensive benchmark towards physics-based reasoning. arXiv preprint arXiv:2502.12054, 2025

A Dataset Curation

A.1 CFDQuery

This Question and Answer dataset spans a broad spectrum of PDEs, numerical methods and error-analysis topics. I...

Option 1: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = \frac{a\Delta x^{2}}{6}\frac{\partial^{3} u}{\partial x^{3}} + O(\Delta x^{3})$

Option 2: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = \frac{a\Delta x^{2}}{2}\frac{\partial^{2} u}{\partial x^{2}} + O(\Delta x^{3})$

Option 3: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = -\frac{a\Delta x^{2}}{6}\frac{\partial^{3} u}{\partial x^{3}} + \frac{a\Delta t^{2}}{6}\frac{\partial^{3} u}{\partial t^{3}} + O(\Delta x^{3})$

Option 4: $\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = \frac{a\Delta x^{2}}{6}\frac{\partial^{3} u}{\partial x^{3}} - \frac{a^{3}\Delta t^{2}}{6}\frac{\partial^{3} u}{\partial x^{3}} + O(\Delta x^{3})$

Correct Answer: Option 4

Model Responses:
• Sonnet 3.5: Option 4 ✓
• o3-mini: Option 3 ✗
• Gemini 2.5 Flash: Option 4 ✓
• Haiku 3.5: Option 1 ✗
• GPT-4o: Option 3 ✗
• Gemma-2-9B-IT: Option 1 ✗

C.2 CFDCodeBench

The visual comparison of the model-produced results and the ground truth solution at the final timestep for...

Models such as o3-mini, Haiku 3.5 and Gemini 2.5 Flash are able to closely match the ground truth solution for the 1D Burgers equation
