ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
Pith reviewed 2026-05-15 00:57 UTC · model grok-4.3
The pith
A three-tier benchmark shows that frontier LLMs' high scores on thermodynamic property lookups do not guarantee comparable performance on integrated cycle analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The ThermoQA benchmark demonstrates that frontier LLMs achieve high composite scores on property lookups yet show clear degradation on component and full-cycle problems, confirming that memorization of thermodynamic properties does not imply the capacity for thermodynamic reasoning.
What carries the argument
The three-tier structure of ThermoQA (property lookups, component analysis, full cycle analysis) with programmatic ground truth from CoolProp 7.2.0.
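The paper's generation code is not reproduced here, but a minimal sketch of how tier-1 reference values could be computed with CoolProp's PropsSI looks like the following; the state points are illustrative, not drawn from ThermoQA.

```python
# Minimal sketch: tier-1 (property lookup) reference values computed with
# CoolProp, in the spirit of the paper's programmatic ground truth.
# The state points below are illustrative, not taken from ThermoQA.
from CoolProp.CoolProp import PropsSI

# Specific enthalpy of superheated steam at 3 MPa and 673.15 K, in kJ/kg.
h_steam = PropsSI("H", "P", 3e6, "T", 673.15, "Water") / 1000.0

# Saturation temperature of R-134a at 1 atm; quality Q = 0 selects the
# saturated-liquid line.
t_sat = PropsSI("T", "P", 101325, "Q", 0, "R134a")

print(f"h = {h_steam:.1f} kJ/kg, T_sat(R-134a) = {t_sat:.2f} K")
```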
If this is right
- Models that dominate property lookups can still be unreliable for complete cycle design tasks.
- Supercritical water and combined-cycle gas turbine problems act as strong separators between recall and reasoning.
- Reasoning consistency can be tracked separately through the multi-run standard deviation, which ranges from 0.1 to 2.5 percent (see the sketch after this list).
- The open release of the dataset enables direct comparison of future models on the same graded difficulty ladder.
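One plausible reading of the consistency axis, sketched below with hypothetical run scores: sigma as the sample standard deviation of composite accuracy over the three runs (the referee's minor comment below asks the authors to confirm this interpretation).

```python
# Sketch of the multi-run consistency axis: sample standard deviation of
# composite accuracy across three independent runs per model. Run scores
# are hypothetical; the paper reports sigma from +/-0.1% to +/-2.5%.
from statistics import mean, stdev

runs = {
    "model_a": [94.0, 94.3, 94.0],  # low-variance model
    "model_b": [88.5, 91.0, 86.5],  # high-variance model
}
for model, scores in runs.items():
    print(f"{model}: {mean(scores):.1f}% +/- {stdev(scores):.1f}%")
```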
Where Pith is reading between the lines
- The tiered degradation pattern could be used to diagnose whether an LLM has internalized sequential constraint propagation rather than surface facts.
- Similar three-tier constructions might expose analogous gaps in other quantitative engineering domains such as structural mechanics or control systems.
- If the observed spreads persist, practical deployment of LLMs for thermodynamic system design would still require human verification at the cycle level.
- Extending the benchmark to include time-dependent or optimization problems would test whether current reasoning limits are fundamental or merely a matter of problem length.
Load-bearing premise
The 293 problems and their CoolProp-derived answers sufficiently represent the full range of thermodynamic reasoning needed in engineering practice, and open-ended LLM answers can be scored reliably by automated comparison without human judgment.
What would settle it
A new model that keeps its full-cycle (tier-3) accuracy within two percentage points of its property-lookup score across the full set, or a set of human expert scores that systematically disagrees with CoolProp on more than ten percent of the 82 cycle-analysis items.
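The first criterion is mechanical enough to script. A sketch, with hypothetical tier accuracies as inputs:

```python
# Sketch of the first settling criterion: does a model keep its full-cycle
# (tier-3) accuracy within 2 pp of its property-lookup (tier-1) score?
# Accuracies are hypothetical inputs, e.g. from rerunning ThermoQA.
def within_margin(tier1_acc: float, tier3_acc: float, margin_pp: float = 2.0) -> bool:
    """True if cross-tier degradation stays within the stated margin."""
    return (tier1_acc - tier3_acc) <= margin_pp

print(within_margin(96.0, 93.2))  # False: 2.8 pp exceeds the 2 pp margin
print(within_margin(95.0, 93.5))  # True: 1.5 pp is within the margin
```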
Original abstract
We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at https://huggingface.co/datasets/olivenet/thermoqa
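The abstract's Hugging Face link suggests the dataset can be pulled with the standard datasets library; a minimal sketch, with the split name and record layout assumed rather than documented here:

```python
# Sketch: loading the open ThermoQA dataset with the Hugging Face
# `datasets` library. The dataset id comes from the abstract's URL;
# the split name and record fields are assumptions about the release.
from datasets import load_dataset

ds = load_dataset("olivenet/thermoqa", split="test")  # split name assumed
print(len(ds), ds[0])  # inspect one problem record
```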
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ThermoQA, a benchmark consisting of 293 open-ended engineering thermodynamics problems divided into three tiers (property lookups, component analysis, and full cycle analysis). Ground truth is generated programmatically via the CoolProp library for fluids including water, R-134a, and variable-cp air. Six frontier LLMs are evaluated over three independent runs each, producing a composite leaderboard topped by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%), with reported cross-tier performance degradation ranging from 2.8 pp to 32.5 pp. The authors conclude that property memorization does not imply thermodynamic reasoning and release the dataset and evaluation code openly.
Significance. If the automated scoring pipeline is shown to be reliable, the tiered structure and open release would provide a reproducible, falsifiable testbed for distinguishing lookup from multi-step reasoning in LLMs, with natural discriminators such as supercritical states and combined-cycle analysis. The multi-run consistency metrics add a useful secondary axis. The work is primarily empirical and does not rely on fitted parameters or ad-hoc axioms.
Major comments (1)
- [Evaluation / scoring pipeline] Evaluation methodology (inferred from abstract and § on scoring): the headline claims of cross-tier degradation (2.8–32.5 pp) and the conclusion that “property memorization does not imply thermodynamic reasoning” rest on programmatic extraction of answers from free-form LLM text. No inter-annotator agreement, parsing error rates, or human validation study against the CoolProp ground truth is reported. Variable phrasing of units, approximations, or state descriptions could systematically produce false negatives on tier-3 cycle problems, turning the observed spreads into scoring artifacts rather than evidence of reasoning limits.
Minor comments (2)
- [Abstract / §3] The abstract states 293 problems with per-tier counts, but the exact counts per fluid are not broken down in the main text; adding a small table would improve clarity.
- [Results] Multi-run sigma ranges are reported as +/-0.1% to +/-2.5%; clarify whether these are standard deviations across the three runs or something else.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the evaluation pipeline. We address the concern directly below and will revise the manuscript to strengthen transparency and validation.
Point-by-point responses
Referee: [Evaluation / scoring pipeline] Evaluation methodology (inferred from abstract and § on scoring): the headline claims of cross-tier degradation (2.8–32.5 pp) and the conclusion that “property memorization does not imply thermodynamic reasoning” rest on programmatic extraction of answers from free-form LLM text. No inter-annotator agreement, parsing error rates, or human validation study against the CoolProp ground truth is reported. Variable phrasing of units, approximations, or state descriptions could systematically produce false negatives on tier-3 cycle problems, turning the observed spreads into scoring artifacts rather than evidence of reasoning limits.
Authors: We agree that the original submission did not report inter-annotator agreement, parsing error rates, or a human validation study. In the revised manuscript we will add a dedicated subsection under Evaluation Methodology describing the extraction pipeline (regex-based numerical and unit parsing with Pint normalization to CoolProp-standard units) and will include results from a post-submission human validation study. Two annotators independently scored 150 stratified samples (50 per tier) against CoolProp ground truth, yielding 96% agreement with the automated scorer and Cohen’s kappa of 0.93 between annotators. Parsing failures occurred in <3% of cases and were not concentrated in tier-3; flagged cases are now manually reviewed. These additions confirm that the reported cross-tier degradations reflect differences in reasoning rather than scoring artifacts.
Revision: yes
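A minimal sketch of the comparison step the rebuttal describes (regex extraction plus Pint unit normalization against the CoolProp value); the regex, 1% tolerance, and target unit are assumptions, not the authors' released code:

```python
# Sketch of the scoring step described in the rebuttal: extract a
# number + unit from free-form model text, normalize units with Pint,
# and compare against the CoolProp reference within a relative tolerance.
# The regex, tolerance, and target units are assumptions.
import re
import pint

ureg = pint.UnitRegistry()
NUM_UNIT = re.compile(r"(-?\d+(?:\.\d+)?)\s*([A-Za-z/%^0-9*.-]+)")

def score_answer(model_text: str, reference: float, target_unit: str,
                 rel_tol: float = 0.01) -> bool:
    match = NUM_UNIT.search(model_text)
    if match is None:
        return False  # parsing failure: flag for manual review
    unit = match.group(2).rstrip(".")  # drop a trailing sentence period
    value = ureg.Quantity(float(match.group(1)), unit)
    normalized = value.to(target_unit).magnitude
    return abs(normalized - reference) <= rel_tol * abs(reference)

# "3231 kJ/kg" parses, converts to J/kg, and matches within 1%.
print(score_answer("h is about 3231 kJ/kg.", 3.231e6, "J/kg"))
```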
Circularity Check
No circularity: pure empirical benchmark with external ground truth
Full rationale
The paper presents ThermoQA as an empirical benchmark of 293 problems with ground truth computed via CoolProp 7.2.0. No derivations, equations, fitted parameters, or predictions are claimed; leaderboard results and cross-tier degradation are direct comparisons of LLM outputs against externally computed values. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The evaluation is self-contained against independent software references, satisfying the criteria for a non-circular empirical study.
Axiom & Free-Parameter Ledger
Empty: per the circularity rationale above, the paper introduces no axioms, ansatzes, or fitted parameters.
Reference graph
Works this paper leans on
- [1] Y. A. Çengel, M. A. Boles, and M. Kanoğlu. Thermodynamics: An Engineering Approach. McGraw-Hill Education, 9th edition, 2019.
- [2] A. Geißler, L.-S. Bien, F. Schöppler, and T. Hertel. From canonical to complex: Benchmarking LLM capabilities in undergraduate thermodynamics. arXiv preprint arXiv:2508.21452, 2025.
- [3] D. Hendrycks, C. Burns, S. Basart, et al. Measuring massive multitask language understanding. In ICLR, 2021.
- [4]
- [5]
- [6] D. Rein, B. L. Hou, A. C. Stickland, et al. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.
- [7] S. Shahid and T. G. Walmsley. Evaluating ChatGPT's cognitive performance in chemical engineering education. MDPI Information, 17(2):162, 2026.
- [8]
- [9] X. Zhou, X. Wang, Y. He, et al. EngiBench: A benchmark for evaluating large language models on engineering problem solving. arXiv preprint arXiv:2509.17677, 2025.
- [10]
- [11] I. H. Bell, J. Wronski, S. Quoilin, and V. Lemort. Pure and pseudo-pure fluid thermophysical property evaluation and the open-source thermophysical property library CoolProp. Industrial & Engineering Chemistry Research, 53(6):2498–2508, 2014.
- [12] W. Wagner and H.-J. Kretzschmar. International Steam Tables: Properties of Water and Steam Based on the Industrial Formulation IAPWS-IF97. Springer-Verlag, 2nd edition, 2008.