pith. machine review for the scientific record.

arxiv: 2604.17966 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM benchmark · hypersonic TPS · analytical calculations · reasoning quality · diagnostic evaluation · safety-critical engineering · thermal protection systems · aerospace AI

The pith

A new benchmark for hypersonic TPS engineering checks both LLM numerical answers and reasoning steps to catch physically invalid calculations that standard tests miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deploying LLMs in safety-critical aerospace work requires evaluation that goes beyond final numbers because a plausible heat flux result can stem from flawed physical assumptions and produce unsafe designs. Existing scientific benchmarks test only abstract math or basic physics and ignore the engineering reasoning process. This paper presents TPS-CalcBench as a diagnostic framework built from textbook tasks across four difficulty levels and eight categories, with a dual-track system that scores accuracy separately from reasoning quality using an eight-dimension rubric audited by humans. It also supplies a human-AI data pipeline for high-confidence items, noise-sensitivity checks, and three intervention methods that target formula selection and process awareness. If the framework works as intended, it supplies a practical loop for diagnosing, evaluating, and improving LLM competence before models assist with real thermal protection calculations.

Core claim

TPS-CalcBench is the first diagnostic benchmark and intervention framework for closed-form analytical calculations in hypersonic aerodynamics and high-temperature gas dynamics that experienced TPS engineers perform without simulations. It supplies a domain-oriented task taxonomy, dual-track evaluation of result accuracy and reasoning quality through an 8-dimension rubric with human-audited calibration to flag right-answer-wrong-reasoning cases, a human-AI pipeline that yields 420 high-confidence core items, noise-sensitivity analysis, and three intervention methods: domain fine-tuning, retrieval grounding, and process-aware prompting. Tests across 13 models demonstrate wide performance differences (KPI 12.6-87.9), hidden formula-selection defects, data-driven rank changes, and effective intervention gains.

What carries the argument

The dual-track evaluation that scores numerical accuracy separately from reasoning quality via an 8-dimension rubric calibrated by human audit to identify correct answers reached by invalid reasoning.
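
To make the dual-track idea concrete, here is a minimal Python sketch of how such scoring could be wired together. The tolerance bands, the 0.7/0.3 partial credits, and the equal composite weighting are illustrative assumptions, not the paper's published constants; only the separation of outcome scoring from rubric scoring mirrors the described design.

```python
import math
from dataclasses import dataclass

@dataclass
class DualTrackScore:
    outcome: float  # Track A: banded numerical accuracy, in [0, 1]
    process: float  # Track B: mean of the 8 rubric dimensions, in [0, 1]

def score_outcome(predicted: float, reference: float) -> float:
    """Band a numeric answer as Exact, Acceptable, Order-Correct, or wrong.
    The thresholds and partial credits are assumed, not the paper's."""
    if predicted <= 0 or reference <= 0:
        return 1.0 if predicted == reference else 0.0
    rel_err = abs(predicted - reference) / abs(reference)
    if rel_err < 0.01:
        return 1.0      # Exact band (assumed 1% tolerance)
    if rel_err < 0.10:
        return 0.7      # Acceptable band (assumed 10% tolerance)
    if abs(math.log10(predicted / reference)) < 1.0:
        return 0.3      # Order-Correct: within one order of magnitude
    return 0.0

def score_process(rubric: list[float]) -> float:
    """Track B: average the 8 rubric dimensions, each graded in [0, 1]."""
    assert len(rubric) == 8, "the rubric has exactly 8 dimensions"
    return sum(rubric) / len(rubric)

def composite_kpi(score: DualTrackScore, w_outcome: float = 0.5) -> float:
    """Track C composite on a 0-100 scale; equal weighting is an assumption."""
    return 100.0 * (w_outcome * score.outcome
                    + (1.0 - w_outcome) * score.process)
```

The value of the split is that a model can score 1.0 on Track A while scoring poorly on Track B, which is exactly the right-answer-wrong-reasoning case the rubric is meant to expose.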

If this is right

  • Wide performance gaps appear among 13 tested models, with overall KPI scores ranging from 12.6 to 87.9.
  • Models display hidden defects in formula selection that the reasoning rubric detects even when the final number looks reasonable.
  • Noise-sensitivity analysis shows that data quality changes can shift model rankings, confirming the need for controlled item sets (a rank-shift sketch follows this list).
  • The three interventions of domain-specific fine-tuning, retrieval grounding, and process-aware prompting each produce measurable gains in both accuracy and reasoning quality.
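
The rank-shift claim above is easy to operationalize. Below is a small sketch of comparing model orderings on two item sets; the model names and KPI values are invented, and only the comparison logic reflects the paper's noisy-v2 versus core-v4 analysis.

```python
def ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered best-first by KPI."""
    return sorted(scores, key=scores.get, reverse=True)

def rank_shifts(noisy: dict[str, float], core: dict[str, float]) -> dict[str, int]:
    """Positions gained (+) or lost (-) moving from the noisy set
    to the high-confidence core set."""
    pos_noisy = {m: i for i, m in enumerate(ranking(noisy))}
    pos_core = {m: i for i, m in enumerate(ranking(core))}
    return {m: pos_noisy[m] - pos_core[m] for m in noisy}

# Invented KPI scores for three hypothetical models on each item set:
v2_noisy = {"model_a": 71.0, "model_b": 68.5, "model_c": 64.2}
v4_core = {"model_a": 66.1, "model_b": 70.3, "model_c": 58.9}
print(rank_shifts(v2_noisy, v4_core))
# {'model_a': -1, 'model_b': 1, 'model_c': 0} -> the top spot flips
```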

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar rubric-based diagnosis could be adapted to other safety-critical engineering domains where plausible numbers can mask physical inconsistencies.
  • The results point toward greater use of process-aware prompting when LLMs must maintain physical consistency across related calculations (a hypothetical prompt sketch follows this list).
  • Widespread adoption might push LLM training toward explicit checks for physical validity rather than numerical plausibility alone.
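
As a concrete illustration of the second bullet, a process-aware prompt in the spirit of PA-CoT might look like the sketch below. The checklist wording is hypothetical; the paper's actual prompt text is not reproduced on this page.

```python
# Hypothetical process-aware prompt in the spirit of PA-CoT; the
# checklist wording is invented, not quoted from the paper.
PA_COT_TEMPLATE = """\
You are assisting with a hypersonic TPS analytical calculation.
Before reporting any number:
1. State the governing closed-form relation and its validity regime.
2. Check unit consistency across every substituted quantity.
3. Verify the result against physical limits (sign, order of magnitude).
4. Only then report the final value, with units.

Problem: {problem}
"""

def pa_cot_prompt(problem: str) -> str:
    """Wrap a TPS task in the process-aware checklist."""
    return PA_COT_TEMPLATE.format(problem=problem)

print(pa_cot_prompt("Estimate the stagnation-point heat flux for a "
                    "0.3 m nose radius at Mach 20 and 60 km altitude."))
```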

Load-bearing premise

The selected textbook tasks and human-AI generated items accurately represent the closed-form analytical calculations that experienced TPS engineers perform without simulations, and the 8-dimension rubric validly measures the quality of that engineering reasoning.

What would settle it

A direct comparison in which practicing TPS engineers solve the benchmark tasks and produce methods or answers that differ substantially from the benchmark expectations, or an expert audit showing that high rubric scores still correspond to physically invalid design margins.

Figures

Figures reproduced from arXiv: 2604.17966 by Chuhan Qiao, Haiming Huang, Jinglai Zheng.

Figure 1. Data construction funnel for TPS-CalcBench. The pipeline proceeds from high-recall automatic extraction to increas…
Figure 2. Track A Outcome Results: Exact, Acceptable, and Order-Correct bands for all evaluated models on the TPS-CalcBench…
Figure 3. Track B Process Quality: Mean scores across the 8-dimension engineering rubric for top-performing models.
Figure 4. Track C Composite KPI: Summary of model performance across both outcome and process tracks, forming clear…
Figure 5. Noise Sensitivity Analysis: Performance shift between the noisy v2 set and the high-confidence v4 core set, demon…
Original abstract

Deploying LLMs as reasoning assistants in safety-critical aerospace engineering requires stricter evaluation criteria than general scientific benchmarks. In hypersonic thermal protection system (TPS) design, inaccurate stagnation-point heat flux or boundary-layer calculations may cause catastrophic design margin violations. Models with numerically reasonable but physically invalid answers are more dangerous than those declining to respond. Current scientific benchmarks only test abstract math and basic physics, evaluate final answers solely, ignore engineering reasoning processes, and cannot detect such critical failures. We propose TPS-CalcBench, the first diagnostic benchmark for closed-form analytical calculations in hypersonic aerodynamics and high-temperature gas dynamics that experienced TPS engineers conduct without simulations. Our contributions include domain-oriented task taxonomy with 4 difficulty levels and 8 categories from Anderson's textbook, dual-track evaluation measuring result accuracy and reasoning quality via an 8-dimension rubric and calibrated judge with human audit to identify right answer wrong reasoning issues, human-AI data pipeline producing 420 high-confidence core items and 810 noise-controlled pre-gating items from 4560 raw data, noise-sensitivity analysis measuring data quality impacts on model ranking, and three diagnostic intervention methods: DFA-TPS fine-tuning, RAG-EQ retrieval grounding and PA-CoT process-aware prompting. Tests on 13 models from 7 groups show wide performance differences (KPI 12.6-87.9), hidden formula selection defects, data-driven rank changes and effective intervention improvements, establishing a complete diagnose-evaluate-intervene framework for safety-critical engineering LLM deployment assessment.
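
The curation numbers in the abstract (4560 raw items, 810 noise-controlled pre-gating items, 420 high-confidence core items) imply a staged filtering funnel. A minimal sketch of that shape follows; the predicates are stand-ins, since the paper's actual gating criteria are not reproduced on this page.

```python
from typing import Callable

def gate(items: list[dict], keep: Callable[[dict], bool], label: str) -> list[dict]:
    """One funnel stage: filter the items and report the surviving count."""
    kept = [item for item in items if keep(item)]
    print(f"{label}: {len(kept)} items")
    return kept

# Stand-in predicates; the real criteria are assumptions here.
def auto_ok(item: dict) -> bool:
    # High-recall extraction keeps anything that parses with a reference answer.
    return bool(item.get("parses")) and "reference_answer" in item

def noise_ok(item: dict) -> bool:
    # Noise-controlled pre-gating drops items flagged as noisy.
    return item.get("noise_score", 1.0) < 0.2

def audit_ok(item: dict) -> bool:
    # Human-AI audit admits items into the high-confidence core.
    return item.get("audit") == "pass"

def build_core_set(raw: list[dict]) -> list[dict]:
    extracted = gate(raw, auto_ok, "auto-extracted")           # from 4560 raw
    pre_gated = gate(extracted, noise_ok, "pre-gating set")    # toward 810
    return gate(pre_gated, audit_ok, "high-confidence core")   # toward 420
```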

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TPS-CalcBench as the first diagnostic benchmark and evaluation framework specifically for closed-form analytical calculations in hypersonic TPS engineering. It defines a 4-level/8-category task taxonomy drawn from Anderson's textbook, employs a human-AI pipeline to curate 420 high-confidence items (plus noise-controlled variants), applies dual-track evaluation of answer accuracy and reasoning quality via an 8-dimension rubric plus calibrated judge with human audit, conducts noise-sensitivity analysis, and evaluates three interventions (DFA-TPS fine-tuning, RAG-EQ retrieval, PA-CoT prompting) on 13 models, reporting a KPI range of 12.6-87.9, formula-selection defects, and intervention gains.

Significance. If the tasks and rubric prove representative, the work supplies a much-needed domain-specific diagnostic tool that moves beyond final-answer scoring to detect physically invalid reasoning in safety-critical aerospace calculations. The human-AI curation pipeline, noise-sensitivity analysis, and explicit intervention methods constitute concrete strengths that could support more reliable LLM deployment assessment in engineering contexts.

major comments (3)
  1. [Abstract and Task Taxonomy] The central claim that the selected tasks represent 'closed-form analytical calculations ... that experienced TPS engineers conduct without simulations' is load-bearing for the benchmark's diagnostic value, yet the manuscript provides no practitioner consultation, workflow mapping, or expert validation of the 4-level/8-category taxonomy against real design failure modes; reliance on textbook extraction alone leaves generalizability unestablished.
  2. [Evaluation Framework] The 8-dimension rubric is presented as distinguishing 'right answer, wrong reasoning' issues, but no section reports inter-rater reliability, correlation with documented TPS physical constraints, or validation against actual engineer error patterns; without this, the rubric's ability to surface critical failures remains unproven and directly affects the dual-track evaluation results.
  3. [Experiments] Reported KPI spreads (12.6-87.9) and intervention improvements are interpreted as establishing a 'complete diagnose-evaluate-intervene framework,' but the absence of human-expert baseline performance on the same 420 items or comparison against existing engineering calculation benchmarks makes it impossible to calibrate the absolute severity of the observed defects.
minor comments (2)
  1. [Abstract] The numbers '420 high-confidence core items and 810 noise-controlled pre-gating items from 4560 raw data' are stated without a concise summary of the exact human-audit criteria or pre-gating thresholds; adding a short table or paragraph would improve reproducibility.
  2. [Overall] Several acronyms (DFA-TPS, RAG-EQ, PA-CoT) appear without immediate expansion on first use; ensure consistent definitions in the main text and abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of validation and calibration that we address point by point below. We propose targeted revisions to strengthen the manuscript while preserving its core contributions.

Point-by-point responses
  1. Referee: Abstract and Task Taxonomy section: the central claim that the selected tasks represent 'closed-form analytical calculations ... that experienced TPS engineers conduct without simulations' is load-bearing for the benchmark's diagnostic value, yet the manuscript provides no practitioner consultation, workflow mapping, or expert validation of the 4-level/8-category taxonomy against real design failure modes; reliance on textbook extraction alone leaves generalizability unestablished.

    Authors: We agree that explicit practitioner validation would increase confidence in generalizability. The taxonomy is extracted from John D. Anderson's Hypersonic and High-Temperature Gas Dynamics, the canonical reference for TPS analytical methods. Each of the 8 categories maps directly to standard preliminary-design calculations (e.g., stagnation-point heating, boundary-layer properties) that textbooks and design handbooks present as closed-form steps performed before CFD. In the revision we will add a dedicated paragraph in Section 3 that (i) cites the exact Anderson sections for each category, (ii) notes their routine use in NASA and industry TPS sizing workflows, and (iii) acknowledges the absence of a new expert survey as a limitation while arguing that textbook grounding provides a reproducible and field-accepted foundation. This constitutes a partial revision focused on textual clarification rather than new data collection. revision: partial

  2. Referee: Evaluation Framework and Rubric description: the 8-dimension rubric is presented as distinguishing 'right answer, wrong reasoning' issues, but no section reports inter-rater reliability, correlation with documented TPS physical constraints, or validation against actual engineer error patterns; without this, the rubric's ability to surface critical failures remains unproven and directly affects the dual-track evaluation results.

    Authors: We concur that quantitative reliability metrics would strengthen the rubric's credibility. The 8 dimensions were derived from documented error classes in hypersonic gas-dynamics literature (formula mis-selection, dimensional inconsistency, violation of thermodynamic limits, etc.). The human audit described in the paper involved two reviewers with aerospace backgrounds who examined a 20% sample of judge outputs; disagreements were resolved by consensus. In the revised manuscript we will insert a short subsection under Evaluation Framework that (a) reports the observed agreement rate on the audited subset (a kappa-style computation is sketched after these responses), (b) provides one concrete example per dimension linking the rubric criterion to a specific physical constraint (e.g., positivity of heat flux), and (c) notes that a full correlation study against logged engineer errors is left for future work. This is a partial revision that adds transparency without requiring new experiments. revision: partial

  3. Referee: Experiments section: reported KPI spreads (12.6-87.9) and intervention improvements are interpreted as establishing a 'complete diagnose-evaluate-intervene framework,' but the absence of human-expert baseline performance on the same 420 items or comparison against existing engineering calculation benchmarks makes it impossible to calibrate the absolute severity of the observed defects.

    Authors: We accept that absolute severity is difficult to judge without a human ceiling. Our primary goal was to demonstrate relative differences and the diagnostic power of the dual-track approach on TPS-specific tasks, which existing general benchmarks do not cover. In the revision we will (i) add a paragraph in the Experiments section that situates the observed KPI range against published model scores on MATH and SciBench (where even strong models rarely exceed 80% on multi-step symbolic problems), (ii) state that expert TPS engineers are expected to achieve near-ceiling accuracy on these closed-form items, and (iii) list the lack of a measured human baseline as an explicit limitation with a suggestion for future benchmark extensions. This is a partial revision consisting of added discussion and limitation text. revision: partial
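
On the agreement rate promised in response 2, a chance-corrected statistic such as Cohen's kappa is the standard way to report two-auditor consistency. A self-contained sketch, with invented pass/fail labels standing in for the audited judge outputs:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in set(rater_a) | set(rater_b))
    return (observed - expected) / (1.0 - expected)

# Invented audit labels on a 10-item sample of judge outputs:
auditor_1 = ["pass", "pass", "fail", "pass", "fail",
             "pass", "pass", "fail", "pass", "pass"]
auditor_2 = ["pass", "fail", "fail", "pass", "fail",
             "pass", "pass", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(auditor_1, auditor_2):.2f}")  # kappa = 0.47
```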

Circularity Check

0 steps flagged

No circularity: benchmark curation from external textbook with independent rubric

Full rationale

The paper constructs TPS-CalcBench by selecting tasks from Anderson's external textbook, applying a human-AI pipeline to generate 420 items, and defining an 8-dimension rubric for reasoning quality. No load-bearing step reduces to self-definition, fitted inputs renamed as predictions, or self-citation chains. The central claims concern empirical coverage and diagnostic utility of the resulting benchmark; these rest on the external source material and separately stated rubric criteria rather than any equation or parameter that is equivalent to its own inputs by construction. Absence of derivations, uniqueness theorems, or ansatzes eliminates the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The benchmark rests primarily on domain assumptions about textbook problems representing real engineering practice and on the validity of a newly introduced evaluation rubric; no free parameters or invented physical entities are evident from the abstract.

axioms (1)
  • domain assumption Tasks from Anderson's textbook accurately represent the closed-form analytical calculations performed by experienced TPS engineers without simulations.
    The entire benchmark taxonomy and core items are built from these tasks as the foundation for testing LLM competence.
invented entities (2)
  • 8-dimension rubric for reasoning quality (no independent evidence)
    purpose: To score not only final numerical accuracy but also the quality of the LLM's reasoning process and detect right-answer-wrong-reasoning cases.
    Introduced as part of the dual-track evaluation to address limitations of answer-only benchmarks.
  • 4-level/8-category task taxonomy (no independent evidence)
    purpose: To systematically organize benchmark items by engineering complexity and type.
    New domain-oriented structure for the 420 core items.

pith-pipeline@v0.9.0 · 5582 in / 1588 out tokens · 69152 ms · 2026-05-10T05:21:00.065919+00:00 · methodology


Reference graph

Works this paper leans on


  1. [36] Anderson, J. D. (2006). Hypersonic and High-Temperature Gas Dynamics (2nd ed.). AIAA Education Series.
  2. [37] Bertin, J. J. (1994). Hypersonic Aerothermodynamics. AIAA Education Series.
  3. [38] Gnoffo, P. A., Gupta, R. N., & Shinn, J. L. (1999). Conservation equations and physical models for hypersonic air flows in thermal and chemical nonequilibrium. NASA TP-2867.
  4. [39] OpenAI (2023). GPT-4 technical report. arXiv:2303.08774.
  5. [40] Anthropic (2024). The Claude 3 model family. Technical report.
  6. [41] Google DeepMind (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530.
  7. [42] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., ... & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
  8. [43] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., ... & Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. NeurIPS.
  9. [44] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. ICLR.
  10. [45] Sun, R., et al. (2024). SciEval: A multi-level large language model evaluation benchmark for scientific research. arXiv:2308.13149.
  11. [46] Wang, X., et al. (2023). SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv:2307.10635.
  12. [47] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., ... & Cobbe, K. (2023). Let's verify step by step. arXiv:2305.20050.
  13. [48] Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., ... & Kalai, A. (2022). Solving math word problems with process- and outcome-based feedback. arXiv:2211.14275.
  14. [49] Yang, S., et al. (2023). Rethinking benchmark and contamination for language models with rephrased samples. arXiv:2311.04850.
  15. [50] Northcutt, C. G., Athalye, A., & Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. NeurIPS.
  16. [51] Biderman, S., et al. (2024). Lessons from the trenches on reproducible evaluation of language models. arXiv:2405.14782.
  17. [52] Zhang, Y., et al. (2024). Towards LLM-assisted CFD simulation: Benchmarks and evaluation. arXiv preprint.
  18. [53] AIME (2024). American Invitational Mathematics Examination. Mathematical Association of America.