pith. machine review for the scientific record.

arxiv: 2509.17677 · v2 · submitted 2025-09-22 · 💻 cs.AI

Recognition: unknown

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

Authors on Pith: no claims yet
classification: 💻 cs.AI
keywords: engineering, reasoning, performance, engibench, llms, model, models, benchmark
Original abstract

Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic computation. Existing benchmarks largely focus on well-defined or abstract reasoning and therefore fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results show clear performance stratification across difficulty levels: model accuracy declines with task complexity, degrades under minor perturbations, and remains substantially below human performance on high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/AI4Engi/EngiBench.
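The abstract describes a hierarchical benchmark: each problem sits at one of three difficulty levels and is additionally rewritten into three controlled variants. As a minimal sketch of how such items might be represented and scored per level, the snippet below assumes hypothetical field names (`level`, `subfield`, `variants`) and a simple exact-match scoring rule; the actual schema and grading logic are in the linked repository, not here.

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class EngiBenchItem:
    # Hypothetical fields, not the repository's actual schema.
    question: str
    answer: str
    level: str                  # e.g. "knowledge", "contextual", "open_ended"
    subfield: str               # e.g. "mechanical", "electrical"
    variants: dict = field(default_factory=dict)  # "perturbed", "knowledge_enhanced", "math_abstraction"

def accuracy_by_level(items, predict):
    """Score a model callable predict(question) -> answer, grouped by difficulty level."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item.level] += 1
        if predict(item.question).strip() == item.answer.strip():
            correct[item.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in total}
```

Exact string matching is only a placeholder: it may fit the knowledge-retrieval level, but the open-ended modeling level described in the abstract would need a rubric- or judge-based grading scheme.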

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

    cs.AI · 2026-03 · accept · novelty 7.0

    The ThermoQA benchmark shows top LLMs reach 92-94% overall on thermodynamics problems but degrade sharply on full cycle analysis, confirming that property knowledge does not equal reasoning ability.

  2. IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

    cs.AI · 2026-05 · conditional · novelty 6.0

    IndustryBench shows top LLMs score only 2.083 out of 3 on industrial QA tasks, with safety-violation checks reshuffling rankings and extended reasoning often adding unsupported unsafe details.

  3. IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

    cs.AI · 2026-05 · conditional · novelty 6.0

    IndustryBench shows LLMs reach only modest scores on standards-grounded industrial procurement questions and that safety-violation filtering substantially changes model rankings.

  4. IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

    cs.AI · 2026-05 · conditional · novelty 6.0

    IndustryBench is a standards-grounded Chinese benchmark that exposes LLMs' persistent gaps in industrial terminology, safety compliance, and parameter accuracy, with safety checks reshuffling model rankings.

  5. How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    LLMs exhibit a persistent comprehension-execution gap in end-to-end mathematical modeling tasks, with a new stage-wise framework showing better alignment to human expert judgments than prior schemes.