StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
1 Pith paper cites this work.
abstract
Despite rapid advances in large language models (LLMs), statistical reasoning remains underrepresented in existing LLM benchmarks, which often do not reflect the layered, proof-driven nature of real statistical practice. To address this gap, we introduce StatEval, the first large-scale benchmark for statistical reasoning across curricular and research-level settings. StatEval includes over 100,000 curated problems, with 20,000+ foundational questions spanning undergraduate and graduate curricula and 80,000+ research-level proof tasks extracted from leading statistical journals. To construct StatEval, we develop TRACE (Topology and Reasoning-Aware Context Extractor), a multi-agent pipeline with human-in-the-loop validation that converts unstructured academic texts into self-contained theorem-level reasoning tasks. We also propose an Adaptive Process-Based Scoring Pipeline for complex statistical proofs, enabling fine-grained evaluation beyond final-answer matching. Experiments show that while LLMs perform reasonably on foundational tasks, they struggle with rigorous research-level reasoning. Beyond evaluation, StatEval serves as a resource for improving reasoning, as retrieval-augmented generation and domain-specific alignment consistently enhance performance. Together, these results establish StatEval as both a benchmark and an infrastructure for advancing statistical reasoning in LLMs.
fields: stat.OT (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Statistical Proof as a Window into Human-AI Collaboration: Practical Insights and a Community Agenda
LLMs can execute specific technical steps in statistical proofs when given precise guidance but become unreliable for open-ended problem formulation or multi-step reasoning, relocating rather than reducing the demand for human expertise.