StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

· 2025 · cs.CL · arXiv 2510.09517

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Despite rapid advances in large language models (LLMs), statistical reasoning remains underrepresented in existing LLM benchmarks, which often do not reflect the layered, proof-driven nature of real statistical practice. To address this gap, we introduce StatEval, the first large-scale benchmark for statistical reasoning across curricular and research-level settings. StatEval includes over 100,000 curated problems, with 20,000+ foundational questions spanning undergraduate and graduate curricula and 80,000+ research-level proof tasks extracted from leading statistical journals. To construct StatEval, we develop TRACE (Topology and Reasoning-Aware Context Extractor), a multi-agent pipeline with human-in-the-loop validation that converts unstructured academic texts into self-contained theorem-level reasoning tasks. We also propose an Adaptive Process-Based Scoring Pipeline for complex statistical proofs, enabling fine-grained evaluation beyond final-answer matching. Experiments show that while LLMs perform reasonably on foundational tasks, they struggle with rigorous research-level reasoning. Beyond evaluation, StatEval serves as a resource for improving reasoning, as retrieval-augmented generation and domain-specific alignment consistently enhance performance. Together, these results establish StatEval as both a benchmark and an infrastructure for advancing statistical reasoning in LLMs.

representative citing papers

Statistical Proof as a Window into Human-AI Collaboration: Practical Insights and a Community Agenda

stat.OT · 2026-06-22 · unverdicted · novelty 3.0

LLMs can execute specific technical steps in statistical proofs when given precise guidance but become unreliable for open-ended problem formulation or multi-step reasoning, relocating rather than reducing the demand for human expertise.

citing papers explorer

Showing 1 of 1 citing paper.

Statistical Proof as a Window into Human-AI Collaboration: Practical Insights and a Community Agenda

stat.OT · 2026-06-22 · unverdicted · none · ref 9 · internal anchor
