Efficient Evaluation of LLM Performance with Statistical Guarantees
Abstract
Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference, a finite-population extension of active inference (Zrnic & Candès, 2024) that enables direct question selection while preserving coverage. With negligible overhead, FAQ delivers up to $5\times$ effective sample size gains over strong baselines on two benchmark suites, across varying levels of historical-data missingness: that is, it matches the CI width of uniform sampling while using up to $5\times$ fewer queries. We release our source code and curated datasets to support reproducible evaluation and future research.
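The "effective sample size gain" claim rests on the standard fact that CI width shrinks as $1/\sqrt{n}$, so a method that matches uniform sampling's width with $k\times$ fewer queries has a $k\times$ ESS gain. A minimal sketch of that arithmetic (this is an illustration of the definition, not the paper's FAQ estimator; the normal-approximation interval and the function names are our own):

```python
import math

def ci_half_width(p_hat, n, z=1.96):
    """Normal-approximation CI half-width for an accuracy estimate
    p_hat based on n queries (width shrinks as 1/sqrt(n))."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def ess_gain(width_baseline, width_method):
    """ESS gain implied by a CI-width ratio: since width scales as
    1/sqrt(n), matching a width that is r times narrower corresponds
    to an r**2 multiplier on the effective number of queries."""
    return (width_baseline / width_method) ** 2

# Uniform sampling at n = 200 vs. n = 1000 queries (p_hat = 0.7):
w_200 = ci_half_width(0.7, 200)
w_1000 = ci_half_width(0.7, 1000)

# A method whose CI at 200 queries is as narrow as uniform
# sampling's at 1000 queries has a 5x effective sample size.
assert math.isclose(ess_gain(w_200, w_1000), 5.0)
```

So a $5\times$ ESS gain is equivalent to the CI being $\sqrt{5}\approx 2.24\times$ narrower at the same query budget.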
Forward citations
Cited by 3 Pith papers
- Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization: Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.
- An Interpretable and Scalable Framework for Evaluating Large Language Models: A majorization-minimization framework turns IRT into scalable matrix-factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
- Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking: SIREN corrects winner's-curse bias in adaptive LLM benchmarking via selection-aware repeated splits and the bootstrap, yielding valid procedure-level confidence intervals.