Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Sam Bowyer, Laurence Aitchison, Desi R · 2025 · arXiv 2503.01747

2 Pith papers cite this work.

representative citing papers

The multiply iterated law of the iterated logarithm: game-theoretic foundations of sequential detection boundaries

math.ST · 2026-06-26 · unverdicted · novelty 7.0

The multiply iterated LIL is derived as the minimax boundary of a sequential-detection game whose equalizer prior is the Jeffreys prior selected by the Erdős-Kolmogorov integral test, yielding a closed-form 3/2 coefficient correction.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

citing papers explorer

Showing 2 of 2 citing papers.

The multiply iterated law of the iterated logarithm: game-theoretic foundations of sequential detection boundaries · math.ST · 2026-06-26 · unverdicted · none · ref 3
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation · cs.AI · 2025-10-05 · unverdicted · none · ref 46
