Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Sam Bowyer, Laurence Aitchison, Desi R · 2025 · arXiv 2503.01747

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

The multiply iterated law of the iterated logarithm: game-theoretic foundations of sequential detection boundaries

math.ST · 2026-06-26

citing papers explorer

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · none · ref 46
