A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
representative citing papers
citing papers explorer
-
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.