pith. sign in

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.

fields

cs.CL 2

years

2026 1 2025 1

verdicts

UNVERDICTED 2

representative citing papers

citing papers explorer

Showing 2 of 2 citing papers.