ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Avi Caciularu; Eliya Habba; Gabriel Stanovsky; Gili Lior; Shahar Levy

arxiv: 2505.22169 · v2 · pith:TMIDWGUGnew · submitted 2025-05-28 · 💻 cs.CL

Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky

classification 💻 cs.CL

keywords prompt, evaluation, method, llms, meaningful, moments, recipe, reliableeval

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
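The idea of estimating how many prompt resamplings are needed can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not spell out. It assumes a simple stopping rule: keep sampling meaning-preserving prompt perturbations, track the first two moments (mean and variance) of the per-prompt score, and stop once a normal-approximation confidence interval for the mean is narrower than a tolerance. The names `score_fn`, `resamplings_needed`, `epsilon`, `n_min`, and `n_max` are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def estimate_moments(scores):
    """Return the sample mean and unbiased sample variance of per-prompt scores."""
    mean = statistics.fmean(scores)
    var = statistics.variance(scores) if len(scores) > 1 else 0.0
    return mean, var


def resamplings_needed(score_fn, epsilon=0.01, z=1.96, n_min=5, n_max=500):
    """Sample prompt perturbations until the CI half-width of the mean <= epsilon.

    score_fn is a hypothetical callable (an assumption of this sketch): each call
    draws one meaning-preserving perturbation of the prompt, runs the model on
    it, and returns a score such as accuracy in [0, 1].
    """
    scores = []
    while len(scores) < n_max:
        scores.append(score_fn())
        n = len(scores)
        if n < n_min:
            continue  # too few samples for a usable variance estimate
        _, var = estimate_moments(scores)
        half_width = z * math.sqrt(var / n)  # normal approximation
        if half_width <= epsilon:
            break
    mean, var = estimate_moments(scores)
    return len(scores), mean, var


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for a real model: scores vary around 0.7 across paraphrases.
    fake_score = lambda: min(1.0, max(0.0, rng.gauss(0.7, 0.05)))
    n, mean, var = resamplings_needed(fake_score, epsilon=0.02)
    print(f"n={n} mean={mean:.3f} var={var:.4f}")
```

Under this assumed rule, a model whose score barely changes across paraphrases stops after very few resamplings, while a prompt-sensitive model needs many more before its reported mean is stable. Reporting the variance alongside the mean makes that sensitivity visible rather than hiding it behind a single-prompt number.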

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
cs.CL 2025-07 unverdicted novelty 6.0

PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.

Improving Multi-turn Dialogue Consistency with Self-Recall Thinking
cs.CL 2026-05 unverdicted novelty 5.0

SRT framework improves multi-turn dialogue F1 by 4.7% and cuts end-to-end latency by 14.7% via dependency construction, capability initialization, and reasoning improvement with recall tokens.