ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky · 2025 · cs.CL · arXiv 2505.22169

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.

representative citing papers

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

cs.CL · 2025-07-20 · unverdicted · novelty 6.0

PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.

Improving Multi-turn Dialogue Consistency with Self-Recall Thinking

cs.CL · 2026-05-14 · unverdicted · novelty 5.0

SRT framework improves multi-turn dialogue F1 by 4.7% and cuts end-to-end latency by 14.7% via dependency construction, capability initialization, and reasoning improvement with recall tokens.

citing papers explorer

Showing 2 of 2 citing papers.

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation cs.CL · 2025-07-20 · unverdicted · none · ref 14 · internal anchor
PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.
Improving Multi-turn Dialogue Consistency with Self-Recall Thinking cs.CL · 2026-05-14 · unverdicted · none · ref 16 · internal anchor
SRT framework improves multi-turn dialogue F1 by 4.7% and cuts end-to-end latency by 14.7% via dependency construction, capability initialization, and reasoning improvement with recall tokens.

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

fields

years

verdicts

representative citing papers

citing papers explorer