DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Pith reviewed 2026-05-23 01:33 UTC · model grok-4.3
The pith
A dataset of over 250 million prompt variations shows that LLM outputs change when delimiters, enumerators, or wording shift, which calls single-prompt evaluation into question.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DOVE supplies a large-scale set of multi-dimensional prompt perturbations for existing benchmarks and demonstrates through model evaluations that LLMs remain sensitive to arbitrary prompt dimensions even when several factors change at once, which undermines reliance on single-prompt testing.
What carries the argument
The DOVE dataset, which creates thousands of joint perturbations per instance by varying multiple prompt dimensions and records the resulting model predictions.
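As a rough illustration of what a joint perturbation looks like, the sketch below takes the Cartesian product of a few prompt dimensions and renders one multiple-choice prompt per combination. The dimension values, the `render_prompt` and `joint_perturbations` helpers, and the example question are all hypothetical and are not taken from DOVE itself.

```python
from itertools import product

# Hypothetical dimension values; DOVE's actual dimensions and values may differ.
DIMENSIONS = {
    "instruction": ["Answer the question.", "Choose the correct option."],
    "enumerator": ["letters", "numbers"],
    "separator": ["\n", " | "],
}


def enumerate_label(style, index):
    """Return 'A', 'B', ... for letters, or '1', '2', ... for numbers."""
    return chr(ord("A") + index) if style == "letters" else str(index + 1)


def render_prompt(question, choices, instruction, enumerator, separator):
    """Render one prompt for a single combination of dimension values."""
    options = separator.join(
        f"{enumerate_label(enumerator, i)}. {choice}"
        for i, choice in enumerate(choices)
    )
    return f"{instruction}\n{question}\n{options}\nAnswer:"


def joint_perturbations(question, choices):
    """Yield (config, prompt) for every combination of all dimensions."""
    names = list(DIMENSIONS)
    for values in product(*(DIMENSIONS[name] for name in names)):
        config = dict(zip(names, values))
        yield config, render_prompt(question, choices, **config)


variants = list(joint_perturbations("What is 2 + 2?", ["3", "4", "5"]))
```

With three dimensions of two values each, this yields 2 × 2 × 2 = 8 prompts for one instance; adding dimensions or values multiplies the count, which is how a per-instance total can reach the thousands.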
If this is right
- Prompt selection procedures can be built that pick formats shown to work well across the recorded variations.
- Adding few-shot examples tends to lower the amount by which model outputs change under prompt perturbations.
- Some benchmark items remain low-performing for a model family no matter which combination of dimensions is used.
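Two of the implications above can be sketched on toy data: picking the prompt configuration with the best observed accuracy, and flagging instances that no configuration solves. This is a naive baseline written for illustration, not the selection method the paper proposes; the record layout and all values are made up.

```python
from collections import defaultdict

# Toy records: (prompt_config_id, instance_id, is_correct). Values are invented.
records = [
    ("p1", "q1", True), ("p1", "q2", False), ("p1", "q3", False),
    ("p2", "q1", True), ("p2", "q2", True), ("p2", "q3", False),
    ("p3", "q1", False), ("p3", "q2", True), ("p3", "q3", False),
]


def accuracy_by_config(records):
    """Mean accuracy of each prompt configuration over its instances."""
    hits, totals = defaultdict(int), defaultdict(int)
    for config, _, correct in records:
        hits[config] += int(correct)
        totals[config] += 1
    return {config: hits[config] / totals[config] for config in totals}


def hard_instances(records):
    """Instances answered incorrectly under every recorded configuration."""
    solved = defaultdict(bool)
    for _, instance, correct in records:
        solved[instance] = solved[instance] or correct
    return sorted(instance for instance, ok in solved.items() if not ok)


acc = accuracy_by_config(records)
best_config = max(acc, key=acc.get)
always_wrong = hard_instances(records)
```

Here `best_config` is `"p2"` and `always_wrong` is `["q3"]`. In practice the selection would be made on a held-out subset, so that the chosen format is not tuned to the items it is later scored on.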
Where Pith is reading between the lines
- Benchmark reporting could move from single scores to ranges or distributions over prompt variants as a standard practice.
- Prompt design workflows might add explicit checks for stability under format changes before claiming results.
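One way such reporting and stability checks might look is sketched below, assuming a list of per-variant accuracies for a single model and benchmark. The `summarize` and `is_stable` helpers, the 0.05 spread threshold, and the scores are illustrative assumptions rather than anything prescribed by the paper.

```python
import statistics


def summarize(accuracies):
    """Describe the spread of accuracy across prompt variants."""
    ordered = sorted(accuracies)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "stdev": statistics.pstdev(ordered),  # population standard deviation
    }


def is_stable(accuracies, max_spread=0.05):
    """True when best and worst variants differ by at most max_spread."""
    return max(accuracies) - min(accuracies) <= max_spread


# Invented per-variant accuracies for one model on one benchmark.
scores = [0.62, 0.70, 0.55, 0.68]
report = summarize(scores)
stable = is_stable(scores)
```

For these scores the range is 0.55 to 0.70, so `stable` is `False` under the assumed threshold; a single reported number would hide that 15-point spread.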
Load-bearing premise
The specific prompt dimensions chosen for perturbation represent the main variations that affect evaluation outcomes.
What would settle it
A direct test in which model accuracy rankings on a standard benchmark stay essentially unchanged when measured by single prompts versus by averages across the corresponding DOVE perturbations.
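The test described above amounts to comparing two model rankings. A minimal sketch, assuming per-model accuracies are already available for both settings, computes Kendall's tau-a between them; the model names and scores are invented, and the hand-rolled `kendall_tau` function stands in for whatever rank-correlation measure one prefers.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts keyed by model name.

    Tied pairs count as neither concordant nor discordant.
    Requires at least two models.
    """
    models = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


# Invented accuracies: one prompt vs. the mean over many perturbations.
single_prompt = {"model_a": 0.71, "model_b": 0.66, "model_c": 0.60}
perturbation_mean = {"model_a": 0.64, "model_b": 0.67, "model_c": 0.58}

tau = kendall_tau(single_prompt, perturbation_mean)
```

A tau near 1 would mean the single-prompt ranking agrees with the perturbation-averaged one, which would weaken the paper's concern; in this toy example one of three pairs flips, giving a tau of 1/3.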
Original abstract
Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DOVE, a large-scale dataset containing more than 250 million prompt perturbations of evaluation benchmarks, generated by jointly varying multiple dimensions including delimiters, answer enumerators, instruction wording, and others. The authors evaluate several LLM families on DOVE and report four empirical findings: LLMs are sensitive to these perturbations, few-shot examples reduce that sensitivity, well-performing prompts can be selected efficiently, and some instances remain hard across all perturbations. The full dataset and model outputs are released publicly to support more robust LLM evaluation practices.
Significance. If the reported sensitivity findings and dataset construction hold, the work is significant as a large-scale empirical resource that enables holistic, multi-dimensional analysis of prompt effects, directly challenging single-prompt evaluation norms. The public release of 250M+ perturbations and outputs, along with the joint perturbation design, provides a concrete community asset that prior smaller-scale studies lack. The scale and reproducibility focus are clear strengths.
Minor comments (2)
- [Abstract / Methods] The abstract states the dataset scale and findings but provides limited detail on perturbation generation methods, statistical controls, or how joint effects are quantified; the main text should include a dedicated methods subsection with pseudocode or explicit enumeration rules for each dimension to allow replication.
- [Introduction / Related Work] The claim that DOVE examines 'holistic' joint effects would be strengthened by an explicit comparison table showing how many perturbations per instance are generated versus prior single-dimension studies.
Simulated Author's Rebuttal
We thank the referee for their positive summary of DOVE, recognition of its significance as a large-scale resource, and recommendation for minor revision. No major comments were raised in the report.
Circularity Check
No significant circularity
The paper is an empirical contribution that constructs and releases a large dataset of joint prompt perturbations across explicit dimensions (delimiters, enumerators, instruction wording, etc.) and reports observed LLM sensitivities on it. No equations, fitted parameters, or derivations are present; the central claims rest on direct measurement of model outputs rather than any reduction to self-defined quantities or self-citation chains. The representativeness of the chosen dimensions is treated as a scope limitation, not an internal premise that collapses into the inputs.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.