DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

Eliya Habba; Elron Bandel; Gabriel Stanovsky; Itay Itzhak; Leshem Choshen; Michal Shmueli-Scheuer; Ofir Arviv; Yotam Perlitz

arxiv: 2503.01622 · v4 · submitted 2025-03-03 · 💻 cs.CL

Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky

Pith reviewed 2026-05-23 01:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluation · prompt sensitivity · benchmark robustness · multi-dimensional perturbations · dataset construction

The pith

A dataset of over 250 million prompt variations shows LLMs change outputs when delimiters, enumerators or wording shift, questioning single-prompt evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DOVE as a collection of prompt perturbations applied across evaluation benchmarks to measure LLM behavior under many simultaneous changes. It tests how models respond when delimiters, answer enumerators, instruction wording and related factors vary together, producing thousands of versions per original item. The work finds that single fixed prompts can give misleading pictures of capability because small arbitrary choices move the results. Access to the full set of outputs allows identification of prompts that work reliably and instances that stay hard under every variation. This approach matters because current practice often treats one prompt as representative when the data indicate otherwise.

Core claim

DOVE supplies a large-scale set of multi-dimensional prompt perturbations for existing benchmarks and demonstrates through model evaluations that LLMs remain sensitive to arbitrary prompt dimensions even when several factors change at once, which undermines reliance on single-prompt testing.

What carries the argument

The DOVE dataset, which creates thousands of joint perturbations per instance by varying multiple prompt dimensions and records the resulting model predictions.
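One way to picture joint perturbation is as a Cartesian product over prompt dimensions. The sketch below is illustrative only: the dimension names, their values, and the `render` and `joint_variants` helpers are assumptions made for this example, not the paper's actual generation code.

```python
from itertools import product

# Hypothetical dimension values, chosen only for illustration; the
# dimensions and values used to build DOVE may differ.
DIMENSIONS = {
    "instruction": ["Answer the following question.", "Choose the correct option."],
    "enumerator": ["ABCD", "1234"],
    "delimiter": [": ", ") "],
}


def render(question, choices, instruction, enumerator, delimiter):
    """Render one prompt variant for a multiple-choice item."""
    lines = [instruction, question]
    for label, choice in zip(enumerator, choices):
        lines.append(f"{label}{delimiter}{choice}")
    return "\n".join(lines)


def joint_variants(question, choices):
    """Yield (settings, prompt) for every combination of dimension values."""
    names = list(DIMENSIONS)
    for values in product(*(DIMENSIONS[name] for name in names)):
        settings = dict(zip(names, values))
        yield settings, render(question, choices, **settings)


variants = list(joint_variants("What is 2 + 2?", ["3", "4", "5", "6"]))
print(len(variants))   # 2 * 2 * 2 = 8 joint variants for this toy setup
print(variants[0][1])  # the first rendered prompt
```

Because the count multiplies across dimensions, a handful of values per dimension is enough to reach thousands of variants per instance.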

If this is right

Prompt selection procedures can be built that pick formats shown to work well across the recorded variations.
Adding few-shot examples tends to lower the amount by which model outputs change under prompt perturbations.
Some benchmark items remain low-performing for a model family no matter which combination of dimensions is used.
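With per-variant outputs in hand, the first and third consequences above reduce to simple aggregations. The following is a minimal sketch under stated assumptions: the `results` layout, the variant and instance identifiers, and all values are made up for illustration, and the exhaustive selection shown here is not the efficient selection method the paper reports.

```python
# Toy correctness records: results[variant_id][instance_id] is True when the
# model answered correctly. All values are invented for illustration.
results = {
    "v0": {"q1": True, "q2": False, "q3": False},
    "v1": {"q1": True, "q2": True, "q3": False},
    "v2": {"q1": False, "q2": True, "q3": False},
}


def accuracy_per_variant(results):
    """Mean accuracy of each prompt variant over its instances."""
    return {v: sum(r.values()) / len(r) for v, r in results.items()}


def best_variant(results):
    """Pick the variant with the highest mean accuracy (exhaustive search)."""
    acc = accuracy_per_variant(results)
    return max(acc, key=acc.get)


def always_failed(results):
    """Instances answered incorrectly under every recorded variant."""
    shared = set.intersection(*(set(r) for r in results.values()))
    return sorted(i for i in shared if not any(r[i] for r in results.values()))


print(best_variant(results))
print(always_failed(results))
```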

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark reporting could move from single scores to ranges or distributions over prompt variants as a standard practice.
Prompt design workflows might add explicit checks for stability under format changes before claiming results.
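Reporting a distribution instead of a single score could be as simple as the sketch below. The `summarize` helper and the accuracy values are hypothetical and exist only to show the shape of such a report.

```python
import statistics


def summarize(accuracies):
    """Summarize one model's accuracy across prompt variants."""
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),  # sample standard deviation
        "min": min(accuracies),
        "max": max(accuracies),
    }


# Made-up per-variant accuracies for one model on one benchmark.
variant_accuracies = [0.62, 0.70, 0.58, 0.66, 0.64]

summary = summarize(variant_accuracies)
print(
    f"{summary['mean']:.2f} +/- {summary['stdev']:.2f} "
    f"(range {summary['min']:.2f} to {summary['max']:.2f})"
)
```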

Load-bearing premise

The specific prompt dimensions chosen for perturbation represent the main variations that affect evaluation outcomes.

What would settle it

A direct test in which model accuracy rankings on a standard benchmark stay essentially unchanged when measured by single prompts versus by averages across the corresponding DOVE perturbations.
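The test described above amounts to comparing two model rankings. A minimal sketch follows, assuming per-model scores are already available; the model names and numbers are invented, and the hand-rolled Kendall tau (tau-a, without tie correction) is one reasonable agreement measure rather than anything the paper prescribes.

```python
from itertools import combinations


def ranking(scores):
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts over the same models."""
    concordant = discordant = 0
    for m1, m2 in combinations(scores_a, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs


# Made-up scores: one fixed prompt versus the mean over prompt variants.
single_prompt = {"model_a": 0.71, "model_b": 0.68, "model_c": 0.60}
variant_mean = {"model_a": 0.65, "model_b": 0.67, "model_c": 0.58}

print(ranking(single_prompt))
print(ranking(variant_mean))
print(kendall_tau(single_prompt, variant_mean))
```

A tau close to 1 across many benchmarks and models would indicate that single-prompt rankings already agree with perturbation-averaged rankings; lower values would indicate that prompt choice reorders models.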

Original abstract

Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DOVE releases a large public dataset of joint multi-dimensional prompt perturbations that lets others study LLM evaluation sensitivity at scale.

The main contribution here is the DOVE dataset itself. It takes existing benchmarks and applies perturbations across several prompt dimensions at once (delimiters, enumerators, instruction wording, and others), creating thousands of variants per instance for a total of more than 250 million prompt-model output pairs. The authors make the whole thing public with a browsable interface, which is the clearest practical step forward. They also report a few patterns from running models on it: few-shot examples appear to dampen sensitivity, some instances stay hard no matter the prompt, and there may be cheaper ways to pick prompts that perform well on average. These are the kind of observations that could inform more stable evaluation protocols. The work extends single-dimension perturbation studies by looking at joint effects, and the scale plus openness gives the community something concrete to use or extend. The motivation section lands cleanly because single-prompt evals really are common and fragile.

On the softer side, the abstract gives almost no information on how the perturbations were actually generated or what statistical checks were applied, so it is difficult to judge whether the sensitivity results are robust or could be driven by how the variants were constructed. The chosen dimensions are reasonable, but their coverage of real-world prompt variation is an open question rather than a proven one.

This paper is aimed at researchers who work on LLM benchmarking and want raw data to run their own analyses rather than a finished theoretical claim. Someone building or auditing evaluation suites would find the dataset directly usable. It deserves a serious referee because the scale and release address a practical problem that affects many papers; referees can push on the generation methods and analysis without the core idea being flawed.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces DOVE, a large-scale dataset containing more than 250 million prompt perturbations of evaluation benchmarks, generated by jointly varying multiple dimensions including delimiters, answer enumerators, instruction wording, and others. The authors evaluate several LLM families on DOVE and report four empirical findings: LLMs are sensitive to these perturbations, few-shot examples reduce that sensitivity, well-performing prompts can be selected efficiently, and some instances remain hard across all perturbations. The full dataset and model outputs are released publicly to support more robust LLM evaluation practices.

Significance. If the reported sensitivity findings and dataset construction hold, the work is significant as a large-scale empirical resource that enables holistic, multi-dimensional analysis of prompt effects, directly challenging single-prompt evaluation norms. The public release of 250M+ perturbations and outputs, along with the joint perturbation design, provides a concrete community asset that prior smaller-scale studies lack. The scale and reproducibility focus are clear strengths.

minor comments (2)

[Abstract / Methods] The abstract states the dataset scale and findings but provides limited detail on perturbation generation methods, statistical controls, or how joint effects are quantified; the main text should include a dedicated methods subsection with pseudocode or explicit enumeration rules for each dimension to allow replication.
[Introduction / Related Work] The claim that DOVE examines 'holistic' joint effects would be strengthened by an explicit comparison table showing how many perturbations per instance are generated versus prior single-dimension studies.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of DOVE, recognition of its significance as a large-scale resource, and recommendation for minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical contribution that constructs and releases a large dataset of joint prompt perturbations across explicit dimensions (delimiters, enumerators, instruction wording, etc.) and reports observed LLM sensitivities on it. No equations, fitted parameters, or derivations are present; the central claims rest on direct measurement of model outputs rather than any reduction to self-defined quantities or self-citation chains. The representativeness of the chosen dimensions is treated as a scope limitation, not an internal premise that collapses into the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms beyond standard domain assumptions in NLP evaluation, or invented entities are introduced; the work is an empirical dataset construction effort.

pith-pipeline@v0.9.0 · 5737 in / 1040 out tokens · 43764 ms · 2026-05-23T01:33:52.818358+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
cs.CL · 2025-07 · unverdicted · novelty 6.0

PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.