GAVEL: Grounded Caption Error Verification and Localization
Pith reviewed 2026-06-26 04:41 UTC · model grok-4.3
The pith
GAVEL introduces a task requiring vision-language models to verify caption errors, explain the mismatch, and localize visual evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAVEL is a task that jointly addresses verification, explanation, and localization for image-text pairs, and the corresponding dataset supplies supervision that improves performance on these abilities where strong zero-shot models fall short.
What carries the argument
The GAVEL task that requires simultaneous verification of caption-image misalignment, explanation of the discrepancy, and localization of visual evidence.
If this is right
- Supervised training on the GAVEL data produces consistent gains on grounding and explanation metrics.
- Strong closed-source vision-language models still struggle when required to perform verification, explanation, and localization together.
- The dataset supports systematic measurement of the three abilities at once rather than in isolation.
Where Pith is reading between the lines
- Larger collections of similar annotations could allow scaling the supervised approach beyond the current training split.
- Better results on GAVEL may reduce the rate of ungrounded statements in downstream caption-generation applications.
Load-bearing premise
The human annotations in the training split correctly identify caption errors and their visual locations without systematic bias or noise that would change the supervised results.
What would settle it
An evaluation in which the supervised baseline shows no measurable gain over strong closed-source models on the held-out test set for grounding or explanation metrics would falsify the claim that the dataset supplies useful learnable supervision.
Figures
read the original abstract
Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy and localizing its visual evidence. We introduce GAVEL (Grounded Caption Error Verification and Localization), a task that jointly addresses verification, explanation, and localization for image-text pairs. To support systematic evaluation, we also present a corresponding dataset and benchmark. We further train a supervised baseline on the human-annotated training split to assess whether GAVEL provides learnable supervision for these abilities. Experiments show that even strong closed-source models struggle on GAVEL, while the supervised baseline yields consistent improvements across grounding and explanation metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GAVEL, a task requiring joint verification of caption errors, generation of explanations, and localization of visual evidence in image-text pairs produced by VLMs. It contributes a human-annotated dataset and benchmark, then trains a supervised baseline on the training split and reports that closed-source models struggle while the baseline shows consistent gains on grounding and explanation metrics.
Significance. If the human annotations prove reliable, GAVEL would supply a much-needed benchmark that moves beyond binary hallucination detection to require explicit grounding and explanation, directly supporting development of more trustworthy VLMs. The empirical result that even strong models underperform a supervised baseline would be a useful existence proof that the task supplies learnable signal.
major comments (3)
- [Dataset construction] Dataset construction section: no inter-annotator agreement, adjudication protocol, or noise analysis is reported for the error-type labels or bounding-box localizations. Because the supervised baseline's reported gains rest entirely on these human annotations being accurate, the absence of reliability statistics makes it impossible to rule out that improvements reflect fitting to annotation artifacts rather than genuine visual-textual misalignment structure.
- [Experiments] Experiments section: the comparison between the supervised baseline and closed-source models does not specify the exact prompting format, number of few-shot examples, or output parsing procedure used for the closed-source models. Without these details it is unclear whether the reported performance gap is intrinsic to the models or an artifact of evaluation setup.
- [Benchmark metrics] Benchmark metrics section: the definitions of the grounding and explanation metrics are not accompanied by any human validation or correlation study showing that automatic scores align with human judgments of localization accuracy or explanation quality. This weakens the claim that the baseline yields 'consistent improvements' on these abilities.
minor comments (2)
- [Abstract] The abstract states that the supervised baseline 'yields consistent improvements across grounding and explanation metrics' but does not quantify the magnitude or statistical significance of those gains; a table of per-metric deltas with confidence intervals would strengthen the claim.
- [Task definition] Notation for error types and localization formats is introduced without an explicit legend or example in the main text; readers must consult the appendix to understand the label space.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions.
read point-by-point responses
-
Referee: [Dataset construction] Dataset construction section: no inter-annotator agreement, adjudication protocol, or noise analysis is reported for the error-type labels or bounding-box localizations. Because the supervised baseline's reported gains rest entirely on these human annotations being accurate, the absence of reliability statistics makes it impossible to rule out that improvements reflect fitting to annotation artifacts rather than genuine visual-textual misalignment structure.
Authors: We agree that annotation reliability details are important. We will expand the dataset construction section to describe the full annotation protocol, including how error types were assigned and how bounding-box localizations were collected, along with any available noise analysis. Inter-annotator agreement was not computed at annotation time, so those specific statistics cannot be reported; we will explicitly note this limitation. revision: partial
-
Referee: [Experiments] Experiments section: the comparison between the supervised baseline and closed-source models does not specify the exact prompting format, number of few-shot examples, or output parsing procedure used for the closed-source models. Without these details it is unclear whether the reported performance gap is intrinsic to the models or an artifact of evaluation setup.
Authors: We agree these details are required for reproducibility. The revised manuscript will include the exact prompting templates, the number of few-shot examples used, and the output parsing procedure for the closed-source models. revision: yes
-
Referee: [Benchmark metrics] Benchmark metrics section: the definitions of the grounding and explanation metrics are not accompanied by any human validation or correlation study showing that automatic scores align with human judgments of localization accuracy or explanation quality. This weakens the claim that the baseline yields 'consistent improvements' on these abilities.
Authors: We will revise the benchmark metrics section to provide fuller formal definitions of the grounding and explanation metrics. While a dedicated human correlation study lies outside the current scope, we will add qualitative examples in the appendix that illustrate how the automatic scores track with human judgments of localization and explanation quality. revision: partial
- Inter-annotator agreement statistics for error-type labels and bounding-box annotations, as these were not collected during dataset creation.
Circularity Check
No circularity: empirical benchmark proposal with no derivations
full rationale
The paper introduces a new task (GAVEL), a dataset, and benchmark evaluations of VLMs plus a standard supervised baseline. No equations, parameters, or derivation chains exist that could reduce to inputs by construction. The supervised baseline is ordinary training on human annotations and does not constitute a 'prediction' that is forced by fitting. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. This is a self-contained empirical contribution whose central claims rest on external model performance and annotation quality rather than internal definitional loops.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Dense and aligned captions (dac) promote compositional reasoning in vl models , author=
-
[2]
Teaching structured vision & language concepts to vision & language models , author=
-
[3]
arXiv preprint arXiv:2602.12281 , year=
Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment , author=. arXiv preprint arXiv:2602.12281 , year=
-
[4]
Alfred: A benchmark for interpreting grounded instructions for everyday tasks , author=
-
[5]
Emerging Properties in Unified Multimodal Pretraining
Emerging properties in unified multimodal pretraining , author=. arXiv preprint arXiv:2505.14683 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Open-vocabulary semantic segmentation with mask-adapted clip , author=
-
[7]
Grounded language-image pre-training , author=
-
[8]
Visionllm: Large language model is also an open-ended decoder for vision-centric tasks , author=
-
[9]
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Open-vocabulary object detection via vision and language knowledge distillation , author=. arXiv preprint arXiv:2104.13921 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models , author=
-
[11]
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=
Referitgame: Referring to objects in photographs of natural scenes , author=. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=
2014
-
[12]
2026 , eprint=
HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning , author=. 2026 , eprint=
2026
-
[13]
Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
arXiv preprint arXiv:2412.19531 , year=
Is Your Text-to-Image Model Robust to Caption Noise? , author=. arXiv preprint arXiv:2412.19531 , year=
-
[15]
Alip: Adaptive language-image pre-training with synthetic caption , author=
-
[16]
Synthesize diagnose and optimize: Towards fine-grained vision-language understanding , author=
-
[17]
Qwen3-vl technical report , author=. arXiv preprint arXiv:2511.21631 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
FirstName Alpher , title =
-
[19]
Journal of Foo , volume = 13, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
-
[20]
Journal of Foo , volume = 14, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
-
[21]
FirstName Alpher and FirstName Gamow , title =
-
[22]
Computer Vision -- ECCV 2022 , year =
2022
-
[23]
Evaluating text-to-visual generation with image-to-text generation , author=
-
[24]
What you see is what you read? improving text-image alignment evaluation , author=
-
[25]
2022 , publisher=
Learning to prompt for vision-language models , author=. 2022 , publisher=
2022
-
[26]
Clip2scene: Towards label-efficient 3d scene understanding by clip , author=
-
[27]
The unreasonable effectiveness of CLIP features for image captioning: an experimental analysis , author=
-
[28]
Winoground: Probing vision and language models for visio-linguistic compositionality , author=
-
[29]
Hugging Face , author=
-
[30]
Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality , author=
-
[31]
Teaching clip to count to ten , author=
-
[32]
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author=. arXiv preprint arXiv:2412.05271 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
COLM , year=
Fine-grained hallucination detection and editing for language models , author=. COLM , year=
-
[34]
Cogvlm: Visual expert for pretrained language models , author=
-
[35]
LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion , author=. 2023
2023
-
[36]
EMNLP , year=
An empirical study of translation hypothesis ensembling with large language models , author=. EMNLP , year=
-
[37]
Llm evaluators recognize and favor their own generations , author=
-
[38]
Stable Diffusion 3.5 Large , author=
-
[39]
2014 , _organization=
Microsoft coco: Common objects in context , author=. 2014 , _organization=
2014
-
[40]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts , author=
-
[41]
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer , author=
-
[42]
ACL , year=
ALOHa: A new measure for hallucination in captioning models , author=. ACL , year=
-
[43]
Cambrian-1: A fully open, vision-centric exploration of multimodal llms , author=
-
[44]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts , author=. arXiv preprint arXiv:2310.02255 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
Seed-bench: Benchmarking multimodal large language models , author=
-
[46]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models , author=. arXiv preprint arXiv:2306.13394 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[47]
Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models , author=
-
[48]
Eyes wide shut? exploring the visual shortcomings of multimodal llms , author=
-
[49]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras , author=. arXiv preprint arXiv:2503.01743 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=
-
[51]
ACL , year=
Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark , author=. ACL , year=
-
[52]
Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[53]
Pixtral 12B , author=. arXiv preprint arXiv:2410.07073 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[54]
The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation , author=
-
[55]
LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild , url=
Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan , _month=. LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild , url=
-
[56]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites , author=. arXiv preprint arXiv:2404.16821 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[57]
2024 , journal=
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding , author=. 2024 , journal=
2024
-
[58]
Chain-of-thought prompting elicits reasoning in large language models , author=
-
[59]
EMNLP , year=
Clair: Evaluating image captions with large language models , author=. EMNLP , year=
-
[60]
Gemini 3 Pro , author=
-
[61]
Gemini 2.0 Flash , author=
-
[62]
Gemini: A Family of Highly Capable Multimodal Models
Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[63]
Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[64]
arXiv preprint arXiv:2310.16656 , year=
A picture is worth a thousand words: Principled recaptioning improves image generation , author=. arXiv preprint arXiv:2310.16656 , year=
-
[65]
Sigmoid loss for language image pre-training , author=
-
[66]
When and why vision-language models behave like bags-of-words, and what to do about it? , author=
-
[67]
Tripletclip: Improving compositional reasoning of clip via synthetic vision-language negatives , author=
-
[68]
Align before fuse: Vision and language representation learning with momentum distillation , author=
-
[69]
2021 , _organization=
Learning transferable visual models from natural language supervision , author=. 2021 , _organization=
2021
-
[70]
2023 , url=
Improving Image Generation with Better Captions , author=. 2023 , url=
2023
-
[71]
2022 , organization=
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. 2022 , organization=
2022
-
[72]
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , year=
Let there be a clock on the beach: Reducing object hallucination in image captioning , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , year=
-
[73]
Qwen2 technical report , author=. arXiv preprint arXiv:2407.10671 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[74]
Journal of Machine Learning Research , volume=
Scaling instruction-finetuned language models , author=. Journal of Machine Learning Research , volume=
-
[75]
2023 , journal=
Mistral 7B , author=. 2023 , journal=
2023
-
[76]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[77]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[78]
2023 , organization=
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. 2023 , organization=
2023
-
[79]
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Minigpt-v2: large language model as a unified interface for vision-language multi-task learning , author=. arXiv preprint arXiv:2310.09478 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
Improved baselines with visual instruction tuning , author=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.