VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
Pith reviewed 2026-06-30 23:54 UTC · model grok-4.3
The pith
A dataset of 2,500 instances tests whether multimodal translation models require images to resolve ambiguous source spans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VIDA supplies 2,500 curated instances in which resolving each annotated source span requires visual evidence rather than textual cues alone. The accompanying Disambiguation-Centric Metrics apply an LLM-as-a-judge classifier to verify span-level resolution. Experiments demonstrate that supervised fine-tuning improves overall translation quality while chain-of-thought supervised fine-tuning yields stronger out-of-distribution disambiguation, indicating that explicit disambiguation guidance improves generalization across ambiguity types.
What carries the argument
The VIDA dataset of 2,500 instances that annotate source spans whose correct meaning depends on the paired image, together with the LLM-as-a-judge metrics that score span-level disambiguation.
If this is right
- Supervised fine-tuning on the dataset raises overall translation quality for the tested models.
- Chain-of-thought supervised fine-tuning produces better results than standard fine-tuning on ambiguity types absent from the training data.
- Providing explicit disambiguation guidance during training helps models handle a wider range of ambiguity patterns.
Where Pith is reading between the lines
- The same training pattern may help models in other multimodal settings where language must be grounded in visual context.
- The dataset offers a template for constructing similar test sets that isolate the contribution of vision in other language tasks.
- Larger collections built on the same principle could reveal which ambiguity categories benefit most from visual input.
Load-bearing premise
The 2,500 instances are built so that resolving each annotated span genuinely requires the image rather than text context alone, and the LLM judge accurately measures correct resolution without its own biases.
What would settle it
A model that reaches high scores on the disambiguation metrics when given only the text input, without any image, would show that the annotated spans do not actually depend on visual evidence.
Figures
read the original abstract
Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VIDA, a dataset of 2,500 curated instances targeting visually dependent ambiguity in multimodal machine translation (MMT). It proposes Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to assess correct span-level resolution of annotated ambiguous expressions. Experiments on two state-of-the-art LVLMs demonstrate that supervised fine-tuning (SFT) improves overall translation quality while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation performance, suggesting benefits from explicit disambiguation guidance.
Significance. If the dataset instances are verifiably visually dependent and the LLM judge is shown to be reliable, VIDA would fill a gap in existing MMT benchmarks by providing broader ambiguity coverage and explicit visual-dependency validation. The reported differential between SFT and CoT-SFT on OOD sets could inform training practices for vision-language models, provided the empirical claims rest on rigorous human validation and ablation evidence.
major comments (3)
- [Dataset Curation] Dataset construction section: the claim that each of the 2,500 instances requires visual evidence (rather than textual cues) is central to attributing any SFT/CoT-SFT gains to visual disambiguation, yet no inter-annotator agreement scores, human validation of visual necessity, or image-ablation results are reported to support this premise.
- [Metrics] Disambiguation-Centric Metrics section: the LLM-as-a-judge classifier is used to verify span-level resolution without reported human-LLM agreement rates, error analysis stratified by ambiguity type, or bias checks, which directly undermines the reliability of the OOD disambiguation results.
- [Experiments] Experiments section: the headline finding that CoT-SFT improves OOD generalization rests on the two unvalidated premises above; without quantitative evidence that visual dependency holds and the judge is unbiased, the differential between SFT and CoT-SFT cannot be confidently attributed to explicit disambiguation guidance.
minor comments (2)
- [Abstract] Abstract and introduction: the phrase 'carefully curated' is used without accompanying quantitative descriptors; replace with explicit statements of curation criteria and validation statistics once added.
- [Tables] Figure and table captions: ensure all tables reporting translation metrics include the exact prompt templates and judge instructions used for the LLM evaluator.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the validation of our claims.
read point-by-point responses
-
Referee: [Dataset Curation] Dataset construction section: the claim that each of the 2,500 instances requires visual evidence (rather than textual cues) is central to attributing any SFT/CoT-SFT gains to visual disambiguation, yet no inter-annotator agreement scores, human validation of visual necessity, or image-ablation results are reported to support this premise.
Authors: We acknowledge that the original manuscript does not report inter-annotator agreement (IAA) scores or explicit human validation of visual necessity. The curation involved multiple annotators with guidelines emphasizing visual dependency, but these details were omitted. In the revised version, we will add IAA scores from the annotation process, results from a targeted human study validating that resolving the annotated spans requires the image (rather than text alone), and image-ablation experiments on a representative subset to quantify the performance drop without visual input. revision: yes
-
Referee: [Metrics] Disambiguation-Centric Metrics section: the LLM-as-a-judge classifier is used to verify span-level resolution without reported human-LLM agreement rates, error analysis stratified by ambiguity type, or bias checks, which directly undermines the reliability of the OOD disambiguation results.
Authors: We agree that reliability of the LLM judge requires further substantiation. The revised manuscript will include human-LLM agreement rates computed on a held-out sample of translations, an error analysis stratified by ambiguity type (e.g., lexical, syntactic, referential), and bias checks examining judge consistency across ambiguity categories and model outputs. revision: yes
-
Referee: [Experiments] Experiments section: the headline finding that CoT-SFT improves OOD generalization rests on the two unvalidated premises above; without quantitative evidence that visual dependency holds and the judge is unbiased, the differential between SFT and CoT-SFT cannot be confidently attributed to explicit disambiguation guidance.
Authors: Once the dataset curation and metrics sections are augmented with the requested quantitative evidence, the experiments section will be updated to present the SFT vs. CoT-SFT comparison alongside these validations. This will allow readers to assess the attribution of OOD gains to explicit disambiguation guidance with greater confidence. revision: yes
Circularity Check
No circularity: empirical dataset construction with no derivations or self-referential reductions.
full rationale
The paper constructs the VIDA dataset of 2,500 instances and Disambiguation-Centric Metrics via LLM-as-a-judge, then reports SFT/CoT-SFT experiments on LVLMs. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, uniqueness theorems, or ansatzes appear. All claims rest on direct curation, annotation, and empirical measurement rather than reducing to inputs by construction; the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Resolving the annotated source spans in the selected instances requires visual evidence that cannot be obtained from text alone.
Forward citations
Cited by 1 Pith paper
-
IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products
The paper presents the first benchmark for multi-image industrial product attribute extraction, finding that MLLMs achieve high precision but only 49.9% recall at product level due to multi-image completeness gaps.
Reference graph
Works this paper leans on
-
[1]
LVP-M3: Language-aware visual prompt for multilingual multimodal machine translation. InPro- ceedings of the 2022 Conference on Empirical Meth- ods in Natural Language Processing, pages 2862– 2872, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark,...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Describe how these elements con- nect to the text
Visual Grounding: Examine the image care- fully and identify the visual elements that cor- respond to key words or phrases in the source sentence. Describe how these elements con- nect to the text
-
[3]
Initial Translation: Generate a preliminary translation based on both the text and the grounded visual evidence
-
[4]
Ambiguity Check: Review the initial trans- lation and highlight any terms that remain ambiguous—those whose meanings are un- clear or context-dependent when relying on text alone
-
[5]
While visual grounding establishes a mapping between the image and the text, the initial translation can still leave some ambiguities un- resolved
Visual Disambiguation: This step is critical. While visual grounding establishes a mapping between the image and the text, the initial translation can still leave some ambiguities un- resolved. The model explicitly revisits the im- age, not only to strengthen the connection be- tween ambiguous terms and their correspond- ing visual evidence, but also to r...
2024
-
[6]
This constraint prevents unnecessary modifications to the sen- tence structure and helps maintain overall translation fluency
Localized Refinement: Update only the am- biguous parts of the initial translation while keeping the rest unchanged. This constraint prevents unnecessary modifications to the sen- tence structure and helps maintain overall translation fluency
-
[7]
object", which requires a concrete translation (
Repeat Check: Reassess the updated transla- tion. If ambiguities remain, iterate steps 3–5 until the translation is fully disambiguated. An example is provided in Figure 2. F Qualitative Analysis As discussed in section 5, CoT-SFT exhibits a strong ability to enhance disambiguation perfor- mance, particularly on challenging OOD subsets Figure 2: Example o...
-
[8]
Determine whether a given English caption contains any ambiguity when interpreted without any additional context or images
-
[9]
- The reason it is ambiguous (how multiple interpretations can arise)
If ambiguity exists, explain: - The type of ambiguity (lexical, syntactic, pragmatic, or cultural/background). - The reason it is ambiguous (how multiple interpretations can arise). - Potential different Chinese translations reflecting these interpretations
-
[10]
bank" = financial institution vs. river bank). - Syntactic: the sentence structure permits multiple interpretations (e.g.,
If no ambiguity exists, respond with that conclusion. Ambiguity Definition: - Lexical: a word or phrase has multiple meanings (e.g., "bank" = financial institution vs. river bank). - Syntactic: the sentence structure permits multiple interpretations (e.g., "I saw the man with a telescope"). - Pragmatic: the context or speaker's intention is unclear (e.g.,...
-
[11]
- If two ambiguities from different models share the same type and describe the same underlying issue, merge them
Merge ambiguities by type: - Group ambiguity entries from both qwen_ambi and v3_ambi by their "type" field (lexical, syntactic, pragmatic, cultural/background). - If two ambiguities from different models share the same type and describe the same underlying issue, merge them. - If ambiguities differ substantially even under the same type, keep them separate
-
[12]
- translations: union the translation candidates from both sources, removing exact duplicates
Merging Details: - explanation: combine the explanations from both sources into a single concise paragraph. - translations: union the translation candidates from both sources, removing exact duplicates
-
[13]
en" field), extract the literal word(s) or phrase(s) that cause each ambiguity. - Save them into a new field
Extract Ambiguous Terms: - From the original English sentence (the "en" field), extract the literal word(s) or phrase(s) that cause each ambiguity. - Save them into a new field "ambiguous_terms" (a list). - Terms must be taken literally from the original English sentence. Output Format: [ { "type": "lexical", "explanation": "<combined explanation>", "tran...
-
[14]
A single English caption (text)
-
[15]
One image showing the real-world scene the caption describes
-
[16]
translation_zh
A list of ambiguity notes generated by a previous model. Your task: - Look at BOTH the text and the image. - Disambiguate the caption and produce the most accurate, fluent Chinese translation. - Briefly state which ambiguity was resolved by the visual evidence. Output JSON (Chinese UTF-8): { "translation_zh": "<final Chinese translation>", "resolved_ambig...
-
[17]
The English source sentence
-
[18]
The Chinese translation under evaluation
-
[19]
The ambiguous term or phrase in the source (ambi_term)
-
[20]
The gold sense (gold_sense), expressed in Chinese, describing the meaning we expect the ambiguous term to convey in the given context
-
[21]
Correct". - If the meaning is missing or distorted, return
A reference Chinese translation. Task: Based on the gold sense (4) and the reference translation (5), judge whether the Chinese translation under evaluation (2) accurately expresses this meaning. - If yes, return "Correct". - If the meaning is missing or distorted, return "Incorrect". Output format (strictly two lines): Correct/Incorrect, brief reason Eng...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.