VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

Chris Biemann; Jingheng Pan; Liang Ding; Longyue Wang; Weihua Luo; Xintong Wang

arxiv: 2605.02035 · v2 · pith:5WKWUVIInew · submitted 2026-05-03 · 💻 cs.CL · cs.AI

Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann

Pith reviewed 2026-06-30 23:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multimodal machine translation · visual ambiguity · disambiguation dataset · chain-of-thought fine-tuning · span-level evaluation · large vision-language models

The pith

A dataset of 2,500 instances tests whether multimodal translation models require images to resolve ambiguous source spans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates VIDA, a collection of 2,500 translation examples in which each marked ambiguous span in the source sentence can be resolved only by consulting the accompanying image. It also defines Disambiguation-Centric Metrics that use an LLM judge to check whether a model has correctly interpreted the span. Experiments on two large vision-language models show that ordinary supervised fine-tuning raises overall translation scores, while the same fine-tuning performed with chain-of-thought reasoning produces noticeably stronger results on ambiguity cases the model has not encountered before.

Core claim

VIDA supplies 2,500 curated instances in which resolving each annotated source span requires visual evidence rather than textual cues alone. The accompanying Disambiguation-Centric Metrics apply an LLM-as-a-judge classifier to verify span-level resolution. Experiments demonstrate that supervised fine-tuning improves overall translation quality while chain-of-thought supervised fine-tuning yields stronger out-of-distribution disambiguation, indicating that explicit disambiguation guidance improves generalization across ambiguity types.

What carries the argument

The VIDA dataset of 2,500 instances that annotate source spans whose correct meaning depends on the paired image, together with the LLM-as-a-judge metrics that score span-level disambiguation.

If this is right

Supervised fine-tuning on the dataset raises overall translation quality for the tested models.
Chain-of-thought supervised fine-tuning produces better results than standard fine-tuning on ambiguity types absent from the training data.
Providing explicit disambiguation guidance during training helps models handle a wider range of ambiguity patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training pattern may help models in other multimodal settings where language must be grounded in visual context.
The dataset offers a template for constructing similar test sets that isolate the contribution of vision in other language tasks.
Larger collections built on the same principle could reveal which ambiguity categories benefit most from visual input.

Load-bearing premise

The 2,500 instances are built so that resolving each annotated span genuinely requires the image rather than text context alone, and the LLM judge accurately measures correct resolution without its own biases.

What would settle it

A model that reaches high scores on the disambiguation metrics when given only the text input, without any image, would show that the annotated spans do not actually depend on visual evidence.
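
One way to run this check is to score the same instances twice, once with the image and once with text only, and compare span-level accuracy. The sketch below is a minimal illustration, not the paper's evaluation code: the `Verdict` record, its field names, and the `ablation_summary` helper are hypothetical, and the toy verdicts are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Judge outcome for one instance under two input conditions (hypothetical schema)."""
    instance_id: str
    correct_with_image: bool
    correct_text_only: bool


def ablation_summary(verdicts):
    """Return span-level accuracy with and without the image, plus the gap."""
    n = len(verdicts)
    if n == 0:
        raise ValueError("need at least one verdict")
    acc_img = sum(v.correct_with_image for v in verdicts) / n
    acc_txt = sum(v.correct_text_only for v in verdicts) / n
    return {
        "acc_with_image": acc_img,
        "acc_text_only": acc_txt,
        "drop": acc_img - acc_txt,
    }


# Toy data, invented for illustration only.
verdicts = [
    Verdict("a", True, False),
    Verdict("b", True, True),
    Verdict("c", True, False),
    Verdict("d", False, False),
]
summary = ablation_summary(verdicts)
print(summary)
```

A small or zero `drop` on the real dataset would suggest that the annotated spans can be resolved from text alone; a large one would support the visual-dependency premise.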

Figures

Figures reproduced from arXiv: 2605.02035 by Chris Biemann, Jingheng Pan, Liang Ding, Longyue Wang, Weihua Luo, Xintong Wang.

**Figure 1.** Three-stage VIDA curation pipeline.

**Figure 2.** Example of CoT six-step reasoning resolving the ambiguity.

**Figure 3.** Case study of CoT-SFT vs. SFT.

read the original abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VIDA introduces a targeted dataset and span-level metrics for visual ambiguity in MMT, but the abstract leaves the core validation steps unshown so the SFT/CoT-SFT claims rest on unverified premises.

The paper's main contribution is a 2,500-instance dataset of source spans whose resolution is supposed to need the image, plus LLM-as-a-judge metrics that score correct disambiguation at the span level rather than relying on whole-sentence scores such as BLEU. They run SFT and CoT-SFT on two LVLMs and report that the chain-of-thought version generalizes better on out-of-distribution ambiguity types.

It does address documented shortcomings in earlier MMT ambiguity sets, such as format mismatch and insufficient checks that vision is actually required. The span-level focus is a practical move for open-ended translation.

The soft spot is the missing evidence on the two load-bearing assumptions. The abstract calls the instances "carefully curated" and the judge an "LLM-as-a-judge classifier" but supplies no inter-annotator agreement on visual necessity, no human-LLM agreement numbers, and no ablation that removes the image. Without those, any measured gain from CoT-SFT could be explained by text-only cues or judge bias rather than genuine disambiguation guidance.

This is for people building or evaluating multimodal translation systems who need tighter tests for visual reliance. A reader working on dataset construction or evaluation metrics could extract useful ideas if the full paper shows the validation steps. It deserves peer review because the targeted problem is real and the proposed metrics could be adopted once the curation and judge reliability are demonstrated.

Referee Report

3 major / 2 minor

Summary. The paper introduces VIDA, a dataset of 2,500 curated instances targeting visually dependent ambiguity in multimodal machine translation (MMT). It proposes Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to assess correct span-level resolution of annotated ambiguous expressions. Experiments on two state-of-the-art LVLMs demonstrate that supervised fine-tuning (SFT) improves overall translation quality while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation performance, suggesting benefits from explicit disambiguation guidance.

Significance. If the dataset instances are verifiably visually dependent and the LLM judge is shown to be reliable, VIDA would fill a gap in existing MMT benchmarks by providing broader ambiguity coverage and explicit visual-dependency validation. The reported differential between SFT and CoT-SFT on OOD sets could inform training practices for vision-language models, provided the empirical claims rest on rigorous human validation and ablation evidence.

major comments (3)

[Dataset Curation] Dataset construction section: the claim that each of the 2,500 instances requires visual evidence (rather than textual cues) is central to attributing any SFT/CoT-SFT gains to visual disambiguation, yet no inter-annotator agreement scores, human validation of visual necessity, or image-ablation results are reported to support this premise.
[Metrics] Disambiguation-Centric Metrics section: the LLM-as-a-judge classifier is used to verify span-level resolution without reported human-LLM agreement rates, error analysis stratified by ambiguity type, or bias checks, which directly undermines the reliability of the OOD disambiguation results.
[Experiments] Experiments section: the headline finding that CoT-SFT improves OOD generalization rests on the two unvalidated premises above; without quantitative evidence that visual dependency holds and the judge is unbiased, the differential between SFT and CoT-SFT cannot be confidently attributed to explicit disambiguation guidance.
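
The agreement figures requested in the first two comments are usually reported as a chance-corrected statistic such as Cohen's kappa. The sketch below shows how such a number could be computed for a human annotator versus the LLM judge; it is an illustration under assumptions, not the paper's procedure. The `cohens_kappa` helper and the toy label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently
    # with their own marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only.
human = ["Correct", "Correct", "Incorrect", "Incorrect", "Correct", "Incorrect"]
judge = ["Correct", "Correct", "Incorrect", "Correct", "Correct", "Incorrect"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

The same function applies to two human annotators judging whether a span needs the image, which is the inter-annotator figure the first comment asks for.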

minor comments (2)

[Abstract] Abstract and introduction: the phrase 'carefully curated' is used without accompanying quantitative descriptors; replace with explicit statements of curation criteria and validation statistics once added.
[Tables] Figure and table captions: ensure all tables reporting translation metrics include the exact prompt templates and judge instructions used for the LLM evaluator.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the validation of our claims.

Referee: [Dataset Curation] Dataset construction section: the claim that each of the 2,500 instances requires visual evidence (rather than textual cues) is central to attributing any SFT/CoT-SFT gains to visual disambiguation, yet no inter-annotator agreement scores, human validation of visual necessity, or image-ablation results are reported to support this premise.

Authors: We acknowledge that the original manuscript does not report inter-annotator agreement (IAA) scores or explicit human validation of visual necessity. The curation involved multiple annotators with guidelines emphasizing visual dependency, but these details were omitted. In the revised version, we will add IAA scores from the annotation process, results from a targeted human study validating that resolving the annotated spans requires the image (rather than text alone), and image-ablation experiments on a representative subset to quantify the performance drop without visual input. revision: yes

Referee: [Metrics] Disambiguation-Centric Metrics section: the LLM-as-a-judge classifier is used to verify span-level resolution without reported human-LLM agreement rates, error analysis stratified by ambiguity type, or bias checks, which directly undermines the reliability of the OOD disambiguation results.

Authors: We agree that reliability of the LLM judge requires further substantiation. The revised manuscript will include human-LLM agreement rates computed on a held-out sample of translations, an error analysis stratified by ambiguity type (e.g., lexical, syntactic, referential), and bias checks examining judge consistency across ambiguity categories and model outputs. revision: yes

Referee: [Experiments] Experiments section: the headline finding that CoT-SFT improves OOD generalization rests on the two unvalidated premises above; without quantitative evidence that visual dependency holds and the judge is unbiased, the differential between SFT and CoT-SFT cannot be confidently attributed to explicit disambiguation guidance.

Authors: Once the dataset curation and metrics sections are augmented with the requested quantitative evidence, the experiments section will be updated to present the SFT vs. CoT-SFT comparison alongside these validations. This will allow readers to assess the attribution of OOD gains to explicit disambiguation guidance with greater confidence. revision: yes
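
The stratified error analysis promised in the second response amounts to grouping judge verdicts by ambiguity type before averaging. A minimal sketch follows, assuming each record is a hypothetical `(ambiguity_type, is_correct)` pair; the `accuracy_by_type` helper and the toy records are not from the paper.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Map each ambiguity type to the fraction of spans judged correct."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ambiguity_type, is_correct in records:
        totals[ambiguity_type] += 1
        hits[ambiguity_type] += int(is_correct)
    return {t: hits[t] / totals[t] for t in totals}


# Toy records, invented for illustration only.
records = [
    ("lexical", True),
    ("lexical", False),
    ("syntactic", True),
    ("pragmatic", False),
    ("pragmatic", True),
    ("pragmatic", True),
]
per_type = accuracy_by_type(records)
print(per_type)
```

Reporting the per-type counts alongside these rates would show whether an aggregate gain is driven by one well-represented category.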

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with no derivations or self-referential reductions.

full rationale

The paper constructs the VIDA dataset of 2,500 instances and Disambiguation-Centric Metrics via LLM-as-a-judge, then reports SFT/CoT-SFT experiments on LVLMs. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, uniqueness theorems, or ansatzes appear. All claims rest on direct curation, annotation, and empirical measurement rather than reducing to inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes an empirical resource rather than a derivation; its load-bearing premise is the validity of the visual-dependency annotation process.

axioms (1)

domain assumption Resolving the annotated source spans in the selected instances requires visual evidence that cannot be obtained from text alone.
This premise underpins both dataset construction and the claim that models must leverage visual input.

pith-pipeline@v0.9.1-grok · 5734 in / 1321 out tokens · 30289 ms · 2026-06-30T23:54:53.537551+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products
cs.CV 2026-06 unverdicted novelty 5.0

The paper presents the first benchmark for multi-image industrial product attribute extraction, finding that MLLMs achieve high precision but only 49.9% recall at product level due to multi-image completeness gaps.

Reference graph

Works this paper leans on

21 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

GPT-4o System Card

LVP-M3: Language-aware visual prompt for multilingual multimodal machine translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2862–2872, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Describe how these elements connect to the text

Visual Grounding: Examine the image carefully and identify the visual elements that correspond to key words or phrases in the source sentence. Describe how these elements connect to the text
[3]

Initial Translation: Generate a preliminary translation based on both the text and the grounded visual evidence
[4]

Ambiguity Check: Review the initial translation and highlight any terms that remain ambiguous: those whose meanings are unclear or context-dependent when relying on text alone
[5]

While visual grounding establishes a mapping between the image and the text, the initial translation can still leave some ambiguities unresolved

Visual Disambiguation: This step is critical. While visual grounding establishes a mapping between the image and the text, the initial translation can still leave some ambiguities unresolved. The model explicitly revisits the image, not only to strengthen the connection between ambiguous terms and their corresponding visual evidence, but also to r...

2024
[6]

This constraint prevents unnecessary modifications to the sentence structure and helps maintain overall translation fluency

Localized Refinement: Update only the ambiguous parts of the initial translation while keeping the rest unchanged. This constraint prevents unnecessary modifications to the sentence structure and helps maintain overall translation fluency
[7]

object", which requires a concrete translation (

Repeat Check: Reassess the updated translation. If ambiguities remain, iterate steps 3–5 until the translation is fully disambiguated. An example is provided in Figure 2. F Qualitative Analysis As discussed in section 5, CoT-SFT exhibits a strong ability to enhance disambiguation performance, particularly on challenging OOD subsets Figure 2: Example o...
[8]

Determine whether a given English caption contains any ambiguity when interpreted without any additional context or images
[9]

- The reason it is ambiguous (how multiple interpretations can arise)

If ambiguity exists, explain: - The type of ambiguity (lexical, syntactic, pragmatic, or cultural/background). - The reason it is ambiguous (how multiple interpretations can arise). - Potential different Chinese translations reflecting these interpretations
[10]

bank" = financial institution vs. river bank). - Syntactic: the sentence structure permits multiple interpretations (e.g.,

If no ambiguity exists, respond with that conclusion. Ambiguity Definition: - Lexical: a word or phrase has multiple meanings (e.g., "bank" = financial institution vs. river bank). - Syntactic: the sentence structure permits multiple interpretations (e.g., "I saw the man with a telescope"). - Pragmatic: the context or speaker's intention is unclear (e.g.,...
[11]

- If two ambiguities from different models share the same type and describe the same underlying issue, merge them

Merge ambiguities by type: - Group ambiguity entries from both qwen_ambi and v3_ambi by their "type" field (lexical, syntactic, pragmatic, cultural/background). - If two ambiguities from different models share the same type and describe the same underlying issue, merge them. - If ambiguities differ substantially even under the same type, keep them separate
[12]

- translations: union the translation candidates from both sources, removing exact duplicates

Merging Details: - explanation: combine the explanations from both sources into a single concise paragraph. - translations: union the translation candidates from both sources, removing exact duplicates
[13]

en" field), extract the literal word(s) or phrase(s) that cause each ambiguity. - Save them into a new field

Extract Ambiguous Terms: - From the original English sentence (the "en" field), extract the literal word(s) or phrase(s) that cause each ambiguity. - Save them into a new field "ambiguous_terms" (a list). - Terms must be taken literally from the original English sentence. Output Format: [ { "type": "lexical", "explanation": "<combined explanation>", "tran...
[14]

A single English caption (text)
[15]

One image showing the real-world scene the caption describes
[16]

translation_zh

A list of ambiguity notes generated by a previous model. Your task: - Look at BOTH the text and the image. - Disambiguate the caption and produce the most accurate, fluent Chinese translation. - Briefly state which ambiguity was resolved by the visual evidence. Output JSON (Chinese UTF-8): { "translation_zh": "<final Chinese translation>", "resolved_ambig...
[17]

The English source sentence
[18]

The Chinese translation under evaluation
[19]

The ambiguous term or phrase in the source (ambi_term)
[20]

The gold sense (gold_sense), expressed in Chinese, describing the meaning we expect the ambiguous term to convey in the given context
[21]

Correct". - If the meaning is missing or distorted, return

A reference Chinese translation. Task: Based on the gold sense (4) and the reference translation (5), judge whether the Chinese translation under evaluation (2) accurately expresses this meaning. - If yes, return "Correct". - If the meaning is missing or distorted, return "Incorrect". Output format (strictly two lines): Correct/Incorrect, brief reason Eng...
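
The judge prompt quoted in entries [17] to [21] asks for a "Correct" or "Incorrect" verdict followed by a brief reason. The sketch below shows one way such a reply could be turned into a boolean before scoring. It assumes the verdict is the first token of the reply, which the truncated prompt above does not fully confirm; `parse_verdict` is a hypothetical helper, not part of the paper's code.

```python
def parse_verdict(raw_reply):
    """Return True for 'Correct', False for 'Incorrect', None if unparseable.

    Assumption: the verdict opens the reply, with the reason after it.
    """
    head = raw_reply.strip().lower()
    # Test 'incorrect' first so the two labels are never confused.
    if head.startswith("incorrect"):
        return False
    if head.startswith("correct"):
        return True
    return None  # malformed reply; the caller decides how to count it


# Toy replies, invented for illustration only.
print(parse_verdict("Correct\nThe span is rendered with the intended sense."))
print(parse_verdict("Incorrect, the meaning is distorted."))
print(parse_verdict("The translation looks fine."))
```

Returning `None` for malformed replies keeps them visible, so they can be reported separately instead of being silently counted as errors.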
