pith. machine review for the scientific record.

arxiv: 2605.02035 · v1 · submitted 2026-05-03 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

Chris Biemann, Jingheng Pan, Liang Ding, Longyue Wang, Weihua Luo, Xintong Wang

Pith reviewed 2026-05-08 19:29 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multimodal machine translation · visual ambiguity · disambiguation dataset · chain-of-thought fine-tuning · large vision language models · supervised fine-tuning · evaluation metrics · out-of-distribution generalization

The pith

A dataset of 2,500 translation instances shows that chain-of-thought fine-tuning helps models use visual evidence to resolve ambiguities more consistently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a new dataset to test how well machine translation models use images to resolve ambiguous words or phrases in the input text. Prior benchmarks had quality problems and did not match real translation needs, so the authors curate 2,500 examples where vision is essential for disambiguation and design metrics that check resolution at the specific ambiguous span. Experiments reveal that adding chain-of-thought reasoning to fine-tuning helps models disambiguate more reliably than standard fine-tuning, with particular benefits on examples unlike those seen during training. Readers should care because translation systems that properly ground meaning in visuals could reduce errors in settings such as image and video description.

Core claim

We introduce VIDA, a multimodal dataset consisting of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence from the corresponding image. We propose Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to determine whether the model has correctly resolved the ambiguous expressions at the span level. Through experiments with state-of-the-art large vision language models using vanilla inference, supervised fine-tuning, and chain-of-thought supervised fine-tuning, we find that chain-of-thought fine-tuning provides more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets.

What carries the argument

The VIDA dataset of 2,500 translation instances requiring visual evidence for ambiguity resolution, evaluated with Disambiguation-Centric Metrics that use an LLM-as-a-judge to verify span-level correctness.
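The paper does not reproduce its judge prompt or scoring code here; as a rough sketch of how a span-level disambiguation accuracy could be computed, with a hypothetical `judge_resolved` callable standing in for the LLM-as-a-judge classifier:

```python
# Sketch of a span-level disambiguation accuracy in the spirit of the
# paper's Disambiguation-Centric Metrics. The judge function is a stand-in:
# the paper's actual prompt and judge model are not shown in this review.

def judge_resolved(source, span, translation, image_desc):
    """Placeholder for an LLM-as-a-judge call that returns True iff the
    annotated ambiguous span is correctly resolved in the translation."""
    raise NotImplementedError("swap in a real judge call here")

def disambiguation_accuracy(instances, judge=judge_resolved):
    """Fraction of instances whose annotated span the judge marks resolved.

    Each instance is a dict with keys: source, span, translation, image_desc.
    """
    if not instances:
        return 0.0
    resolved = sum(
        bool(judge(x["source"], x["span"], x["translation"], x["image_desc"]))
        for x in instances
    )
    return resolved / len(instances)
```

Because the metric is just a mean over per-span binary judgments, any judge bias (e.g. a preference for chain-of-thought outputs) shifts the accuracy directly, which is what the referee report below flags.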

If this is right

  • Standard supervised fine-tuning improves overall translation quality but provides less consistent gains on disambiguation tasks.
  • Chain-of-thought supervised fine-tuning produces stronger and more reliable improvements in disambiguation accuracy.
  • The gains from chain-of-thought fine-tuning are especially pronounced on out-of-distribution examples.
  • The new metrics enable precise checking of whether models resolve specific ambiguous spans correctly rather than relying on overall sentence quality.
  • The approach supports evaluation across a wider range of ambiguity types than previous benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Encouraging step-by-step reasoning during training appears to help multimodal models better integrate visual context with linguistic input.
  • The dataset and metrics could be extended to additional languages or domains to test whether the same training pattern holds.
  • If the LLM judge scales reliably, it could support faster iteration on new disambiguation methods without full human evaluation.
  • Similar chain-of-thought fine-tuning might improve performance on other multimodal tasks involving ambiguous descriptions.

Load-bearing premise

The 2,500 instances are accurately annotated such that visual evidence is genuinely required to resolve each ambiguous span, and the LLM-as-a-judge classifier reliably measures correct span-level disambiguation without its own biases or errors.

What would settle it

An independent human review that identifies many dataset examples resolvable from source text alone without the image, or that shows the LLM judge disagrees with human judgments on resolution correctness in a substantial portion of cases.

Figures

Figures reproduced from arXiv: 2605.02035 by Chris Biemann, Jingheng Pan, Liang Ding, Longyue Wang, Weihua Luo, Xintong Wang.

Figure 1
Figure 1: Three-stage VIDA curation pipeline. …rule-based string matching. Furthermore, standard MT metrics such as BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020) do not directly verify whether an ambiguous span has been resolved correctly, since surface-overlap metrics may penalize valid paraphrases or lexical variation and sentence-level metrics are too coarse-grained for span-level disambiguation. In t… view at source ↗
Figure 2
Figure 2: Example of CoT six-step reasoning resolving the ambiguity. view at source ↗
Figure 3
Figure 3: Case study of CoT-SFT vs. SFT. …tion and recognizes the intended interpretation during ambiguity checking. However, in the later disambiguation step, it over-interprets the phrase by incorrectly linking it to "someone physically touching" mentioned in the grounding step, rather than the relevant cue about the product feature. As a result, the model revises an initially adequate interpretation into an inc… view at source ↗
read the original abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks that provide supportive evidence for the role of vision, we observe substantial issues in data quality and a mismatch with translation scenarios. Moreover, existing ambiguity-oriented evaluations are not well suited to broader ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art Large Vision Language Models under vanilla inference, supervised fine-tuning (SFT), and our chain-of-thought SFT (CoT-SFT) show that while SFT improves overall translation quality, CoT-SFT yields more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets, indicating a stronger generalization for resolving diverse ambiguity types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the VIDA dataset of 2,500 instances in which resolving an annotated ambiguous source span in machine translation requires visual evidence. It proposes Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to assess whether the ambiguous expression is correctly resolved at the span level. Experiments on two state-of-the-art large vision-language models compare vanilla inference, standard supervised fine-tuning (SFT), and chain-of-thought SFT (CoT-SFT), concluding that CoT-SFT produces more consistent gains in disambiguation accuracy, particularly on out-of-distribution subsets.

Significance. If the dataset curation ensures genuine visual dependence and the LLM judge is shown to be reliable, the work would supply a needed benchmark for evaluating visual grounding in multimodal MT and would demonstrate a practical benefit of explicit reasoning traces for handling diverse ambiguity types beyond existing datasets.

major comments (1)
  1. [Disambiguation-Centric Metrics and Experiments] The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.
minor comments (1)
  1. [Abstract] The abstract states that prior ambiguity-oriented evaluations suffer from data-quality issues and mismatch with translation scenarios; a short concrete example of one such issue would help readers immediately grasp the motivation for VIDA.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the concern regarding the LLM-as-a-judge validation below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.

    Authors: We agree that the absence of explicit validation for the LLM judge represents a limitation that could affect confidence in the disambiguation accuracy results and the comparative claims for CoT-SFT. The manuscript describes the judge prompt design intended to focus strictly on span-level resolution of the annotated ambiguous expression, independent of overall translation quality or reasoning style. However, no human validation, accuracy metrics, bias quantification, or ablation was included. In the revised version, we will add a dedicated subsection reporting a human validation study: three annotators will evaluate a stratified sample of 400 outputs (100 per model/setting combination across vanilla, SFT, and CoT-SFT, including OOD cases). We will report agreement rates with the LLM judge, Cohen's kappa, and any systematic preferences (e.g., toward CoT outputs). If bias is detected, we will either correct the metric or qualify the claims. We will also include a prompt ablation using an alternative judge model. These additions will directly address the potential artifact concern and support the generalization findings. revision: yes
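The agreement statistics the rebuttal promises are standard; a minimal sketch of percent agreement and Cohen's kappa over paired human/judge labels, assuming binary resolved/not-resolved annotations:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary raters (e.g. human vs. LLM judge).

    labels_a, labels_b: equal-length sequences of 0/1 labels.
    """
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal positive rate.
    pa = sum(labels_a) / n
    pb = sum(labels_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # degenerate case: both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 on the proposed 400-item stratified sample would support the judge; a markedly lower kappa on CoT-SFT outputs than on SFT outputs would indicate exactly the judge preference the referee worries about.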

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a purely empirical contribution centered on dataset curation (VIDA with 2,500 instances) and experimental evaluation of LVLMs under different training regimes. No mathematical derivations, fitted parameters renamed as predictions, or self-referential chains appear in the abstract or described methodology. The Disambiguation-Centric Metrics and LLM-as-a-judge are presented as measurement tools defined from the new annotations rather than reducing to prior results by construction. All claims rest on direct experimental comparisons, so the work does not presuppose its own conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on two unverified premises: accurate manual identification of spans whose resolution requires vision, and reliable performance of the LLM judge for span-level correctness. No external benchmarks or formal validation of these premises are described.

axioms (2)
  • domain assumption Annotated ambiguous spans require visual evidence for correct resolution
    Stated as the defining property of the VIDA dataset instances
  • ad hoc to paper LLM-as-a-judge classifier accurately verifies span-level disambiguation
    Proposed metric depends on this without reported validation or error analysis

pith-pipeline@v0.9.0 · 5516 in / 1387 out tokens · 65674 ms · 2026-05-08T19:29:55.976722+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 75 canonical work pages · 11 internal anchors

  1. [1] Multimodal Lexical Translation. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  2. [2] Yao, Shaowei and Wan, Xiaojun. Multimodal Transformer for Multimodal Machine Translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.400
  3. [3] Ru Wang and Selena Song and Liang Ding and Shixiang Shane Gu and Mingming Gong and Yusuke Iwasawa and Yutaka Matsuo and Jiaxian Guo.
  4. [4] Ma, Xinyu and Liu, Xuebo and Wong, Derek F. and Rao, Jun and Li, Bei and Ding, Liang and Chao, Lidia S. and Tao, Dacheng and Zhang, Min. 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024.
  5. [5] Wang, Jiaan and Meng, Fandong and Liang, Yunlong and Zhou, Jie. DRT: Deep Reasoning Translation via Long Chain-of-Thought. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.351
  6. [7] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
  7. [8] Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135
  8. [9] Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon. COMET: A Neural Framework for MT Evaluation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.213
  9. [10] LLaVA-OneVision: Easy Visual Task Transfer. arXiv preprint arXiv:2408.03326.
  10. [11] Qwen2 Technical Report. arXiv.
  11. [12] Popović, Maja. chrF: character n-gram F-score for automatic MT evaluation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. doi:10.18653/v1/W15-3049
  12. [13] Snover, Matthew and Dorr, Bonnie and Schwartz, Rich and Micciulla, Linnea and Makhoul, John. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers. 2006.
  13. [14] Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. 2019.
  14. [15] Banerjee, Satanjeev and Lavie, Alon. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005.
  15. [17] A Survey on Multi-Modal Machine Translation: Tasks, Methods and Challenges. arXiv preprint arXiv:2405.12669.
  16. [18] Do You See What I Mean? Visual Resolution of Linguistic Ambiguities. arXiv preprint arXiv:1603.08079.
  17. [19] Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer. arXiv preprint arXiv:2205.11631.
  18. [20] s1: Simple Test-Time Scaling. arXiv preprint arXiv:2501.19393.
  19. [22] Yadav, Neemesh and Masud, Sarah and Goyal, Vikram and Akhtar, Md Shad and Chakraborty, Tanmoy. Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.831
  20. [23] Xu, Rongwu and Zhou, Zian and Zhang, Tianwei and Qi, Zehan and Yao, Su and Xu, Ke and Xu, Wei and Qiu, Han. Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.476
  21. [24] Lee, Beomseok and Kim, Hyunwoo and Kim, Keon and Choi, Yong Suk. XDetox: Text Detoxification with Token-Level Toxicity Explanations. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.848
  22. [25] Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
  23. [26] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.
  24. [27] Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan.
  25. [28] LoRA: Low-Rank Adaptation of Large Language Models. ICLR. 2022.
  26. [29] Improve Vision Language Model Chain-of-Thought Reasoning. arXiv preprint arXiv:2410.16198.
  27. [30] Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models. arXiv preprint arXiv:2309.04461.
  28. [31] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. 2022.
  29. [32] Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems. 2022.
  30. [33] Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171.
  31. [34] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv preprint arXiv:2205.10625.
  32. [35] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  33. [36] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems. 2023.
  34. [37] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv preprint arXiv:2211.12588.
  35. [38] Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models. arXiv preprint arXiv:2305.16582.
  36. [39] Teaching Small Language Models to Reason. arXiv preprint arXiv:2212.08410.
  37. [40] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. arXiv preprint arXiv:2305.02301.
  38. [41] Through the Valley: Path to Effective Long CoT Training for Small Language Models. arXiv preprint arXiv:2506.07712.
  39. [42] Can We Afford the Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index. arXiv preprint arXiv:2412.01690.
  40. [43] Symbolic Working Memory Enhances Language Models for Complex Rule Application. arXiv preprint arXiv:2408.13654.
  41. [44] Large Language Models and Mathematical Reasoning Failures. arXiv preprint arXiv:2502.11574.
  42. [45] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models. arXiv preprint arXiv:2503.16419.
  43. [46] Hawkeye: Efficient Reasoning with Model Collaboration. arXiv preprint arXiv:2504.00424.
  44. [47] Liu, Danyang and Kong, Fanjie and Sun, Xiaohang and Patil, Dhruva and Vajpayee, Avijit and Liu, Zhu and Bhat, Vimal and Sadoughi, Najmeh. Detect, Disambiguate, and Translate: On-Demand Visual Reasoning for Multimodal Machine Translation with Large Vision-Language Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics. 2025.
  45. [48] 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training. arXiv preprint arXiv:2503.19633.
  46. [49] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
  47. [50] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models. arXiv preprint arXiv:2505.23945.
  48. [51] MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2507.09184.
  49. [52] OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.
  50. [53] GPT-5 System Card. 2025.
  51. [54] LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  52. [55] Qwen Technical Report. arXiv preprint arXiv:2309.16609.
  53. [56] Visual Instruction Tuning. Advances in Neural Information Processing Systems. 2023.
  54. [57] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. 2023.
  55. [58] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. International Conference on Machine Learning. 2023.
  56. [59] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
  57. [60] Multi30K: Multilingual English-German Image Descriptions. arXiv preprint arXiv:1605.00459.
  58. [61] Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
  59. [62] MSCTD: A Multimodal Sentiment Chat Translation Dataset. arXiv preprint arXiv:2202.13645.
  60. [63] Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. Proceedings of the IEEE International Conference on Computer Vision. 2015.
  61. [64] Rios Gonzales, Annette and Mascarell, Laura and Sennrich, Rico. Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings. Proceedings of the Second Conference on Machine Translation. 2017. doi:10.18653/v1/W17-4702
  62. [65] Tiedemann, J. Neural Machine Translation with Extended Context. Proceedings of the Third Workshop on Discourse in Machine Translation. 2017. doi:10.18653/v1/W17-4811
  63. [66] TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment. arXiv preprint arXiv:2505.21172.
  64. [67] Moro, Andrea and Navigli, Roberto. SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). 2015. doi:10.18653/v1/S15-2049
  65. [68] Using Language Models to Disambiguate Lexical Choices in Translation. arXiv preprint arXiv:2411.05781.
  66. [69] Dictionary-Based Phrase-Level Prompting of Large Language Models for Machine Translation. arXiv preprint arXiv:2302.07856.
  67. [70] Iyer, Vivek and Chen, Pinzhen and Birch, Alexandra. Towards Effective Disambiguation for Machine Translation with Large Language Models. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.44
  68. [71] Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. arXiv preprint arXiv:1710.07177.
  69. [72] A Novel Graph-Based Multi-Modal Fusion Encoder for Neural Machine Translation. arXiv preprint arXiv:2007.08742.
  70. [73] Doubly-Attentive Decoder for Multi-Modal Neural Machine Translation. arXiv preprint arXiv:1702.01287.
  71. [74] Attention-Based Multimodal Neural Machine Translation. Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers. 2016.
  72. [75] Dual-Branch Prompting for Multimodal Machine Translation. arXiv preprint arXiv:2507.17588.
  73. [76] Gao, Yue and Zhao, Jing and Sun, Shiliang and Qiao, Xiaosong and Song, Tengfei and Yang, Hao. Multimodal Machine Translation with Text-Image In-Depth Questioning. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.483
  74. [77] Multimodal Machine Translation with Visual Scene Graph Pruning. arXiv preprint arXiv:2505.19507.
  75. [78] ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning. arXiv preprint arXiv:2507.07306.
  76. [79] Elliott, Desmond. Adversarial Evaluation of Multimodal Machine Translation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1329
  77. [80] Wu, Zhiyong and Kong, Lingpeng and Bi, Wei and Li, Xiang and Kao, Ben. Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021. doi:10.18653/v1/2021.acl-long.480
  78. [81] Bawden, Rachel and Sennrich, Rico and Birch, Alexandra and Haddow, Barry. Evaluating Discourse Phenomena in Neural Machine Translation. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1118
  79. [82] Elliott, Desmond and Frank, Stella and Barrault, Loïc. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. Proceedings of the Second Conference on Machine Translation. 2017. doi:10.18653/v1/W17-4718
  80. [83] That Was the Last Straw, We Need More: Are Translation Systems Sensitive to Disambiguating Context? arXiv preprint arXiv:2310.14610.

Showing first 80 references.