pith. machine review for the scientific record.

arxiv: 2605.02035 · v1 · submitted 2026-05-03 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

Chris Biemann, Jingheng Pan, Liang Ding, Longyue Wang, Weihua Luo, Xintong Wang

Pith reviewed 2026-05-08 19:29 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multimodal machine translation · visual ambiguity · disambiguation dataset · chain-of-thought fine-tuning · large vision language models · supervised fine-tuning · evaluation metrics · out-of-distribution generalization

The pith

A dataset of 2,500 translation instances shows that chain-of-thought fine-tuning helps models use visual evidence to resolve ambiguities more consistently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a new dataset to test how well machine translation models use images to resolve ambiguous words or phrases in the input text. Prior benchmarks had quality problems and did not match real translation needs, so the authors curate 2,500 examples where vision is essential for disambiguation and design metrics that check resolution at the specific ambiguous span. Experiments reveal that adding chain-of-thought reasoning to fine-tuning helps models disambiguate more reliably than standard fine-tuning, with particular benefits on examples unlike those seen during training. Readers should care because translation systems that properly ground meaning in visuals could reduce errors in settings such as image and video description.

Core claim

We introduce VIDA, a multimodal dataset consisting of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence from the corresponding image. We propose Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to determine whether the model has correctly resolved the ambiguous expressions at the span level. Through experiments with state-of-the-art large vision language models using vanilla inference, supervised fine-tuning, and chain-of-thought supervised fine-tuning, we find that chain-of-thought fine-tuning provides more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets.

What carries the argument

The VIDA dataset of 2,500 translation instances requiring visual evidence for ambiguity resolution, evaluated with Disambiguation-Centric Metrics that use an LLM-as-a-judge to verify span-level correctness.
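The paper does not reproduce its judge prompt or scoring code here; as a rough sketch of how a span-level disambiguation accuracy could be computed, with a hypothetical `judge_resolved` callable standing in for the LLM-as-a-judge classifier:

```python
# Sketch of a span-level disambiguation accuracy in the spirit of the
# paper's Disambiguation-Centric Metrics. The judge function is a stand-in:
# the paper's actual prompt and judge model are not shown in this review.

def judge_resolved(source, span, translation, image_desc):
    """Placeholder for an LLM-as-a-judge call that returns True iff the
    annotated ambiguous span is correctly resolved in the translation."""
    raise NotImplementedError("swap in a real judge call here")

def disambiguation_accuracy(instances, judge=judge_resolved):
    """Fraction of instances whose annotated span the judge marks resolved.

    Each instance is a dict with keys: source, span, translation, image_desc.
    """
    if not instances:
        return 0.0
    resolved = sum(
        bool(judge(x["source"], x["span"], x["translation"], x["image_desc"]))
        for x in instances
    )
    return resolved / len(instances)
```

Because the metric is just a mean over per-span binary judgments, any judge bias (e.g. a preference for chain-of-thought outputs) shifts the accuracy directly, which is what the referee report below flags.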

If this is right

  • Standard supervised fine-tuning improves overall translation quality but provides less consistent gains on disambiguation tasks.
  • Chain-of-thought supervised fine-tuning produces stronger and more reliable improvements in disambiguation accuracy.
  • The gains from chain-of-thought fine-tuning are especially pronounced on out-of-distribution examples.
  • The new metrics enable precise checking of whether models resolve specific ambiguous spans correctly rather than relying on overall sentence quality.
  • The approach supports evaluation across a wider range of ambiguity types than previous benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Encouraging step-by-step reasoning during training appears to help multimodal models better integrate visual context with linguistic input.
  • The dataset and metrics could be extended to additional languages or domains to test whether the same training pattern holds.
  • If the LLM judge scales reliably, it could support faster iteration on new disambiguation methods without full human evaluation.
  • Similar chain-of-thought fine-tuning might improve performance on other multimodal tasks involving ambiguous descriptions.

Load-bearing premise

The 2,500 instances are accurately annotated such that visual evidence is genuinely required to resolve each ambiguous span, and the LLM-as-a-judge classifier reliably measures correct span-level disambiguation without its own biases or errors.

What would settle it

An independent human review that identifies many dataset examples resolvable from source text alone without the image, or that shows the LLM judge disagrees with human judgments on resolution correctness in a substantial portion of cases.

Figures

Figures reproduced from arXiv: 2605.02035 by Chris Biemann, Jingheng Pan, Liang Ding, Longyue Wang, Weihua Luo, Xintong Wang.

Figure 1
Figure 1: Three-stage VIDA curation pipeline. …rule-based string matching. Furthermore, standard MT metrics such as BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020) do not directly verify whether an ambiguous span has been resolved correctly, since surface-overlap metrics may penalize valid paraphrases or lexical variation and sentence-level metrics are too coarse-grained for span-level disambiguation. In t… view at source ↗
Figure 2
Figure 2: Example of CoT six-step reasoning resolving the ambiguity. view at source ↗
Figure 3
Figure 3: Case study of CoT-SFT vs. SFT. …tion and recognizes the intended interpretation during ambiguity checking. However, in the later disambiguation step, it over-interprets the phrase by incorrectly linking it to "someone physically touching" mentioned in the grounding step, rather than the relevant cue about the product feature. As a result, the model revises an initially adequate interpretation into an inc… view at source ↗
read the original abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks that provide supportive evidence for the role of vision, we observe substantial issues in data quality and a mismatch with translation scenarios. Moreover, existing ambiguity-oriented evaluations are not well suited to broader ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art Large Vision Language Models under vanilla inference, supervised fine-tuning (SFT), and our chain-of-thought SFT (CoT-SFT) show that while SFT improves overall translation quality, CoT-SFT yields more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets, indicating a stronger generalization for resolving diverse ambiguity types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the VIDA dataset of 2,500 instances in which resolving an annotated ambiguous source span in machine translation requires visual evidence. It proposes Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to assess whether the ambiguous expression is correctly resolved at the span level. Experiments on two state-of-the-art large vision-language models compare vanilla inference, standard supervised fine-tuning (SFT), and chain-of-thought SFT (CoT-SFT), concluding that CoT-SFT produces more consistent gains in disambiguation accuracy, particularly on out-of-distribution subsets.

Significance. If the dataset curation ensures genuine visual dependence and the LLM judge is shown to be reliable, the work would supply a needed benchmark for evaluating visual grounding in multimodal MT and would demonstrate a practical benefit of explicit reasoning traces for handling diverse ambiguity types beyond existing datasets.

major comments (1)
  1. [Disambiguation-Centric Metrics and Experiments] The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.
minor comments (1)
  1. [Abstract] The abstract states that prior ambiguity-oriented evaluations suffer from data-quality issues and mismatch with translation scenarios; a short concrete example of one such issue would help readers immediately grasp the motivation for VIDA.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the concern regarding the LLM-as-a-judge validation below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.

    Authors: We agree that the absence of explicit validation for the LLM judge represents a limitation that could affect confidence in the disambiguation accuracy results and the comparative claims for CoT-SFT. The manuscript describes the judge prompt design intended to focus strictly on span-level resolution of the annotated ambiguous expression, independent of overall translation quality or reasoning style. However, no human validation, accuracy metrics, bias quantification, or ablation was included. In the revised version, we will add a dedicated subsection reporting a human validation study: three annotators will evaluate a stratified sample of 400 outputs (100 per model/setting combination across vanilla, SFT, and CoT-SFT, including OOD cases). We will report agreement rates with the LLM judge, Cohen's kappa, and any systematic preferences (e.g., toward CoT outputs). If bias is detected, we will either correct the metric or qualify the claims. We will also include a prompt ablation using an alternative judge model. These additions will directly address the potential artifact concern and support the generalization findings. revision: yes
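The agreement statistics the rebuttal promises are standard; a minimal sketch of percent agreement and Cohen's kappa over paired human/judge labels, assuming binary resolved/not-resolved annotations:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary raters (e.g. human vs. LLM judge).

    labels_a, labels_b: equal-length sequences of 0/1 labels.
    """
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal positive rate.
    pa = sum(labels_a) / n
    pb = sum(labels_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # degenerate case: both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 on the proposed 400-item stratified sample would support the judge; a markedly lower kappa on CoT-SFT outputs than on SFT outputs would indicate exactly the judge preference the referee worries about.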

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a purely empirical contribution centered on dataset curation (VIDA with 2,500 instances) and experimental evaluation of LVLMs under different training regimes. No mathematical derivations, fitted parameters renamed as predictions, or self-referential chains appear in the abstract or described methodology. The Disambiguation-Centric Metrics and LLM-as-a-judge are presented as measurement tools defined from the new annotations rather than reducing to prior results by construction. All claims rest on direct experimental comparisons, so the work does not presuppose its own conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on two unverified premises: accurate manual identification of spans whose resolution requires vision, and reliable performance of the LLM judge for span-level correctness. No external benchmarks or formal validation of these premises are described.

axioms (2)
  • domain assumption Annotated ambiguous spans require visual evidence for correct resolution
    Stated as the defining property of the VIDA dataset instances
  • ad hoc to paper LLM-as-a-judge classifier accurately verifies span-level disambiguation
    Proposed metric depends on this without reported validation or error analysis

pith-pipeline@v0.9.0 · 5516 in / 1387 out tokens · 65674 ms · 2026-05-08T19:29:55.976722+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 75 canonical work pages · 11 internal anchors

  1. [1] Multimodal Lexical Translation. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  2. [2] Yao, Shaowei and Wan, Xiaojun. Multimodal Transformer for Multimodal Machine Translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.400
  3. [3] Ru Wang and Selena Song and Liang Ding and Shixiang Shane Gu and Mingming Gong and Yusuke Iwasawa and Yutaka Matsuo and Jiaxian Guo.
  4. [4] Ma, Xinyu and Liu, Xuebo and Wong, Derek F. and Rao, Jun and Li, Bei and Ding, Liang and Chao, Lidia S. and Tao, Dacheng and Zhang, Min. 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024.
  5. [5] Wang, Jiaan and Meng, Fandong and Liang, Yunlong and Zhou, Jie. DRT: Deep Reasoning Translation via Long Chain-of-Thought. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.351
  6. [7] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
  7. [8] Papineni, Kishore and Roukos, Salim and Ward, Todd and Zhu, Wei-Jing. BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. doi:10.3115/1073083.1073135
  8. [9] Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon. COMET: A Neural Framework for MT Evaluation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.213
  9. [10] LLaVA-OneVision: Easy Visual Task Transfer. arXiv preprint arXiv:2408.03326.
  10. [11] Qwen2 Technical Report. arXiv.
  11. [12] Popović, Maja. chrF: character n-gram F-score for automatic MT evaluation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. doi:10.18653/v1/W15-3049
  12. [13] Snover, Matthew and Dorr, Bonnie and Schwartz, Rich and Micciulla, Linnea and Makhoul, John. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers. 2006.
  13. [14] Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. 2019.
  14. [15] Banerjee, Satanjeev and Lavie, Alon. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005.
  15. [17] A Survey on Multi-Modal Machine Translation: Tasks, Methods and Challenges. arXiv preprint arXiv:2405.12669.
  16. [18] Do You See What I Mean? Visual Resolution of Linguistic Ambiguities. arXiv preprint arXiv:1603.08079.
  17. [19] Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer. arXiv preprint arXiv:2205.11631.
  18. [20] s1: Simple Test-Time Scaling. arXiv preprint arXiv:2501.19393.
  19. [22] Yadav, Neemesh and Masud, Sarah and Goyal, Vikram and Akhtar, Md Shad and Chakraborty, Tanmoy. Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.831
  20. [23] Xu, Rongwu and Zhou, Zian and Zhang, Tianwei and Qi, Zehan and Yao, Su and Xu, Ke and Xu, Wei and Qiu, Han. Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.476
  21. [24] Lee, Beomseok and Kim, Hyunwoo and Kim, Keon and Choi, Yong Suk. XDetox: Text Detoxification with Token-Level Toxicity Explanations. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.848
  22. [25] Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
  23. [26] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.
  24. [27] Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan.
  25. [28] LoRA: Low-Rank Adaptation of Large Language Models. ICLR. 2022.
  26. [29] Improve Vision Language Model Chain-of-Thought Reasoning. arXiv preprint arXiv:2410.16198.
  27. [30] Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models. arXiv preprint arXiv:2309.04461.
  28. [31] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. 2022.
  29. [32] Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems. 2022.
  30. [33] Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv preprint arXiv:2203.11171.
  31. [34] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv preprint arXiv:2205.10625.
  32. [35] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  33. [36] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems. 2023.
  34. [37] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv preprint arXiv:2211.12588.
  35. [38] Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models. arXiv preprint arXiv:2305.16582.
  36. [39] Teaching Small Language Models to Reason. arXiv preprint arXiv:2212.08410.
  37. [40] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. arXiv preprint arXiv:2305.02301.
  38. [41] Through the Valley: Path to Effective Long CoT Training for Small Language Models. arXiv preprint arXiv:2506.07712.
  39. [42] Can We Afford the Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index. arXiv preprint arXiv:2412.01690.
  40. [43] Symbolic Working Memory Enhances Language Models for Complex Rule Application. arXiv preprint arXiv:2408.13654.
  41. [44] Large Language Models and Mathematical Reasoning Failures. arXiv preprint arXiv:2502.11574.
  42. [45] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models. arXiv preprint arXiv:2503.16419.
  43. [46] Hawkeye: Efficient Reasoning with Model Collaboration. arXiv preprint arXiv:2504.00424.
  44. [47] Liu, Danyang and Kong, Fanjie and Sun, Xiaohang and Patil, Dhruva and Vajpayee, Avijit and Liu, Zhu and Bhat, Vimal and Sadoughi, Najmeh. Detect, Disambiguate, and Translate: On-Demand Visual Reasoning for Multimodal Machine Translation with Large Vision-Language Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics. 2025.
  45. [48] 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training. arXiv preprint arXiv:2503.19633.
  46. [49] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
  47. [50] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models. arXiv preprint arXiv:2505.23945.
  48. [51] MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2507.09184.
  49. [52] OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.
  50. [53] GPT-5 System Card. 2025.
  51. [54] LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  52. [55] Qwen Technical Report. arXiv preprint arXiv:2309.16609.
  53. [56] Visual Instruction Tuning. Advances in Neural Information Processing Systems. 2023.
  54. [57] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. 2023.
  55. [58] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. International Conference on Machine Learning. 2023.
  56. [59] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.
  57. [60] Multi30K: Multilingual English-German Image Descriptions. arXiv preprint arXiv:1605.00459.
  58. [61] Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
  59. [62] MSCTD: A Multimodal Sentiment Chat Translation Dataset. arXiv preprint arXiv:2202.13645.
  60. [63] Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. Proceedings of the IEEE International Conference on Computer Vision. 2015.
  61. [64] Rios Gonzales, Annette and Mascarell, Laura and Sennrich, Rico. Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings. Proceedings of the Second Conference on Machine Translation. 2017. doi:10.18653/v1/W17-4702
  62. [65] Tiedemann, J. Neural Machine Translation with Extended Context. Proceedings of the Third Workshop on Discourse in Machine Translation. 2017. doi:10.18653/v1/W17-4811
  63. [66] TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment. arXiv preprint arXiv:2505.21172.
  64. [67] Moro, Andrea and Navigli, Roberto. SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). 2015. doi:10.18653/v1/S15-2049
  65. [68] Using Language Models to Disambiguate Lexical Choices in Translation. arXiv preprint arXiv:2411.05781.
  66. [69] Dictionary-Based Phrase-Level Prompting of Large Language Models for Machine Translation. arXiv preprint arXiv:2302.07856.
  67. [70] Iyer, Vivek and Chen, Pinzhen and Birch, Alexandra. Towards Effective Disambiguation for Machine Translation with Large Language Models. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.44
  68. [71] Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. arXiv preprint arXiv:1710.07177.
  69. [72] A Novel Graph-Based Multi-Modal Fusion Encoder for Neural Machine Translation. arXiv preprint arXiv:2007.08742.
  70. [73] Doubly-Attentive Decoder for Multi-Modal Neural Machine Translation. arXiv preprint arXiv:1702.01287.
  71. [74] Attention-Based Multimodal Neural Machine Translation. Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers. 2016.
  72. [75] Dual-Branch Prompting for Multimodal Machine Translation. arXiv preprint arXiv:2507.17588.
  73. [76] Gao, Yue and Zhao, Jing and Sun, Shiliang and Qiao, Xiaosong and Song, Tengfei and Yang, Hao. Multimodal Machine Translation with Text-Image In-Depth Questioning. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.483
  74. [77] Multimodal Machine Translation with Visual Scene Graph Pruning. arXiv preprint arXiv:2505.19507.
  75. [78] ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning. arXiv preprint arXiv:2507.07306.
  76. [79] Elliott, Desmond. Adversarial Evaluation of Multimodal Machine Translation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1329
  77. [80] Wu, Zhiyong and Kong, Lingpeng and Bi, Wei and Li, Xiang and Kao, Ben. Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021. doi:10.18653/v1/2021.acl-long.480
  78. [81] Bawden, Rachel and Sennrich, Rico and Birch, Alexandra and Haddow, Barry. Evaluating Discourse Phenomena in Neural Machine Translation. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1118
  79. [82] Elliott, Desmond and Frank, Stella and Barrault, Loïc. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. Proceedings of the Second Conference on Machine Translation. 2017. doi:10.18653/v1/W17-4718
  80. [83] That Was the Last Straw, We Need More: Are Translation Systems Sensitive to Disambiguating Context? arXiv preprint arXiv:2310.14610.

Showing first 80 references.