pith. machine review for the scientific record.

arxiv: 2509.22123 · v2 · pith:OYRWKDGRnew · submitted 2025-09-26 · 💻 cs.CL

Multilingual Vision-Language Models, A Survey

Pith reviewed 2026-05-18 12:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual vision-language models, language neutrality, cultural awareness, survey, contrastive learning, benchmarks, cross-lingual representations, generative architectures

The pith

Multilingual vision-language models face a core tension between language-neutral representations and cultural adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews 33 models and 23 benchmarks to examine how vision-language systems handle text and images across languages. It shows that training methods such as contrastive learning push models toward consistent cross-lingual representations that ignore cultural differences. Benchmarks largely rely on translated data to test semantic matching, which keeps evaluations focused on consistency rather than local context. The paper notes uneven performance across languages and a mismatch between what training objectives reward and what evaluations measure. A sympathetic reader would see this as evidence that current approaches leave models less useful in culturally specific real-world settings.

Core claim

The survey establishes that multilingual vision-language models exhibit a key tension between language neutrality, achieved through consistent cross-lingual representations, and cultural awareness, which requires adaptation to specific cultural contexts. Training predominantly favors neutrality via contrastive learning, while cultural awareness depends on access to diverse data. Two-thirds of the examined benchmarks use translation-based approaches that prioritize semantic consistency, although some recent work incorporates culturally grounded content. The analysis reveals discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
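
The summary above names contrastive learning as the objective that favors neutrality but gives no formula. The following is a minimal sketch, assuming a CLIP-style symmetric InfoNCE loss over matched image-text pairs; the function name `info_nce`, the temperature value, and the toy embeddings are illustrative assumptions, not taken from the survey. It shows why such an objective rewards captions, in whatever language, that land close to their image embedding.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy_rows(mat):
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(mat)

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


# Toy data: two images and two caption sets (hypothetical embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
captions_aligned = [[0.9, 0.1], [0.1, 0.9]]  # each caption near its image
captions_swapped = [[0.1, 0.9], [0.9, 0.1]]  # each caption near the wrong image

aligned = info_nce(images, captions_aligned)
swapped = info_nce(images, captions_swapped)
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

Because the loss depends only on how close each caption sits to its paired image, a translated caption that carries the same semantics is rewarded equally in any language, while nothing in the objective asks for culture-specific distinctions.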

What carries the argument

The identified tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts) that shapes both training and evaluation.

If this is right

  • Continued emphasis on contrastive learning will keep models strong on semantic consistency but limited in handling cultural nuances.
  • Shifting benchmarks away from translations toward original culturally grounded data would better expose current model limitations.
  • Uneven cross-lingual performance will persist until training data and objectives better align with evaluation goals.
  • Applications such as image description or visual question answering will remain less reliable in non-dominant cultural settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • New data collection efforts focused on parallel cultural contexts rather than translations could reduce the observed gap.
  • Model architectures might need separate pathways for neutral semantics and culture-specific signals instead of a single shared representation.
  • Evaluation on real-world user interactions in multiple languages could provide a stronger test than static benchmarks.

Load-bearing premise

The chosen set of 33 models and 23 benchmarks gives a representative and unbiased picture of the current state of multilingual vision-language research.

What would settle it

Documentation of one model that simultaneously achieves high cross-lingual consistency on translation-based tests and strong performance on original, non-translated culturally specific tasks would undermine the claimed tension.
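
The falsification condition above can be written as a simple predicate. This is a sketch under stated assumptions: the score fields, the 0.8 thresholds, and the model names are hypothetical placeholders, since the source does not say what counts as "high" or "strong".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScores:
    name: str
    translated_consistency: float  # score on translation-based tests, in [0, 1]
    cultural_accuracy: float       # score on original, non-translated cultural tasks


def is_counterexample(scores, consistency_floor=0.8, cultural_floor=0.8):
    """True if one model clears both bars at once (thresholds are assumptions)."""
    return (scores.translated_consistency >= consistency_floor
            and scores.cultural_accuracy >= cultural_floor)


# Hypothetical reports, for illustration only.
reports = [
    ModelScores("hypothetical-neutral", 0.90, 0.55),
    ModelScores("hypothetical-both", 0.88, 0.85),
]
counterexamples = [r.name for r in reports if is_counterexample(r)]
print(counterexamples)
```

A single entry in `counterexamples` would be enough to challenge the claimed tension, provided the two scores come from the same model checkpoint.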

Figures

Figures reproduced from arXiv: 2509.22123 by Andrei-Alexandru Manea, Jindřich Libovický.

Figure 1: Visual-language reasoning example from M5-VGR set. The original caption is in German “Auf dem ersten Bild ist der…
Figure 2: Overview of the VL encoders architecture. The white empty box represents a nontrainable layer that computes the loss
Figure 3: Overview of the VL decoder’s architecture and the transformer layer for a mixture of modality experts, picture extracted from…
Figure 4: Timeline and dependencies of models discussed in this survey. Models are ordered by the time of release. Arrows in the…
Figure 5: Visual-language outlier detection example from M5-VLOD set. The original caption in Thai can be translated as: "Each photo…
Figure 6: Sample from MVL-SIB input, copied from the original publication, under CC BY-SA 4.0.
Figure 7: Sample from CVQA dataset, copied from the original publication, under CC BY-SA 4.0.

Original abstract

This survey examines multilingual vision-language models that process text and images across languages. We review 33 models and 23 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript surveys multilingual vision-language models, reviewing 33 models and 23 benchmarks across encoder-only and generative architectures. It identifies a central tension between language neutrality (consistent cross-lingual representations promoted by contrastive learning) and cultural awareness (adaptation to specific cultural contexts), noting that current training favors neutrality while two-thirds of benchmarks rely on translation-based methods prioritizing semantic consistency over cultural grounding. The paper highlights discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.

Significance. If the reviewed set proves representative, the survey offers a useful synthesis of training-evaluation mismatches in multilingual VLMs and frames an actionable tension between neutrality and cultural specificity that could guide future data curation and benchmark design. The explicit contrast between contrastive training practices and translation-heavy evaluation is a clear contribution for the field.

major comments (1)
  1. [Model and benchmark selection description] In the section introducing the 33 models and 23 benchmarks (near the start of the model and benchmark review sections), the paper states how many items it reviews but supplies no inclusion/exclusion criteria, search strings, date cutoffs, or audit against a larger candidate pool. This is load-bearing for the central claim of a field-wide training-evaluation tension, because an uncharacterized selection could over-represent contrastive or English-centric models and thereby artifactually produce the reported discrepancy.
minor comments (2)
  1. [Model summary table] The table or figure summarizing the 33 models would benefit from an explicit column or note on the data sources used for cultural grounding versus contrastive pretraining.
  2. [Benchmark discussion] The discussion of recent culturally grounded benchmarks could include a short forward-looking paragraph on how to operationalize cultural awareness metrics beyond current translation baselines.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We agree that explicit documentation of the model and benchmark selection process is necessary to support the central claims regarding training-evaluation tensions. We will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: In the section introducing the 33 models and 23 benchmarks (near the start of the model and benchmark review sections), the paper states how many items it reviews but supplies no inclusion/exclusion criteria, search strings, date cutoffs, or audit against a larger candidate pool. This is load-bearing for the central claim of a field-wide training-evaluation tension, because an uncharacterized selection could over-represent contrastive or English-centric models and thereby artifactually produce the reported discrepancy.

    Authors: We acknowledge this limitation in the current draft. The manuscript does not currently detail the search strategy or explicit inclusion/exclusion criteria used to arrive at the 33 models and 23 benchmarks. In the revised version, we will insert a dedicated subsection (likely in Section 2 or as a new Section 1.3) that specifies: (1) search sources (arXiv, ACL Anthology, CVPR/ICCV/ECCV, and recent surveys), (2) keywords and date range (primarily 2020–September 2024), (3) inclusion criteria (models supporting at least one non-English language with public papers or reports; benchmarks with multilingual or cross-lingual evaluation), and (4) exclusion criteria (purely monolingual English models or benchmarks without reported multilingual results). We will also add a brief discussion of selection limitations and how we cross-checked against existing surveys to reduce English-centric bias. This addition will directly strengthen the foundation for our analysis of the neutrality–cultural awareness tension without altering the reviewed set or core findings.

    Revision: yes
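
The inclusion and exclusion criteria listed in the simulated rebuttal could be made auditable as an explicit filter. The following is a minimal sketch, assuming a hypothetical `Candidate` record and made-up pool entries; the field names, the exact window boundaries, and the treatment of the date range as a soft flag (the rebuttal says "primarily") are assumptions, not the authors' procedure.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str               # "model" or "benchmark"
    languages: frozenset    # language codes with reported results
    has_public_report: bool
    released: date


# "Primarily 2020-September 2024": kept as a soft flag, not a hard filter.
PRIMARY_WINDOW = (date(2020, 1, 1), date(2024, 9, 30))


def is_included(c):
    """Apply the stated inclusion/exclusion criteria to one candidate."""
    multilingual = bool(c.languages - {"en"})  # at least one non-English language
    if c.kind == "model":
        return multilingual and c.has_public_report
    if c.kind == "benchmark":
        return multilingual
    return False


def in_primary_window(c):
    start, end = PRIMARY_WINDOW
    return start <= c.released <= end


# Hypothetical candidate pool, for illustration only.
pool = [
    Candidate("model-a", "model", frozenset({"en", "de"}), True, date(2022, 5, 1)),
    Candidate("model-b", "model", frozenset({"en"}), True, date(2023, 1, 1)),
    Candidate("bench-c", "benchmark", frozenset({"en", "th"}), True, date(2024, 3, 1)),
    Candidate("model-d", "model", frozenset({"cs", "en"}), False, date(2021, 7, 1)),
]
selected = [c.name for c in pool if is_included(c)]
print(selected)
```

Logging which candidates fail which criterion, rather than only the survivors, would also address the referee's request for an audit against a larger candidate pool.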

Circularity Check

0 steps flagged

No circularity: survey summarizes external literature without self-referential derivations

full rationale

The paper is a literature survey that reviews 33 models and 23 benchmarks to identify a tension between language neutrality and cultural awareness. Its central claims derive from analysis of cited external works rather than any internal equations, fitted parameters, or predictions that reduce to the paper's own inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear in the abstract or described structure. The selection of reviewed items is presented as the scope of the survey; while selection criteria are not detailed, this does not create a circular reduction per the defined patterns. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no original mathematical derivations, experiments, or postulates. No free parameters, axioms, or invented entities are introduced; all content synthesizes prior published work on multilingual vision-language models.

pith-pipeline@v0.9.0 · 5617 in / 1101 out tokens · 48320 ms · 2026-05-18T12:58:03.772078+00:00 · methodology



Reference graph

Works this paper leans on

170 extracted references · 170 canonical work pages · 21 internal anchors

  1. [1]

    Pranav Aggarwal and Ajinkya Kale. 2020. Towards Zero-shot Cross-lingual Image Retrieval.CoRRabs/2012.05107 (2020). arXiv:2012.05107 https://arxiv.org/abs/2012.05107

  2. [2]

    Željko Agić and Natalie Schluter. 2018. Baselines and Test Data for Cross-Lingual Inference. InProceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Maz...

  3. [3]

    Cem Akkus, Luyang Chu, Vladana Djakovic, Steffen Jauch-Walser, Philipp Koch, Giacomo Loss, Christopher Marquardt, Marco Moldovan, Nadja Sauter, Maximilian Schneider, Rickmer Schulte, Karol Urbanczyk, Jann Goschenhofer, Christian Heumann, Rasmus Hvingelby, Daniel Schalk, and Matthias Aßenmacher. 2023. Multimodal Deep Learning.CoRRabs/2301.04856 (2023). arX...

  4. [4]

    Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze- Yin Chan, S

    Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S. Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze- Yin Chan, S. M. Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, and Alham Fikri Aji. 2024. Maya: An Instruction...

  5. [5]

    Jacob Andreas. 2022. Language Models as Agent Models. InFindings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5769–5779. doi:10.18653/ v1/2022.findings-emnlp.423

  6. [6]

    Lawrence Zitnick, and Devi Parikh

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In2015 IEEE International Conference on Computer Vision, ICCV 2015. IEEE Computer Society, Santiago, Chile, 2425–2433. doi:10.1109/ICCV.2015.279

  7. [7]

    Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker

    Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan N. Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. Aya 23: Open Weight Releases to ...

  8. [8]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical ...

  9. [9]

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2022. BEiT: BERT Pre-Training of Image Transformers. InThe Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, Virtual Event. https://openreview.net/forum?id=p-BhZSz59o4

  10. [10]

    Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. 2022. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, Sanmi Koyejo, S...

  11. [11]

    Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the Third Shared Task on Multimodal Machine Translation. InProceedings of the Third Conference on Machine Translation: Shared Task Papers, Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Hu...

  12. [12]

    Bender and Alexander Koller

    Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 5185–5198. doi:10.1865...

  13. [13]

    Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model.J. Mach. Learn. Res.3 (2003), 1137–1155. https://jmlr.org/papers/v3/bengio03a.html Manuscript submitted to ACM 26 Andrei-Alexandru Manea and Jindřich Libovický

  14. [14]

    Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. 2023. FlexiViT: One Model for All Patch Sizes. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC. IEEE, Canada, 14496–14506. doi:10.1109/C...

  15. [15]

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey A. Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias ...

  16. [16]

    Bowman, Gabor Angeli, Christopher Potts, and Christopher D

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lluís Màrquez, Chris Callison-Burch, and Jian Su (Eds.). Association for Computational Linguistics, Lisbon, Portugal...

  17. [17]

    María Alejandra Bravo, Sudhanshu Mittal, Simon Ging, and Thomas Brox. 2023. Open-vocabulary Attribute Detection. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC. IEEE, Canada, 7041–7050. doi:10.1109/CVPR52729.2023.00680

  18. [18]

    Brown, Vincent J

    Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-Basedn-gram Models of Natural Language. Computational Linguistics18, 4 (1992), 467–480. https://aclanthology.org/J92-4003/

  19. [19]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  20. [20]

    Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, and Ivan Vulic. 2022. IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages. InInternational Conference on Machine Learning, ICML 2022, 17-23 July 2022 (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stef...

  21. [21]

    Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. 2022. Cross-lingual and Multilingual CLIP. InProceedings of the Thirteenth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Maria...

  22. [22]

    Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, and Radu Soricut. 2023. MaXM: Towards Multilingual Visual Question Answering. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore...

  23. [23]

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2024. ShareGPT4V: Improving Large Multi-modal Models with Better Captions. InComputer Vision - ECCV 2024 - 18th European Conference (Lecture Notes in Computer Science, Vol. 15075), Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and...

  24. [24]

    Fleet, and Geoffrey E

    Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey E. Hinton. 2022. Pix2seq: A Language Modeling Framework for Object Detection. InThe Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, Virtual Event. https://openreview.net/forum?id= e42KbIw6Wb

  25. [25]

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, A. J. Piergiovanni, Matthias...

  26. [26]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server.CoRRabs/1504.00325 (2015). arXiv:1504.00325 http://arxiv.org/abs/1504.00325

  27. [27]

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. InComputer Vision - ECCV 2020 - 16th European Conference (Lecture Notes in Computer Science, Vol. 12375), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, Gl...

  28. [28]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023- 03-30-vicuna/

  29. [29]

    Grzegorz Chrupała, Ákos Kádár, and Afra Alishahi. 2015. Learning language through pictures. InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Manuscript submitted to ACM Multilingual Vision-Language Models, A Survey 2...

  30. [30]

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, and Chunhua Shen. 2023. MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices.CoRRabs/2312.16886 (2023). arXiv:2312.16886 doi:10.48550/ARXIV.2312.16886

  31. [31]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Nat...

  32. [32]

    Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association f...

  33. [33]

    Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. 2024. EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Sr...

  34. [34]

    Dauphin, Angela Fan, Michael Auli, and David Grangier

    Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language Modeling with Gated Convolutional Networks. InProceedings of the 34th International Conference on Machine Learning, ICML 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, Sydney, NSW, Australia, 933–941. http://proceedings.mlr....

  35. [35]

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Ga...

  36. [36]

    InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023 (Proceedings of Machine Learning Research, Vol

    Scaling Vision Transformers to 22 Billion Parameters. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023 (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, Honolulu, Hawaii, 7480–7512. https://proceedings.mlr.pres...

  37. [37]

    Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey A

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M. Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey A. Gritsenko, Mario Lucic, and Neil Houlsby. 2023. Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. InAdvances i...

  38. [38]

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009. IEEE Computer Society, Miami, Florida, 248–255. doi:10.1109/CVPR.2009.5206848

  39. [39]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy ...

  40. [40]

    Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, and Jiawei Wang. 2023. Write and Paint: Generative Vision-Language Models are Unified Modal Learners. InThe Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net, Kigali, Rwanda. https: //openreview.net/forum?id=HgQR0mXQ1_a

  41. [41]

    Dimitar Dimitrov, Firoj Alam, Maram Hasanain, Abul Hasnat, Fabrizio Silvestri, Preslav Nakov, and Giovanni Da San Martino. 2024. SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes. InProceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giov...

  42. [42]

    Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. InForty-first International Conference on Machine Learning, ICML 2024. OpenReview.net, Vienna, Austria. https://openreview.net/forum?id=ONOtpXLqqw

  43. [43]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In9th International Conference on Learning Representations, ICLR 2...

  44. [44]

    Zi-Yi Dou and Graham Neubig. 2021. Word Alignment by Fine-tuning Embeddings on Parallel Corpora. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational Linguistics, Online, 2112–2128. doi:10.18653/v1/202...

  45. [45]

    Philipp Dufter and Hinrich Schütze. 2020. Identifying Elements Essential for BERT’s Multilinguality. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4423–4437. doi:10.18653/v1/2020.emnlp-main.358

  46. [46]

    Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. 2022. MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning. InFindings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics...

  47. [47]

    Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. InProceedings of the Second Conference on Machine Translation, Ondřej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matth...

  48. [48]

    Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Descriptions. InProceedings of the 5th Workshop on Vision and Language, Anya Belz, Erkut Erdem, Krystian Mikolajczyk, and Katerina Pastra (Eds.). Association for Computational Linguistics, Berlin, Germany, 70–74. doi:10.18653/v1/W16-3210

  49. [49]

    William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformer: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research22, 120 (2021), 1–39

  50. [50]

    Clayton Fields and Casey Kennington. 2023. Vision Language Transformers: A Survey.CoRRabs/2307.03254 (2023). arXiv:2307.03254 doi:10.48550/ ARXIV.2307.03254

  51. [51]

    Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, and Rachel Bawden. 2023. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Ed...

  52. [52]

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. 2024. Making LLaMA SEE and Draw with SEED Tokenizer. InThe Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net, Vienna, Austria. https://openreview.net/forum? id=0Nui91LBQS

  53. [53]

    Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2024. mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs. InProceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, and William Wang (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 7...

  54. [54]

    Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, and Goran Glavas. 2025. Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model.CoRRabs/2501.05122 (2025). arXiv:2501.05122 doi:10.48550/ARXIV.2501.05122

  55. [55]

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, ...

  56. [56]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, et al...

  57. [57]

    Reto Gubelmann. 2024. Pragmatic Norms Are All You Need – Why The Symbol Grounding Problem Does Not Apply to LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 11663–11678. doi:10.18653/v1/20...

  58. [58]

    Katharina Hämmerl, Jindřich Libovický, and Alexander Fraser. 2024. Understanding Cross-Lingual Alignment—A Survey. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 10922–10943. doi:10.18653/v1/2024.findings-acl.649

  59. [59]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. IEEE Computer Society, Las Vegas, NV, USA, 770–778. doi:10.1109/CVPR.2016.90

  60. [60]

    Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, and Xin Eric Wang. 2023. Parameter-Efficient Model Adaptation for Vision Transformers. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artif...

  61. [61]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal neural networks. In Advances in Neural Information Processing Systems, Vol. 35. 30016–30030

  62. [62]

    Hanxu Hu and Frank Keller. 2023. Meta-Learning For Vision-and-Language Cross-lingual Transfer. CoRR abs/2305.14843 (2023). arXiv:2305.14843 doi:10.48550/ARXIV.2305.14843

  63. [63]

    Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, and Kate Saenko. 2019. Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, ...

  64. [64]

    Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019. Computer Vision Foundation / IEEE, Long Beach, CA, USA, 6700–6709. doi:10.1109/CVPR.2019.00686

  65. [65]

    Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André Martins, François Yvon, and Hinrich Schütze. 2023. Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volum...

  66. [66]

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79–87

  67. [67]

    Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. 2021. MURAL: Multimodal, Multitask Representations Across Languages. In Findings of the Association for Computational Linguistics: EMNLP 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Co...

  68. [68]

    Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. 2021. MURAL: Multimodal, Multitask Retrieval Across Languages. arXiv:2109.05125 [cs.IR] https://arxiv.org/abs/2109.05125

  69. [69]

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021 (Proceedings of Machine Learning Rese...

  70. [70]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR abs/2310.0682...

  71. [71]

    Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chengru Song, Dai Meng, Di Zhang, Wenwu Ou, Kun Gai, and Yadong Mu. 2024. Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. In The Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net, Vienna, Austria....

  72. [72]

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Li...

  73. [73]

    Ákos Kádár, Desmond Elliott, Marc-Alexandre Côté, Grzegorz Chrupała, and Afra Alishahi. 2018. Lessons Learned in Multilingual Grounded Language Learning. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Anna Korhonen and Ivan Titov (Eds.). Association for Computational Linguistics, Brussels, Belgium, 402–412. doi:10.18653/v...

  74. [74]

    Antonia Karamolegkou, Phillip Rust, Ruixiang Cui, Yong Cao, Anders Søgaard, and Daniel Hershcovich. 2024. Vision-Language Models under Cultural and Inclusive Considerations. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop, Nikita Soni, Lucie Flek, Ashish Sharma, Diyi Yang, Sara Hooker, and H. Andrew Schwartz (Eds.). ACL, TBD, 53–6...

  75. [75]

    Yasmine Karoui, Rémi Lebret, Negar Foroutan Eghlidi, and Karl Aberer. 2023. Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Lingui...

  76. [76]

    Andrej Karpathy and Li Fei-Fei. 2017. Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39, 4 (2017), 664–676. doi:10.1109/TPAMI.2016.2598339

  77. [77]

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). Association for Computational Linguistics, Doha, Qatar, 787–...

  78. [78]

    Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, and Stefano Soatto. 2023. Masked Vision and Language Modeling for Multi-modal Representation Learning. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net, Kigali, Rwanda. https://openreview.net/forum?id=ZhuXksSJYWn

  79. [79]

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. CoRR abs/2307.16125 (2023). arXiv:2307.16125 doi:10.48550/ARXIV.2307.16125

  80. [80]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022 (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szep...

Showing first 80 references.