Multilingual Vision-Language Models, A Survey
Pith reviewed 2026-05-18 12:58 UTC · model grok-4.3
The pith
Multilingual vision-language models face a core tension between language-neutral representations and cultural adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The survey establishes that multilingual vision-language models exhibit a key tension between language neutrality, achieved through consistent cross-lingual representations, and cultural awareness, which requires adaptation to specific cultural contexts. Training predominantly favors neutrality via contrastive learning, while cultural awareness depends on access to diverse data. Two-thirds of the examined benchmarks use translation-based approaches that prioritize semantic consistency, although some recent work incorporates culturally grounded content. The analysis reveals discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
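To make the neutrality pressure concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective. It is an illustration only, not the loss of any specific model in the survey; the function names, the temperature of 0.07, and the toy one-hot embeddings are assumptions. The loss rewards only matching each image to its paired caption, so captions in different languages that describe the same image are pulled toward the same point, and no term rewards culture-specific differences.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss; row i of image_emb is paired with row i of text_emb."""
    # L2-normalise so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(len(logits))
    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data (hypothetical): four images with mutually orthogonal embeddings.
images = np.eye(4, 8)
captions_aligned = images.copy()                 # each caption matches its image
captions_mismatched = np.roll(images, 1, axis=0)  # each caption matches another image

loss_aligned = symmetric_contrastive_loss(images, captions_aligned)
loss_mismatched = symmetric_contrastive_loss(images, captions_mismatched)
print(f"aligned: {loss_aligned:.6f}  mismatched: {loss_mismatched:.2f}")
```

In this toy setup the aligned pairing gives a loss near zero and the mismatched pairing a large one, which is the behavior that pushes translated captions toward a shared, language-neutral representation.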
What carries the argument
The identified tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts) that shapes both training and evaluation.
If this is right
- Continued emphasis on contrastive learning will keep models strong on semantic consistency but limited in handling cultural nuances.
- Shifting benchmarks away from translations toward original culturally grounded data would better expose current model limitations.
- Uneven cross-lingual performance will persist until training data and objectives better align with evaluation goals.
- Applications such as image description or visual question answering will remain less reliable in non-dominant cultural settings.
Where Pith is reading between the lines
- New data collection efforts focused on parallel cultural contexts rather than translations could reduce the observed gap.
- Model architectures might need separate pathways for neutral semantics and culture-specific signals instead of a single shared representation.
- Evaluation on real-world user interactions in multiple languages could provide a stronger test than static benchmarks.
Load-bearing premise
The chosen set of 33 models and 23 benchmarks gives a representative and unbiased picture of the current state of multilingual vision-language research.
What would settle it
Documentation of one model that simultaneously achieves high cross-lingual consistency on translation-based tests and strong performance on original, non-translated culturally specific tasks would undermine the claimed tension.
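One way such a test could be set up is sketched below. Everything here is hypothetical: the metric definitions, the 0.9 thresholds, and the toy predictions are assumptions chosen for illustration and are not taken from the survey or from any benchmark it reviews.

```python
def cross_lingual_consistency(predictions_by_language):
    """Fraction of items on which every language yields the same answer.

    predictions_by_language maps a language code to a list of answers;
    index i refers to the same translated item in every language.
    """
    per_item = list(zip(*predictions_by_language.values()))
    agreements = [len(set(answers)) == 1 for answers in per_item]
    return sum(agreements) / len(agreements)

def accuracy(predictions, gold):
    # Plain exact-match accuracy on original, non-translated items.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def undermines_tension(consistency, cultural_acc,
                       min_consistency=0.9, min_cultural_acc=0.9):
    # Hypothetical decision rule: both scores must clear their thresholds.
    return consistency >= min_consistency and cultural_acc >= min_cultural_acc

# Toy answers for four translated items in three languages.
translated = {
    "en": ["cat", "dog", "bus", "cup"],
    "de": ["cat", "dog", "bus", "cup"],
    "sw": ["cat", "dog", "car", "cup"],
}
consistency = cross_lingual_consistency(translated)

# Toy answers for four culturally grounded items.
cultural_acc = accuracy(["a", "b", "c", "d"], ["a", "b", "c", "a"])

verdict = undermines_tension(consistency, cultural_acc)
print(consistency, cultural_acc, verdict)
```

A single model passing both thresholds on real data would be the kind of counterexample described above; the toy model here passes neither.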
Original abstract
This survey examines multilingual vision-language models that process text and images across languages. We review 33 models and 23 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys multilingual vision-language models, reviewing 33 models and 23 benchmarks across encoder-only and generative architectures. It identifies a central tension between language neutrality (consistent cross-lingual representations promoted by contrastive learning) and cultural awareness (adaptation to specific cultural contexts), noting that current training favors neutrality while two-thirds of benchmarks rely on translation-based methods prioritizing semantic consistency over cultural grounding. The paper highlights discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
Significance. If the reviewed set proves representative, the survey offers a useful synthesis of training-evaluation mismatches in multilingual VLMs and frames an actionable tension between neutrality and cultural specificity that could guide future data curation and benchmark design. The explicit contrast between contrastive training practices and translation-heavy evaluation is a clear contribution for the field.
Major comments (1)
- [Model and benchmark selection description] The section introducing the 33 models and 23 benchmarks (near the start of the model and benchmark review sections): the paper states it reviews these quantities but supplies no inclusion/exclusion criteria, search strings, date cutoffs, or audit against a larger candidate pool. This is load-bearing for the central claim of a field-wide training-evaluation tension, because an uncharacterized selection could over-represent contrastive or English-centric models and thereby artifactually produce the reported discrepancy.
Minor comments (2)
- [Model summary table] The table or figure summarizing the 33 models would benefit from an explicit column or note distinguishing the data sources used for cultural grounding from those used for contrastive pretraining.
- [Benchmark discussion] The discussion of recent culturally grounded benchmarks could include a short forward-looking paragraph on how to operationalize cultural awareness metrics beyond current translation baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. We agree that explicit documentation of the model and benchmark selection process is necessary to support the central claims regarding training-evaluation tensions. We will revise the manuscript accordingly.
Point-by-point responses
Referee: The section introducing the 33 models and 23 benchmarks (near the start of the model and benchmark review sections): the paper states it reviews these quantities but supplies no inclusion/exclusion criteria, search strings, date cutoffs, or audit against a larger candidate pool. This is load-bearing for the central claim of a field-wide training-evaluation tension, because an uncharacterized selection could over-represent contrastive or English-centric models and thereby artifactually produce the reported discrepancy.
Authors: We acknowledge this limitation in the current draft. The manuscript does not currently detail the search strategy or explicit inclusion/exclusion criteria used to arrive at the 33 models and 23 benchmarks. In the revised version, we will insert a dedicated subsection (likely in Section 2 or as a new Section 1.3) that specifies: (1) search sources (arXiv, ACL Anthology, CVPR/ICCV/ECCV, and recent surveys), (2) keywords and date range (primarily 2020–September 2024), (3) inclusion criteria (models supporting at least one non-English language with public papers or reports; benchmarks with multilingual or cross-lingual evaluation), and (4) exclusion criteria (purely monolingual English models or benchmarks without reported multilingual results). We will also add a brief discussion of selection limitations and how we cross-checked against existing surveys to reduce English-centric bias. This addition will directly strengthen the foundation for our analysis of the neutrality–cultural awareness tension without altering the reviewed set or core findings.
Revision: yes
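The criteria proposed in this response could be made auditable as a small filter. The sketch below is an assumption-laden illustration: the `Candidate` fields, the `include` function, the example entries, and the strict date window are all hypothetical (the response says "primarily" 2020 to September 2024, so a hard cutoff is a simplification).

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Candidate:
    name: str
    kind: str                 # "model" or "benchmark"
    languages: tuple          # language codes, e.g. ("en", "de")
    published: date
    has_public_report: bool = True
    reports_multilingual_results: bool = True

# Hypothetical hard window standing in for "primarily 2020-September 2024".
WINDOW_START = date(2020, 1, 1)
WINDOW_END = date(2024, 9, 30)

def include(c):
    """Apply the stated inclusion/exclusion criteria to one candidate."""
    in_window = WINDOW_START <= c.published <= WINDOW_END
    non_english = any(lang != "en" for lang in c.languages)
    if c.kind == "model":
        # Models: at least one non-English language and a public paper/report.
        return in_window and non_english and c.has_public_report
    if c.kind == "benchmark":
        # Benchmarks: multilingual evaluation with reported multilingual results.
        return in_window and non_english and c.reports_multilingual_results
    return False

# Invented candidates, used only to exercise the filter.
candidates = [
    Candidate("model_a", "model", ("en", "de"), date(2022, 5, 1)),
    Candidate("model_b", "model", ("en",), date(2023, 1, 15)),
    Candidate("bench_c", "benchmark", ("en", "sw"), date(2019, 6, 1)),
    Candidate("bench_d", "benchmark", ("hi", "zh"), date(2024, 3, 10)),
]
selected = [c.name for c in candidates if include(c)]
print(selected)
```

Publishing the candidate pool alongside such a filter would let readers check whether the 33 models and 23 benchmarks follow from the stated criteria.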
Circularity Check
No circularity: survey summarizes external literature without self-referential derivations
The paper is a literature survey that reviews 33 models and 23 benchmarks to identify a tension between language neutrality and cultural awareness. Its central claims derive from analysis of cited external works rather than any internal equations, fitted parameters, or predictions that reduce to the paper's own inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear in the abstract or described structure. The selection of reviewed items is presented as the scope of the survey; while selection criteria are not detailed, this does not create a circular reduction per the defined patterns. The derivation chain is self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "We review 33 models and 23 benchmarks... identify a key tension between language neutrality... and cultural awareness... Current training methods favor neutrality through contrastive learning"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Željko Agić and Natalie Schluter. 2018. Baselines and Test Data for Cross-Lingual Inference. InProceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Maz...
work page 2018
-
[3]
Cem Akkus, Luyang Chu, Vladana Djakovic, Steffen Jauch-Walser, Philipp Koch, Giacomo Loss, Christopher Marquardt, Marco Moldovan, Nadja Sauter, Maximilian Schneider, Rickmer Schulte, Karol Urbanczyk, Jann Goschenhofer, Christian Heumann, Rasmus Hvingelby, Daniel Schalk, and Matthias Aßenmacher. 2023. Multimodal Deep Learning.CoRRabs/2301.04856 (2023). arX...
-
[4]
Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze- Yin Chan, S
Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S. Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze- Yin Chan, S. M. Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, and Alham Fikri Aji. 2024. Maya: An Instruction...
-
[5]
Jacob Andreas. 2022. Language Models as Agent Models. InFindings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 5769–5779. doi:10.18653/ v1/2022.findings-emnlp.423
work page 2022
-
[6]
Lawrence Zitnick, and Devi Parikh
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In2015 IEEE International Conference on Computer Vision, ICCV 2015. IEEE Computer Society, Santiago, Chile, 2425–2433. doi:10.1109/ICCV.2015.279
-
[7]
Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker
Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan N. Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. Aya 23: Open Weight Releases to ...
-
[8]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.13923 2025
-
[9]
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2022. BEiT: BERT Pre-Training of Image Transformers. InThe Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, Virtual Event. https://openreview.net/forum?id=p-BhZSz59o4
work page 2022
-
[10]
Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. 2022. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, Sanmi Koyejo, S...
work page 2022
-
[11]
Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the Third Shared Task on Multimodal Machine Translation. InProceedings of the Third Conference on Machine Translation: Shared Task Papers, Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Hu...
-
[12]
Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 5185–5198. doi:10.1865...
-
[13]
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model.J. Mach. Learn. Res.3 (2003), 1137–1155. https://jmlr.org/papers/v3/bengio03a.html Manuscript submitted to ACM 26 Andrei-Alexandru Manea and Jindřich Libovický
work page 2003
-
[14]
Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. 2023. FlexiViT: One Model for All Patch Sizes. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC. IEEE, Canada, 14496–14506. doi:10.1109/C...
-
[15]
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey A. Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.07726 2024
-
[16]
Bowman, Gabor Angeli, Christopher Potts, and Christopher D
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lluís Màrquez, Chris Callison-Burch, and Jian Su (Eds.). Association for Computational Linguistics, Lisbon, Portugal...
-
[17]
María Alejandra Bravo, Sudhanshu Mittal, Simon Ging, and Thomas Brox. 2023. Open-vocabulary Attribute Detection. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC. IEEE, Canada, 7041–7050. doi:10.1109/CVPR52729.2023.00680
-
[18]
Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-Basedn-gram Models of Natural Language. Computational Linguistics18, 4 (1992), 467–480. https://aclanthology.org/J92-4003/
work page 1992
-
[19]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page 2020
-
[20]
Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, and Ivan Vulic. 2022. IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages. InInternational Conference on Machine Learning, ICML 2022, 17-23 July 2022 (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stef...
work page 2022
-
[21]
Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. 2022. Cross-lingual and Multilingual CLIP. InProceedings of the Thirteenth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Maria...
work page 2022
-
[22]
Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, and Radu Soricut. 2023. MaXM: Towards Multilingual Visual Question Answering. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore...
-
[23]
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2024. ShareGPT4V: Improving Large Multi-modal Models with Better Captions. InComputer Vision - ECCV 2024 - 18th European Conference (Lecture Notes in Computer Science, Vol. 15075), Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and...
-
[24]
Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey E. Hinton. 2022. Pix2seq: A Language Modeling Framework for Object Detection. InThe Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, Virtual Event. https://openreview.net/forum?id= e42KbIw6Wb
work page 2022
-
[25]
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, A. J. Piergiovanni, Matthias...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.18565 2023
-
[26]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server.CoRRabs/1504.00325 (2015). arXiv:1504.00325 http://arxiv.org/abs/1504.00325
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[27]
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. InComputer Vision - ECCV 2020 - 16th European Conference (Lecture Notes in Computer Science, Vol. 12375), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, Gl...
-
[28]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023- 03-30-vicuna/
work page 2023
-
[29]
Grzegorz Chrupała, Ákos Kádár, and Afra Alishahi. 2015. Learning language through pictures. InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Manuscript submitted to ACM Multilingual Vision-Language Models, A Survey 2...
-
[30]
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, and Chunhua Shen. 2023. MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices.CoRRabs/2312.16886 (2023). arXiv:2312.16886 doi:10.48550/ARXIV.2312.16886
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.16886 2023
-
[31]
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Nat...
-
[32]
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association f...
-
[33]
Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. 2024. EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Sr...
-
[34]
Dauphin, Angela Fan, Michael Auli, and David Grangier
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language Modeling with Gated Convolutional Networks. InProceedings of the 34th International Conference on Machine Learning, ICML 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, Sydney, NSW, Australia, 933–941. http://proceedings.mlr....
work page 2017
-
[35]
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Ga...
-
[36]
Scaling Vision Transformers to 22 Billion Parameters. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023 (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, Honolulu, Hawaii, 7480–7512. https://proceedings.mlr.pres...
work page 2023
-
[37]
Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey A
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M. Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey A. Gritsenko, Mario Lucic, and Neil Houlsby. 2023. Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. InAdvances i...
work page 2023
-
[38]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009. IEEE Computer Society, Miami, Florida, 248–255. doi:10.1109/CVPR.2009.5206848
-
[39]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy ...
-
[40]
Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, and Jiawei Wang. 2023. Write and Paint: Generative Vision-Language Models are Unified Modal Learners. InThe Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net, Kigali, Rwanda. https: //openreview.net/forum?id=HgQR0mXQ1_a
work page 2023
-
[41]
Dimitar Dimitrov, Firoj Alam, Maram Hasanain, Abul Hasnat, Fabrizio Silvestri, Preslav Nakov, and Giovanni Da San Martino. 2024. SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes. InProceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giov...
-
[42]
Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. InForty-first International Conference on Machine Learning, ICML 2024. OpenReview.net, Vienna, Austria. https://openreview.net/forum?id=ONOtpXLqqw
work page 2024
-
[43]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In9th International Conference on Learning Representations, ICLR 2...
work page 2021
-
[44]
Zi-Yi Dou and Graham Neubig. 2021. Word Alignment by Fine-tuning Embeddings on Parallel Corpora. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational Linguistics, Online, 2112–2128. doi:10.18653/v1/202...
-
[45]
Philipp Dufter and Hinrich Schütze. 2020. Identifying Elements Essential for BERT’s Multilinguality. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4423–4437. doi:10.18653/v1/2020.emnlp-main.358
-
[46]
Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. 2022. MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning. InFindings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics...
work page 2022
-
[47]
Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. InProceedings of the Second Conference on Machine Translation, Ondřej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matth...
-
[48]
Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Descriptions. InProceedings of the 5th Workshop on Vision and Language, Anya Belz, Erkut Erdem, Krystian Mikolajczyk, and Katerina Pastra (Eds.). Association for Computational Linguistics, Berlin, Germany, 70–74. doi:10.18653/v1/W16-3210
-
[49]
William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformer: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research22, 120 (2021), 1–39
work page 2021
- [50]
-
[51]
Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, and Rachel Bawden. 2023. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Ed...
-
[52]
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. 2024. Making LLaMA SEE and Draw with SEED Tokenizer. InThe Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net, Vienna, Austria. https://openreview.net/forum? id=0Nui91LBQS
work page 2024
-
[53]
Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2024. mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs. InProceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, and William Wang (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 7...
-
[54]
Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, and Goran Glavas. 2025. Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model.CoRRabs/2501.05122 (2025). arXiv:2501.05122 doi:10.48550/ARXIV.2501.05122
-
[55]
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, ...
work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021
-
[56]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, et al...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Reto Gubelmann. 2024. Pragmatic Norms Are All You Need – Why The Symbol Grounding Problem Does Not Apply to LLMs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 11663–11678. doi:10.18653/v1/20...
-
[58]
Katharina Hämmerl, Jindřich Libovický, and Alexander Fraser. 2024. Understanding Cross-Lingual Alignment—A Survey. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 10922–10943. doi:10.18653/v1/2024.findings-acl.649
-
[59]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. IEEE Computer Society, Las Vegas, NV, USA, 770–778. doi:10.1109/CVPR.2016.90
-
[60]
Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, and Xin Eric Wang. 2023. Parameter-Efficient Model Adaptation for Vision Transformers. InThirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artif...
-
[61]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal neural networks. InAdvances in Neural Information Processing Systems, Vol. 35. 30016–30030
work page 2022
-
[62]
Hanxu Hu and Frank Keller. 2023. Meta-Learning For Vision-and-Language Cross-lingual Transfer.CoRRabs/2305.14843 (2023). arXiv:2305.14843 doi:10.48550/ARXIV.2305.14843
-
[63]
Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, and Kate Saenko. 2019. Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, Manuscript submitted to ACM Multilingual Vision-Language Models, A Survey ...
-
[64]
Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019. Computer Vision Foundation / IEEE, Long Beach, CA, USA, 6700–6709. doi:10.1109/CVPR.2019.00686
-
[65]
Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André Martins, François Yvon, and Hinrich Schütze. 2023. Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volum...
-
[66]
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1 (1991), 79–87
-
[67]
Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. 2021. MURAL: Multimodal, Multitask Representations Across Languages. In Findings of the Association for Computational Linguistics: EMNLP 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Co...
- [68]
-
[69]
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021 (Proceedings of Machine Learning Rese...
-
[70]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR abs/2310.0682...
doi:10.48550/arxiv.2310.06825
-
[71]
Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chengru Song, Dai Meng, Di Zhang, Wenwu Ou, Kun Gai, and Yadong Mu. 2024. Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. In The Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net, Vienna, Austria....
-
[72]
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Li...
-
[73]
Ákos Kádár, Desmond Elliott, Marc-Alexandre Côté, Grzegorz Chrupała, and Afra Alishahi. 2018. Lessons Learned in Multilingual Grounded Language Learning. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Anna Korhonen and Ivan Titov (Eds.). Association for Computational Linguistics, Brussels, Belgium, 402–412. doi:10.18653/v...
-
[74]
Antonia Karamolegkou, Phillip Rust, Ruixiang Cui, Yong Cao, Anders Søgaard, and Daniel Hershcovich. 2024. Vision-Language Models under Cultural and Inclusive Considerations. In Proceedings of the 1st Human-Centered Large Language Modeling Workshop, Nikita Soni, Lucie Flek, Ashish Sharma, Diyi Yang, Sara Hooker, and H. Andrew Schwartz (Eds.). ACL, TBD, 53–6...
-
[75]
Yasmine Karoui, Rémi Lebret, Negar Foroutan Eghlidi, and Karl Aberer. 2023. Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Lingui...
-
[76]
Andrej Karpathy and Li Fei-Fei. 2017. Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39, 4 (2017), 664–676. doi:10.1109/TPAMI.2016.2598339
-
[77]
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). Association for Computational Linguistics, Doha, Qatar, 787–...
-
[78]
Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, and Stefano Soatto. 2023. Masked Vision and Language Modeling for Multi-modal Representation Learning. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net, Kigali, Rwanda. https://openreview.net/forum?id=ZhuXksSJYWn
-
[79]
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. CoRR abs/2307.16125 (2023). arXiv:2307.16125 doi:10.48550/ARXIV.2307.16125
-
[80]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022 (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szep...