ZAYA1-VL-8B Technical Report
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 01:10 UTC · model grok-4.3
The pith
ZAYA1-VL-8B matches leading vision-language models on understanding and reasoning benchmarks despite its compact size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZAYA1-VL-8B is presented as a compact vision-language model that achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B on image understanding, reasoning, and counting benchmarks, enabled by vision-specific LoRA adapters and bidirectional attention over image tokens.
What carries the argument
Vision-specific LoRA adapters integrated into the LLM combined with bidirectional attention over image tokens, which increase modality-specific capacity and enhance visual understanding without adding more experts.
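The mechanism can be illustrated in a few lines of NumPy: a shared base projection serves every token, and a low-rank (LoRA) delta is added only at image-token positions. This is a minimal sketch under assumed names and shapes, not the authors' implementation.

```python
import numpy as np

def vision_lora_forward(x, w_base, lora_a, lora_b, is_image):
    """Base linear map for all tokens plus a low-rank 'vision LoRA'
    delta applied only on image tokens (illustrative sketch).

    x:        (seq, d_in) token activations
    w_base:   (d_in, d_out) shared base projection
    lora_a:   (d_in, r) and lora_b: (r, d_out) low-rank factors
    is_image: (seq,) boolean mask marking image tokens
    """
    out = x @ w_base                        # shared path, every token
    delta = x @ lora_a @ lora_b             # modality-specific low-rank path
    return out + delta * is_image[:, None]  # delta only where is_image is True
```

Text tokens pass through the base projection unchanged, which is how the adapter adds visual capacity without adding experts or disturbing the text path.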
Load-bearing premise
The reported benchmark scores accurately measure the model's true generalization ability rather than resulting from data overlap with the evaluation sets or selective reporting.
What would settle it
Evaluating the model on a newly created set of image understanding benchmarks with no overlap to any training data would show whether the competitive performance holds.
Original abstract
We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.
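The attention-masking scheme the abstract names (causal attention for text, bidirectional attention over image tokens, block-diagonal masking across packed sequences) is not spelled out; one plausible reading, with all names assumed, can be sketched as a boolean mask:

```python
import numpy as np

def build_attention_mask(is_image, doc_id):
    """Boolean attention mask (True = query may attend to key).

    Text tokens attend causally; image tokens may also attend to image
    tokens ahead of them; no attention crosses the boundary between
    documents packed into the same sequence. A sketch of one possible
    rule, not necessarily the paper's exact scheme.

    is_image: (seq,) bool, True where the token is an image token
    doc_id:   (seq,) int, which packed document each token belongs to
    """
    idx = np.arange(len(is_image))
    causal = idx[:, None] >= idx[None, :]               # standard causal rule
    same_doc = doc_id[:, None] == doc_id[None, :]       # sequence-packing blocks
    both_image = is_image[:, None] & is_image[None, :]  # image<->image pairs
    return same_doc & (causal | both_image)
```

Combining the three conditions as `same_doc & (causal | both_image)` keeps text strictly causal while letting every image token see its whole image, which is what "bidirectional attention over image tokens" amounts to inside a causal LLM.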
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ZAYA1-VL-8B, a compact 9.2B-parameter (1.4B active) mixture-of-experts vision-language model built on the in-house ZAYA1-8B LLM. It introduces two architectural innovations—vision-specific LoRA adapters integrated into the LLM and bidirectional attention over image tokens—and details the training pipeline including data composition, sequence packing, and attention masking. The central claim is that ZAYA1-VL-8B achieves performance competitive with Molmo2-4B and InternVL3.5-4B while surpassing Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B on image understanding, reasoning, and counting benchmarks. The model is released publicly.
Significance. If the performance claims are substantiated, the work would demonstrate a practical route to increasing modality-specific capacity in MoE VL models without expanding the expert count, which could aid development of efficient multimodal systems. The public model release supports reproducibility and further research.
major comments (2)
- [Abstract] Abstract: The claim that ZAYA1-VL 'achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B' is unsupported by any numerical scores, tables, ablation studies, error bars, or evaluation protocols in the manuscript text.
- [Training pipeline] Training pipeline description: No specific dataset lists, decontamination steps, data splits, or overlap checks with the reported benchmarks are provided, which is required to substantiate that the results reflect genuine generalization rather than data leakage or selective evaluation.
minor comments (1)
- [Abstract] The 1.4B active-parameter figure (which includes the vision encoder) is stated without a per-component breakdown; a short table or paragraph attributing parameters to each component would aid clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and will make revisions to strengthen the manuscript where the points are valid.
Point-by-point responses
-
Referee: [Abstract] Abstract: The claim that ZAYA1-VL 'achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B' is unsupported by any numerical scores, tables, ablation studies, error bars, or evaluation protocols in the manuscript text.
Authors: We agree that the abstract claim would benefit from direct substantiation. The current manuscript focuses on architectural and training details and does not embed the supporting numerical results or protocols in the provided text. In revision, we will add a concise results summary with key benchmark scores, a reference to evaluation protocols, and a note on ablations to the abstract or early sections, along with a main results table. revision: yes
-
Referee: [Training pipeline] Training pipeline description: No specific dataset lists, decontamination steps, data splits, or overlap checks with the reported benchmarks are provided, which is required to substantiate that the results reflect genuine generalization rather than data leakage or selective evaluation.
Authors: The manuscript describes data composition at each training stage along with sequence packing and attention masking. However, we acknowledge the need for greater specificity. We will revise to include an explicit table or list of datasets per stage, decontamination procedures, data splits, and benchmark overlap checks to confirm no leakage and support generalization claims. revision: yes
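As an illustration of the kind of overlap check being promised (the paper's actual decontamination procedure is unspecified), a token n-gram overlap score between a training corpus and an evaluation set can be computed as follows; real pipelines additionally normalize text and hash n-grams at corpus scale:

```python
def ngram_overlap(train_text: str, eval_text: str, n: int = 8) -> float:
    """Fraction of the evaluation text's token n-grams that also occur
    in the training text; values near 1.0 suggest leakage.
    Hypothetical sketch, not the authors' pipeline.
    """
    def grams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    train_grams, eval_grams = grams(train_text), grams(eval_text)
    if not eval_grams:
        return 0.0
    return len(eval_grams & train_grams) / len(eval_grams)
```

A typical decontamination rule would drop any training document whose overlap with a benchmark exceeds a small threshold.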
Circularity Check
No circularity: empirical model report with no derivations or self-referential predictions
full rationale
The paper is a technical report on training and benchmarking a vision-language model. It describes architecture choices (vision LoRA adapters, bidirectional image attention), data composition, sequence packing, attention masking, and reports benchmark scores against other models. No equations, first-principles derivations, or predictions appear in the abstract or described content. No self-citations, uniqueness theorems, or ansatzes are invoked. Performance claims rest on empirical evaluation rather than any closed loop that reduces to fitted inputs by construction. This matches the default case of a self-contained empirical paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · connection unclear · "architecture incorporates two key innovations: (1) vision-specific LoRA adapters ... (2) bidirectional attention over image tokens"
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery theorem · connection unclear · "Training stages ... data composition at each stage, sequence packing, and the attention masking scheme"
Reference graph
Works this paper leans on
-
[1]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021
work page 2021
-
[2]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceeding...
work page 2022
-
[3]
Transductive zero-shot and few-shot clip
Ségolène Martin, Yunshi Huang, Fereshteh Shakeri, Jean-Christophe Pesquet, and Ismail Ben Ayed. Transductive zero-shot and few-shot clip. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28816–28826, 2024
work page 2024
-
[4]
Caps-adapter: Caption-based multimodal adapter in zero-shot classification
Qijie Wang, Liu Guandu, and Bin Wang. Caps-adapter: Caption-based multimodal adapter in zero-shot classification. In ACM Multimedia 2024, 2024
work page 2024
-
[5]
Shuai Zhao, Ruijie Quan, Linchao Zhu, and Yi Yang. Clip4str: A simple baseline for scene text recognition with pre-trained vision-language model. IEEE Transactions on Image Processing, 33:6893–6904, 2024
work page 2024
-
[6]
Open-vocabulary detr with conditional matching
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, pages 106–122, 2022
work page 2022
-
[7]
Open-vocabulary semantic segmentation with mask-adapted clip
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7061–7070, June 2023
work page 2023
-
[8]
Laion-5b: an open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: an open large-scale dataset for training next generation image-text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22. Cu...
work page 2022
-
[9]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1708–1718, 2021
work page 2021
-
[10]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
work page 2025
-
[11]
Segment anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023
work page 2023
-
[12]
Deepseek-ocr: Contexts optical compression, 2025
Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression, 2025
work page 2025
-
[13]
Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, et al. Dinov3, 2025
work page 2025
-
[14]
Come-vl: Scaling complementary multi-encoder vision-language learning, 2026
Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, and Salman Khan. Come-vl: Scaling complementary multi-encoder vision-language learning, 2026
work page 2026
-
[15]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, et al. Gpt-4 technical report, 2024
work page 2024
-
[16]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, et al. Qwen3 technical report, 2025
work page 2025
-
[17]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022
work page 2022
-
[18]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023
work page 2023
-
[19]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023
work page 2023
-
[20]
Qwen3-vl technical report, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, et al. Qwen3-vl technical report, 2025
work page 2025
-
[21]
Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025
work page 2025
-
[22]
V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2026
work page 2026
-
[23]
Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025
work page 2025
-
[24]
Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
work page 2024
-
[25]
Su Jianlin. Transformer upgrade road: 4. Rotating position coding of two-dimensional positions, May 2021
work page 2021
-
[26]
Su Jianlin. Transformer upgrade road: 17. Simple Thinking of Multimodal Position Coding, Mar 2024
work page 2024
-
[27]
Rotary position embedding for vision transformer
Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision (ECCV), volume 15068 of Lecture Notes in Computer Science, pages 289–305. Springer, 2024
work page 2024
-
[28]
Haoyu Liu, Sucheng Ren, Tingyu Zhu, Peng Wang, Cihang Xie, Alan Yuille, Zeyu Zheng, and Feng Wang. Spiral rope: Rotate your rotary positional embeddings in the 2d plane. arXiv preprint arXiv:2602.03227, 2026
-
[29]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024
work page 2024
-
[30]
Llava-prumerge: Adaptive token reduction for efficient large multimodal models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. ICCV, 2025
work page 2025
-
[31]
Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, and Jifeng Dai. Pvc: Progressive visual token compression for unified image and video processing in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24939–24949, 2025
work page 2025
-
[32]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2023
work page 2023
-
[33]
From pixels to words – towards native vision-language primitives at scale
Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words – towards native vision-language primitives at scale. In The Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[34]
OpenAI. Gpt-4v(ision) system card. September 2023. Accessed: 2026-04-10
work page 2023
-
[35]
olmocr 2: Unit test rewards for document ocr, 2025
Jake Poznanski, Luca Soldaini, and Kyle Lo. olmocr 2: Unit test rewards for document ocr, 2025
work page 2025
-
[36]
LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, and Yu Rong. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning, 2025
work page 2025
-
[37]
Towards medical complex reasoning with llms through medical verifiable problems
Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, and Benyou Wang. Towards medical complex reasoning with llms through medical verifiable problems. pages 14552–14573, 01 2025
work page 2025
-
[38]
Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning, 2025
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning, 2025
work page 2025
-
[39]
OpenCUA: Open foundations for computer-use agents
Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, et al. OpenCUA: Open foundations for computer-use agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[40]
Molmoweb: Open visual web agent and open data for the open web, 2026
Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, and Ranjay Krishna. Molmoweb: Open visual web agent and open data for the open web, 2026
work page 2026
-
[41]
Pascal Benschop, Cristian Meo, Justin Dauwels, and Jelte P. Mense. Evaluation of vision-llms in surveillance video, 2025
work page 2025
-
[42]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, et al. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 2679–2713. PMLR, 06–09 Nov 2025
work page 2025
-
[43]
Gemini robotics: Bringing ai into the physical world, 2025
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bohez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, et al. Gemini robotics: Bringing ai into the physical world, 2025
work page 2025
-
[44]
From vision to action: Enabling real-world agentic VLMs
Aravilli Atchuta Ram. From vision to action: Enabling real-world agentic VLMs. In 1st Workshop on VLM4RWD @ NeurIPS 2025, 2025
work page 2025
-
[45]
DriveLM: Driving with graph visual question answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In First Vision and Language for Autonomous Driving and Robotics Workshop, 2024
work page 2024
-
[46]
Lmdrive: Closed-loop end-to-end driving with large language models
Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15120–15130, 2023
work page 2024
-
[47]
Large language models for manufacturing
Yiwei Li, Huaqin Zhao, Hanqi Jiang, Yi Pan, Zhengliang Liu, Zihao Wu, Peng Shu, Jie Tian, Tianze Yang, Shaochen Xu, Yanjun Lyu, Parker Blenk, Jacob Pence, Jason Rupram, Eliza Banu, Kenan Song, Dajiang Zhu, Xianqiao Wang, and Tianming Liu. Large language models for manufacturing. Journal of Manufacturing Systems, 86:516–545, 2026
work page 2026
-
[48]
A comprehensive survey of multimodal LLMs for scientific discovery
Liang Yan, Xu Jiang, Jian Ma, Yuhang Liu, Tian Bian, Qichao Wang, Abhishek Basu, Yu Rong, Tingyang Xu, Pengcheng Wu, Le Song, Imran Razzak, Junchi Yan, Zengfeng Huang, and Yutong Xie. A comprehensive survey of multimodal LLMs for scientific discovery. In 1st Workshop on VLM4RWD @ NeurIPS 2025, 2025
work page 2025
-
[49]
Evaluating object hallucination in large vision-language models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, December 2023. Association for Computational Linguistics
work page 2023
-
[50]
Do you see me : A multidimensional benchmark for evaluating visual perception in multimodal LLMs
Aditya Sanjiv Kanade and Tanuja Ganu. Do you see me : A multidimensional benchmark for evaluating visual perception in multimodal LLMs. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7285–7326, Rabat, Moro...
-
[52]
Dash: Detection and assessment of systematic hallucinations of vlms
Maximilian Augustin, Yannic Neuhaus, and Matthias Hein. Dash: Detection and assessment of systematic hallucinations of vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025
work page 2025
-
[53]
Boosting multimodal large language models with visual tokens withdrawal for rapid inference
Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances i...
work page 2025
-
[54]
Eureka: Intelligent feature engineering for enterprise AI cloud resource demand prediction
Hangxuan Li, Renjun Jia, Xuezhang Wu, Zeqi Zheng, Yunjie Qian, and Xianling Zhang. Eureka: Intelligent feature engineering for enterprise AI cloud resource demand prediction. In 1st Workshop on VLM4RWD @ NeurIPS 2025, 2025
work page 2025
-
[55]
Vision-language models for edge networks: A comprehensive survey
Ahmed Sharshar, Latif U. Khan, Waseem Ullah, and Mohsen Guizani. Vision-language models for edge networks: A comprehensive survey. IEEE Internet of Things Journal, 12(16):32701–32724, 2025
work page 2025
-
[56]
Efficient inference scaling for safety assurance
Ruizhong Qiu, Gaotang Li, Ting-Wei Li, Tianxin Wei, Jingrui He, and Hanghang Tong. Efficient inference scaling for safety assurance. In 1st Workshop on VLM4RWD @ NeurIPS 2025, 2025
work page 2025
-
[57]
Scene understanding via scene representation generation with vision-language models
Yuan Chen and Peng Shi. Scene understanding via scene representation generation with vision-language models. In 1st Workshop on VLM4RWD @ NeurIPS 2025, 2025
work page 2025
-
[58]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, 2024
work page 2024
-
[59]
Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen, Gang Wang, Runwei Guan, and Weiping Ding. Mmdrive: Interactive scene understanding beyond vision with multi-representational fusion. Information Fusion, 133:104314, 2026
work page 2026
-
[60]
Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models
Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Chen Yunhao, Jitao Sang, and Dit-Yan Yeung. Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[61]
AdvEDM: Fine-grained adversarial attack against VLM-based embodied agents
Yichen Wang, Hangtao Zhang, Hewen Pan, Ziqi Zhou, Xianlong Wang, Peijin Guo, Lulu Xue, Shengshan Hu, Minghui Li, and Leo Yu Zhang. AdvEDM: Fine-grained adversarial attack against VLM-based embodied agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[62]
The claude model card addendum - claude 3.5 family, 2024
Anthropic. The claude model card addendum - claude 3.5 family, 2024
work page 2024
-
[63]
Molmo2: Open weights and data for vision-language models with video understanding and grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv preprint arXiv:2601.10611, 2026
-
[64]
Perceptionlm: Open-access data and models for detailed visual understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, et al. Perceptionlm: Open-access data and models for detailed visual understanding. arXiv preprint arXiv:2504.13180, 2025
-
[65]
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. Llava-onevision-1.5: Fully open framework for democratized multimodal traini...
work page 2025
-
[66]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
work page 2024
-
[67]
Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, Hanming Deng, Jinguo Zhu, Jie Shao, Ziran Zhu, Yunpeng Liu, et al. Navil: Rethinking scaling properties of native multimodal large language models under data constraints. arXiv preprint arXiv:2510.08565, 2025
-
[68]
Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, and Xizhou Zhu. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24960–24971, 2025
work page 2025
-
[69]
Moe-llava: Mixture of experts for large vision-language models
Bin Lin, Zhenyu Tang, Yang Ye, Jinfa Huang, Junwu Zhang, Yatian Pang, Peng Jin, Munan Ning, Jiebo Luo, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia, 2026
work page 2026
-
[70]
Training foundation models on a full-stack amd platform: Compute, networking, and system design
Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Anna Golubeva, Vasu Shyam, Robert Washbourne, Rishi Iyer, Ansh Chaurasia, et al. Training foundation models on a full-stack amd platform: Compute, networking, and system design. arXiv preprint arXiv:2511.17127, 2025
-
[71]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[72]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C), February 2024
work page 2024
-
[73]
Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding, 2024
work page 2024
-
[74]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Deng, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
work page 2025
-
[75]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025
work page 2025
-
[76]
HoPE: Hybrid of position embedding for long context vision-language models
Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, and Ruiwen Xu. HoPE: Hybrid of position embedding for long context vision-language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[77]
Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. VideoroPE: What makes for good video rotary position embedding? In Forty-second International Conference on Machine Learning, 2025
work page 2025
-
[78]
Circle-roPE: Cone-like decoupled rotary positional embedding for vision-language models, 2026
Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, and Kai Han. Circle-roPE: Cone-like decoupled rotary positional embedding for vision-language models, 2026
work page 2026
-
[79]
Revisiting multimodal positional encoding in vision–language models
Jie Huang, Xuejing Liu, Sibo Song, RuiBing Hou, Hong Chang, Junyang Lin, and Shuai Bai. Revisiting multimodal positional encoding in vision–language models. In The Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[80]
Kimi-vl technical report, 2025
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report, 2025
work page 2025