Kwai Keye-VL-2.0 Technical Report

Bin Wen; Changyi Liu; Chengru Song; Chongling Rao; Chuan Yi; Fan Yang; Feng Han; Guowang Zhang; Haixuan Gao; Hang Li

arxiv: 2606.10651 · v1 · pith:XGXUEWFMnew · submitted 2026-06-09 · 💻 cs.CV


Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang


Pith reviewed 2026-06-27 13:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal foundation model · mixture of experts · long video understanding · sparse attention · on-policy distillation · temporal localization · agentic intelligence · context length scaling


The pith

A 30B MoE multimodal model handles 256K-token video contexts with only 3B active parameters and reports leading results on long-video benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an open-source Mixture-of-Experts model that adapts sparse attention to multimodal architectures so that hour-long videos can be handled without the cost of full attention. It pairs this with a distillation technique that uses on-policy rollouts from multiple teachers to align the model across tasks while keeping only a small subset of parameters active. The authors report state-of-the-art results among comparable models on temporal grounding and long-video comprehension tasks, and they release the checkpoints. A reader would care because the approach claims to make long-context multimodal reasoning practical at lower active compute.

Core claim

Keye-VL-2.0-30B-A3B is the first model to adapt DeepSeek Sparse Attention to GQA-based multimodal setups, supporting lossless 256K context while selecting critical frames. Cross-Modal Multi-Teacher On-Policy Distillation combined with Context-RL and Video-RL prevents catastrophic forgetting during multi-task alignment, allowing the MoE backbone to deliver strong agentic performance in code, tool, and search scenarios with multimodal self-correction. The model reaches state-of-the-art results among similar-scale systems on TimeLens for fine-grained temporal localization and on Video-MME-v2 and LongVideoBench for long-video comprehension.
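The paper's DSA adaptation and kernels are not visible in the text reviewed here. As a rough illustration of the general idea behind top-k sparse attention (score every key cheaply, keep only the k best per query, then run softmax attention over that subset), a minimal sketch follows. The function name `topk_sparse_attention`, the use of a plain dot product as the selection score, and the toy inputs are all assumptions for illustration, not the authors' method.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only the k best-scoring keys.

    Hypothetical illustration: the selection score is a plain dot product;
    a real system would likely use a separate lightweight scorer.
    """
    scores = [dot(query, key) for key in keys]
    # Indices of the k highest-scoring keys (the "critical" positions).
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Scaled softmax over the selected subset only.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scores[i] * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return output, sorted(selected)
```

Setting `k` equal to the number of keys recovers ordinary dense attention for that query, which makes the sketch easy to sanity-check against a dense baseline.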

What carries the argument

Adaptation of sparse attention to GQA-based multimodal architectures together with Cross-Modal Multi-Teacher On-Policy Distillation that feeds token-level teacher signals from on-policy rollouts back into the 3B-active-parameter MoE backbone.
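The exact MOPD objective is not given in the text available here. One common formulation of on-policy distillation has the student generate a sequence, then penalizes the per-token reverse KL divergence between the student's and a teacher's next-token distributions along that sequence. The sketch below follows that generic recipe; the function names, the `teacher_logits_by_task` dictionary, and the idea of picking one teacher per task are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at one position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def distillation_loss(student_rollout_logits, teacher_logits_by_task, task):
    """Mean per-token reverse KL against the teacher assigned to `task`.

    student_rollout_logits: per-position student logits along a sequence the
        student itself generated (the on-policy part).
    teacher_logits_by_task: hypothetical dict mapping a task name to that
        teacher's logits at the same positions.
    """
    teacher = teacher_logits_by_task[task]
    per_token = [
        reverse_kl(s, t) for s, t in zip(student_rollout_logits, teacher)
    ]
    return sum(per_token) / len(per_token)
```

The signal is dense in the sense that every generated position contributes a term, in contrast to a single scalar reward per rollout.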

If this is right

Hour-level videos can be processed while retaining long-range temporal dependencies at manageable compute cost.
Multi-task alignment for agent collaboration becomes feasible without the model forgetting prior capabilities.
Only 3B parameters need activation during inference while still supporting advanced multimodal self-correction.
Custom kernels and heterogeneous parallelism can scale training and inference throughput for video inputs.
Open release of checkpoints enables community extension to new agentic applications.
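The "only 3B parameters active" point above rests on sparse expert routing: a router scores the experts for each token and only the top few run. The paper's router, expert count, and experts-per-token are not stated in the text reviewed, so the sketch below is a generic top-k gating example with hypothetical names and toy experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their gates."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, top_k):
    """Weighted sum of the outputs of only the selected experts.

    `experts` is a list of callables; unselected experts are never called,
    which is where the compute saving comes from.
    """
    gates = route_token(router_logits, top_k)
    return sum(weight * experts[i](x) for i, weight in gates.items())
```

With the figures quoted in the paper, roughly 3B of 30B parameters (about one tenth) take part in any single forward pass; the per-layer split behind that ratio is not given here.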

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparse-attention plus distillation pattern could be tested on non-video modalities to check if the efficiency gains transfer.
If the infrastructure optimizations generalize, similar active-parameter ratios might appear in other large multimodal systems.
Longer contexts beyond 256K could be explored by extending the same attention adaptation.
The agentic results suggest the model could be evaluated on interactive tasks that require sustained self-correction over many turns.

Load-bearing premise

The reported benchmark scores reflect performance that would hold under standard prompting and without test-set-specific tuning or data selection.

What would settle it

Reproduction on a fresh long-video benchmark with no training overlap showing the model no longer leads similar-scale open models on temporal localization or video comprehension metrics.

Figures

Figures reproduced from arXiv: 2606.10651 by Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Chuan Yi, Fan Yang, Feng Han, Guowang Zhang, Haixuan Gao, Hang Li, Han Li, Haonan Fan, Haonan Jia, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Jinghui Jia, Jing Wang, Junmin Chen, Junyu Shi, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Kwai Keye Team, Lele Yang, Lingzhi Zhou, Mingqiao Liu, Muxi Diao, Na Nie, Qile Su, Qi Zhang, Ruilin Zhang, Sen Na, Tianke Zhang, Tianming Liang, Tingting Gao, Wei Chen, Weixin Xu, Wentao Hong, Xiaoxiao Ma, Xingyu Lu, Xuanyu Zheng, Yancheng Long, Yang Tian, Yankai Yang, Yingxin Li, Yiyang Fan, Yufei Han, Yulong Chen, Yu Xia, Yuzhe Chen, Ziliang Lai.

**Figure 2.** The Keye-VL-2.0-30B-A3B pre-training pipeline.

**Figure 3.** An example of scene-wise dense caption. Each video is decomposed into scenes annotated with timestamps, dense captions, and a global overview.

Adjacent source text (Section 3.6, Video Pre-Training Curriculum): To scale from short-video understanding to high-resolution long-video reasoning, we adopt a multi-stage video curriculum, summarized in…

**Figure 4.** Inference cost of Keye-VL-2.0-30B-A3B. DSA-specific prefill and decode optimizations reduce the cost of ultra-long video inference relative to dense attention under the same H800 pricing assumption.

Adjacent source text (Section 5.3, Efficient Inference for GQA+DSA): For ultra-long video inference, we introduce three optimizations. Chunk ViT: video frames are split into chunks, processed sequentially by the ViT, and then merged, reducin…

**Figure 5.** Overall evaluation summary of Keye-VL-2.0-30B-A3B.

**Figure 6.** Text case for logical constraint solving.

**Figure 7.** Image case for spatial layout understanding.

**Figure 8.** Image case for anatomical diagram understanding.

**Figure 9.** Video case for long-form scene-level understanding.

**Figure 10.** Video case for scene-level daily vlog understanding.

**Figure 11.** Agent case for multi-domain service orchestration.
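The Chunk ViT step mentioned in the Figure 4 entry (split frames into chunks, run the vision encoder on each chunk in turn, merge the results) can be sketched in a few lines. This is a minimal sketch under stated assumptions: `encode_in_chunks` and the `encoder` callable are hypothetical names, and the real encoder, chunk size, and merge step are not described in the text available here.

```python
def encode_in_chunks(frames, encoder, chunk_size):
    """Encode frames chunk by chunk and concatenate the per-frame features.

    `encoder` is a hypothetical callable that takes a list of frames and
    returns one feature per frame; it never sees more than `chunk_size`
    frames in a single call.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    features = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        features.extend(encoder(chunk))  # merge step: simple concatenation
    return features
```

Because each call is bounded by `chunk_size`, the encoder's per-call working set stays fixed no matter how long the video is, while the output order matches the input frame order.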

Original abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
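To make the cost argument in the abstract concrete, the arithmetic below counts query-key pairs under causal dense attention versus a top-k sparse scheme in which each query attends to at most k earlier positions. The value k = 2048 and the reading of 256K as 262,144 tokens are illustrative assumptions, not figures from the paper, and the count ignores whatever it costs to choose which keys to keep.

```python
def dense_pairs(length):
    """Query-key pairs under causal dense attention: 1 + 2 + ... + length."""
    return length * (length + 1) // 2


def sparse_pairs(length, k):
    """Query-key pairs when each query attends to at most k positions."""
    warmup = min(length, k)  # early queries have fewer than k predecessors
    return warmup * (warmup + 1) // 2 + max(0, length - k) * k


CONTEXT = 256 * 1024  # assumed reading of "256K"
TOP_K = 2048          # hypothetical selection budget per query

dense = dense_pairs(CONTEXT)
sparse = sparse_pairs(CONTEXT, TOP_K)
ratio = dense / sparse
```

Under these assumed values the dense count is 34,359,869,440 pairs and the sparse count is 534,774,784, about 64 times fewer. The gap widens as the context grows because the dense term is quadratic in length while the sparse term is linear.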

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a model release note for an open 30B MoE video model with DSA adaptation and MOPD distillation, but the SOTA claims have no visible numbers, baselines, or methods to back them up.


The main point is that Kwai Keye-VL-2.0-30B-A3B is a new open-source MoE model (3B active parameters) aimed at long-video understanding and multimodal agents. It adapts DeepSeek Sparse Attention to GQA for 256K context handling and introduces Cross-Modal Multi-Teacher On-Policy Distillation with RL components to manage alignment across tasks.

Releasing the checkpoints is the clearest positive here. Teams working on video analytics or tool-using agents can actually download and test the weights, which is more useful than closed models making similar claims. The engineering steps around sparse attention for multimodal inputs and the distillation approach to limit forgetting are concrete, even if they build on prior MoE and RL work.

The soft spot is straightforward: the abstract asserts state-of-the-art results on TimeLens, Video-MME-v2, and LongVideoBench with no tables, baselines, error bars, prompting details, or evaluation protocol. Without those, the performance numbers cannot be checked for standard issues like test-set tuning or non-generalizable setups. The full text was not available for review, so this gap remains.

The paper is mainly for practitioners who want the weights for downstream experiments rather than readers seeking rigorously documented advances. It does not supply enough evidence for a serious referee to evaluate the central claims, so it would not merit peer review in this form. If the complete manuscript includes the missing results sections with reproducible details, that would change the picture.

Referee Report

2 major / 0 minor

Summary. The paper introduces Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model for long-video understanding and agentic intelligence. It adapts DeepSeek Sparse Attention (DSA) to GQA-based architectures to enable lossless 256K context processing for hour-level videos, describes optimized training/inference infrastructure (scalable video I/O, heterogeneous ViT-LM parallelism, custom DSA kernels), and proposes Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL to address catastrophic forgetting during multi-task alignment. The model is claimed to achieve state-of-the-art performance among similar-scale models on video understanding, temporal grounding, reasoning, STEM, and agent benchmarks, with particular strength on TimeLens (fine-grained temporal localization), Video-MME-v2, and LongVideoBench (long-video comprehension). Model checkpoints are released.

Significance. If the performance claims hold under standard evaluation protocols, the work would advance efficient long-context multimodal modeling by showing how sparse attention and on-policy distillation can be combined in an MoE backbone (activating only 3B parameters) for video and agentic tasks. The open release of checkpoints is a concrete community benefit. However, the current manuscript provides no visible experimental details, so the significance cannot yet be assessed.

major comments (2)

[Abstract] Abstract: the central claim that Keye-VL-2.0-30B-A3B 'achieves state-of-the-art performance among models of similar scale' is presented with no accompanying evaluation details, baselines, metrics, error bars, prompting protocols, or result tables. This is load-bearing for the paper's primary assertion.
[Evaluation (missing)] No evaluation section is visible in the supplied manuscript text. Without benchmark protocols, data splits, comparison tables, or ablation studies, the SOTA statements on TimeLens, Video-MME-v2, and LongVideoBench cannot be verified and the risk of undisclosed test-set tuning or non-standard prompting cannot be ruled out.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We agree that the absence of an evaluation section prevents verification of the SOTA claims and will add a comprehensive Evaluation section with all protocols, tables, baselines, and metrics in the revision.


Referee: [Abstract] Abstract: the central claim that Keye-VL-2.0-30B-A3B 'achieves state-of-the-art performance among models of similar scale' is presented with no accompanying evaluation details, baselines, metrics, error bars, prompting protocols, or result tables. This is load-bearing for the paper's primary assertion.

Authors: We accept this criticism. The abstract condenses results that are described at a high level in the manuscript, but without the supporting evaluation details the claim cannot stand alone. In revision we will either qualify the abstract language or add explicit forward references to the new Evaluation section while retaining the overall claim. revision: yes
Referee: [Evaluation (missing)] No evaluation section is visible in the supplied manuscript text. Without benchmark protocols, data splits, comparison tables, or ablation studies, the SOTA statements on TimeLens, Video-MME-v2, and LongVideoBench cannot be verified and the risk of undisclosed test-set tuning or non-standard prompting cannot be ruled out.

Authors: The referee correctly observes that no Evaluation section appears in the text provided for review. This is an omission in the current draft. We will insert a full Evaluation section that reports benchmark protocols, data splits, comparison tables against similar-scale models, prompting templates, error bars where applicable, and ablation studies on DSA, MOPD, and RL components. This will directly address verification concerns. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The provided abstract and description contain no mathematical derivations, equations, or 'predictions' that reduce to inputs by construction. Claims rest on empirical benchmark results and architectural descriptions (DSA adaptation, MOPD) without self-referential fitting, self-citation load-bearing for uniqueness theorems, or renaming of known results. No load-bearing steps match the enumerated circularity patterns, so the report is treated as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering technical report whose claims rest on empirical performance rather than mathematical derivations. No explicit free parameters, axioms, or invented entities are defined in the abstract; model scale (30B/3B) and context length (256K) are design choices.

pith-pipeline@v0.9.1-grok · 6040 in / 1111 out tokens · 19468 ms · 2026-06-27T13:51:11.878334+00:00 · methodology



Advances in Neural Information Processing Systems , volume=

Datacomp: In search of the next generation of multimodal datasets , author=. Advances in Neural Information Processing Systems , volume=

[26] [26]

2022 , howpublished =

COYO-700M: Image-Text Pair Dataset , author =. 2022 , howpublished =

2022

[27] [27]

2025 , eprint=

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding , author=. 2025 , eprint=

2025

[28] [28]

R efer I t G ame: Referring to Objects in Photographs of Natural Scenes

Kazemzadeh, Sahar and Ordonez, Vicente and Matten, Mark and Berg, Tamara. R efer I t G ame: Referring to Objects in Photographs of Natural Scenes. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ). 2014. doi:10.3115/v1/D14-1086

work page doi:10.3115/v1/d14-1086 2014

[29] [29]

International Journal of Computer Vision , year=

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , author=. International Journal of Computer Vision , year=

[30] [30]

arXiv preprint arXiv:2309.16511 , year=

Toloka visual question answering benchmark , author=. arXiv preprint arXiv:2309.16511 , year=

arXiv

[31] [31]

arXiv preprint arXiv:2010.11929 , year=

An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

Pith/arXiv arXiv 2010

[32] [32]

arXiv preprint arXiv:2411.10442 , year=

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization , author=. arXiv preprint arXiv:2411.10442 , year=

Pith/arXiv arXiv

[33] [33]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[34] [34]

arXiv preprint arXiv:2502.09925 , year=

Taskgalaxy: Scaling multi-modal instruction fine-tuning with tens of thousands vision task types , author=. arXiv preprint arXiv:2502.09925 , year=

arXiv

[35] [35]

arXiv preprint arXiv:2502.10391 , year=

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment , author=. arXiv preprint arXiv:2502.10391 , year=

arXiv

[36] [36]

arXiv preprint arXiv:2402.03300 , year=

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

Pith/arXiv arXiv

[37] [37]

Thinking with Images

DeepEyes: Incentivizing" Thinking with Images" via Reinforcement Learning , author=. arXiv preprint arXiv:2505.14362 , year=

Pith/arXiv arXiv

[38] [38]

arXiv preprint arXiv:2503.07365 , year=

Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning , author=. arXiv preprint arXiv:2503.07365 , year=

Pith/arXiv arXiv

[39] [39]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

[40] [40]

arXiv preprint arXiv:2504.10479 , year=

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models , author=. arXiv preprint arXiv:2504.10479 , year=

Pith/arXiv arXiv

[41] [41]

2025 , eprint=

MiMo-VL Technical Report , author=. 2025 , eprint=

2025

[42] [42]

Advances in Neural Information Processing Systems , volume=

Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution , author=. Advances in Neural Information Processing Systems , volume=

[43] [43]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[44] [44]

Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part IV 14 , pages=

A diagram is worth a dozen images , author=. Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part IV 14 , pages=. 2016 , organization=

2016

[45] [45]

arXiv preprint arXiv:2502.09696 , year=

Zerobench: An impossible visual benchmark for contemporary large multimodal models , author=. arXiv preprint arXiv:2502.09696 , year=

arXiv

[46] [46]

arXiv preprint arXiv:2501.13826 , year=

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos , author=. arXiv preprint arXiv:2501.13826 , year=

Pith/arXiv arXiv

[47] [47]

arXiv preprint arXiv:2409.17146 , year=

Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models , author=. arXiv preprint arXiv:2409.17146 , year=

Pith/arXiv arXiv

[48] [48]

5-omni technical report , author=

Qwen2. 5-omni technical report , author=. arXiv preprint arXiv:2503.20215 , year=

Pith/arXiv arXiv

[49] [49]

arXiv preprint arXiv:2503.24290 , year=

Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model , author=. arXiv preprint arXiv:2503.24290 , year=

Pith/arXiv arXiv

[50] [50]

ACM Transactions on Multimedia Computing, Communications and Applications , year=

MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples , author=. ACM Transactions on Multimedia Computing, Communications and Applications , year=

[51] [51]

arXiv preprint arXiv:2412.05271 , year=

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author=. arXiv preprint arXiv:2412.05271 , year=

Pith/arXiv arXiv

[52] [52]

arXiv preprint arXiv:2312.11805 , year=

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

Pith/arXiv arXiv

[53] [53]

arXiv preprint arXiv:2407.07895 , year=

Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models , author=. arXiv preprint arXiv:2407.07895 , year=

Pith/arXiv arXiv

[54] [54]

arXiv preprint arXiv:2503.20020 , year=

Gemini robotics: Bringing ai into the physical world , author=. arXiv preprint arXiv:2503.20020 , year=

Pith/arXiv arXiv

[55] [55]

arXiv preprint arXiv:2410.21276 , year=

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

Pith/arXiv arXiv

[56] [56]

arXiv preprint arXiv:2504.07491 , year=

Kimi-vl technical report , author=. arXiv preprint arXiv:2504.07491 , year=

Pith/arXiv arXiv

[57] [57]

arXiv preprint arXiv:2412.01282 , year=

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model , author=. arXiv preprint arXiv:2412.01282 , year=

Pith/arXiv arXiv

[58] [58]

arXiv preprint arXiv:2501.01957 , year=

Vita-1.5: Towards gpt-4o level real-time vision and speech interaction , author=. arXiv preprint arXiv:2501.01957 , year=

Pith/arXiv arXiv

[59] [59]

arXiv preprint arXiv:2410.10441 , year=

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs , author=. arXiv preprint arXiv:2410.10441 , year=

arXiv

[60] [60]

arXiv preprint arXiv:2309.07124 , year=

Rain: Your language models can align themselves without finetuning , author=. arXiv preprint arXiv:2309.07124 , year=

arXiv

[61] [61]

arXiv preprint arXiv:2311.10122 , year=

Video-llava: Learning united visual representation by alignment before projection , author=. arXiv preprint arXiv:2311.10122 , year=

Pith/arXiv arXiv

[62] [62]

Advances in Neural Information Processing Systems , volume=

Cheap and quick: Efficient vision-language instruction tuning for large language models , author=. Advances in Neural Information Processing Systems , volume=

[63] [63]

arXiv preprint arXiv:2403.03003 , year=

Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models , author=. arXiv preprint arXiv:2403.03003 , year=

arXiv

[64] [64]

arXiv preprint arXiv:2411.13093 , year=

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension , author=. arXiv preprint arXiv:2411.13093 , year=

arXiv

[65] [65]

arXiv preprint arXiv:2503.20502 , year=

Mllm-selector: Necessity and diversity-driven high-value data selection for enhanced visual instruction tuning , author=. arXiv preprint arXiv:2503.20502 , year=

arXiv

[66] [66]

arXiv preprint arXiv:2501.04322 , year=

Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts , author=. arXiv preprint arXiv:2501.04322 , year=

arXiv

[67] [67]

arXiv preprint arXiv:2502.05177 , year=

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuray , author=. arXiv preprint arXiv:2502.05177 , year=

arXiv

[68] [68]

5-vl technical report , author=

Seed1. 5-vl technical report , author=. arXiv preprint arXiv:2505.07062 , year=

Pith/arXiv arXiv

[69] [69]

arXiv preprint arXiv:2505.08617 , year=

Openthinkimg: Learning to think with images via visual tool reinforcement learning , author=. arXiv preprint arXiv:2505.08617 , year=

Pith/arXiv arXiv

[70] [70]

arXiv preprint arXiv:2411.09968 , year=

Seeing clearly by layer two: Enhancing attention heads to alleviate hallucination in lvlms , author=. arXiv preprint arXiv:2411.09968 , year=

arXiv

[71] [71]

Advances in Neural Information Processing Systems , volume=

Controlmllm: Training-free visual prompt learning for multimodal large language models , author=. Advances in Neural Information Processing Systems , volume=

[72] [72]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[73] [73]

arXiv preprint arXiv:2409.18869 , year=

Emu3: Next-token prediction is all you need , author=. arXiv preprint arXiv:2409.18869 , year=

Pith/arXiv arXiv

[74] [74]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[75] [75]

arXiv preprint arXiv:2404.14219 , year=

Phi-3 technical report: A highly capable language model locally on your phone , author=. arXiv preprint arXiv:2404.14219 , year=

Pith/arXiv arXiv

[76] [76]

Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and Huang, Shengyi and Ivison, Hamish and Brahman, Faeze and Miranda, Lester James V and Liu, Alisa and Dziri, Nouha and Lyu, Shane and others , journal=

[77] [77]

arXiv preprint arXiv:2507.02029 , year=

RoboBrain 2.0 Technical Report , author=. arXiv preprint arXiv:2507.02029 , year=

arXiv

[78] [78]

2025 , howpublished =

Introducing OpenAI o3 and o4-mini , author=. 2025 , howpublished =

2025

[79] [79]

2025 , howpublished =

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation , author=. 2025 , howpublished =

2025

[80] [80]

2025 , howpublished=

ERNIE 4.5 Technical Report , author=. 2025 , howpublished=

2025