LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Bin Qin; Bo Li; Changrui Chen; Chunjiang Ge; Chunsheng Wu; Chunyuan Li; Dehua Song; Didi Zhu; Feilong Tang; Huajie Tan

arxiv: 2605.25979 · v1 · pith:RAYFS6PWnew · submitted 2026-05-25 · 💻 cs.CV

Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

Pith reviewed 2026-06-29 22:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · video understanding · tokenization · temporal grounding · spatial grounding · multimodal benchmarks · JumpScore · codec compression

The pith

LLaVA-OneVision-2 uses codec-stream tokenization to allocate limited visual tokens to motion-bearing video content via bit-cost dynamics and motion residuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLaVA-OneVision-2 as the next model in its series, built around a native encoder with windowed attention and a new codec-stream tokenization that processes compressed video as a continuous bit-cost stream. Bit-cost changes drive adaptive temporal grouping while motion residuals pick salient spatial patches into compact canvases, and a shared 3D RoPE unifies the coordinate system for canvases, frames, and images. This is paired with large open datasets of 8 million re-captioned videos and 4 million spatial samples, plus a new JumpScore benchmark for fine-grained temporal localization in repeated high-frequency motion. The central claim is that these choices produce unified perception across video, spatial, and tracking tasks with substantial benchmark gains over prior models under matched token budgets.

Core claim

Codec-stream tokenization treats compressed video as a continuous bit-cost stream in which bit-cost dynamics set adaptive temporal groups and motion-residual cues select salient spatial evidence into compact visual canvases; combined with a shared 3D RoPE that places codec canvases, sampled frames, and images inside one spatiotemporal coordinate system, the approach yields more stable long-video compression than fixed groups of pictures.
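One way to picture the shared coordinate system is that every visual token carries a (t, y, x) triple, whatever its source. The sketch below is illustrative only: the function names, the choice to place a still image at t = 0, and the rule that a canvas patch keeps the coordinates of the frame it was cut from are all assumptions, not the paper's implementation. It stops at position ids and does not show the rotary embedding itself.

```python
from typing import List, Tuple

Pos = Tuple[int, int, int]  # (t, y, x) position id for one visual token


def grid_positions(t: int, height: int, width: int) -> List[Pos]:
    """Position ids for a dense height x width patch grid at time index t."""
    return [(t, y, x) for y in range(height) for x in range(width)]


def canvas_positions(patches: List[Pos]) -> List[Pos]:
    """Assumed rule: packed canvas patches keep their source (t, y, x), in sorted order."""
    return sorted(patches)


image = grid_positions(t=0, height=2, width=2)      # assumption: still image sits at t = 0
frame = grid_positions(t=5, height=2, width=2)      # a sampled frame at time index 5
canvas = canvas_positions([(7, 1, 0), (6, 0, 1)])   # sparse patches taken from frames 6 and 7
tokens = image + frame + canvas                     # one sequence, one coordinate system
```

Under these assumptions the three input types need no separate positional scheme: a downstream rotary embedding would only ever see (t, y, x) triples.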

What carries the argument

Codec-stream tokenization: a bit-cost-driven allocation that forms adaptive temporal groups and motion-residual spatial canvases from compressed video to concentrate a fixed token budget on event content.
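The allocation idea can be made concrete with a toy sketch. It assumes per-frame bit costs and per-patch motion-residual energies are already available as plain numbers; the cumulative-budget rule, the top-k selection, and both function names are hypothetical stand-ins, since the abstract does not give the paper's actual grouping or selection rules.

```python
from typing import List


def adaptive_groups(bit_costs: List[int], budget: int) -> List[List[int]]:
    """Split frame indices into groups, closing a group once its summed bit cost reaches budget."""
    groups: List[List[int]] = []
    current: List[int] = []
    total = 0
    for idx, cost in enumerate(bit_costs):
        current.append(idx)
        total += cost
        if total >= budget:
            groups.append(current)
            current, total = [], 0
    if current:  # keep any trailing frames as a final, possibly under-budget group
        groups.append(current)
    return groups


def select_patches(residual_energy: List[float], k: int) -> List[int]:
    """Indices of the k patches with the largest residual energy, returned in spatial order."""
    ranked = sorted(range(len(residual_energy)),
                    key=lambda i: residual_energy[i], reverse=True)
    return sorted(ranked[:k])


# Made-up bit costs: frames 4 and 5 are expensive, so they form a short group.
costs = [900, 50, 40, 60, 700, 800, 30, 20]
groups = adaptive_groups(costs, budget=1000)
salient = select_patches([0.1, 2.0, 0.3, 1.5], k=2)
```

With these made-up numbers, the expensive frames 4 and 5 end up in a two-frame group while the cheaper run of frames 0 to 3 shares one group, which is the qualitative behavior the claim describes.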

If this is right

Under identical visual-token budgets the codec-stream method improves temporal grounding by 9.7 points over frame sampling.
The 8B model reaches 74.9 mAP on JumpScore, 44.8 points above Qwen3-VL-8B.
The same model records average gains of 4.3 points on video tasks, 5.3 points on spatial tasks, and 15.6 J&F on tracking tasks.
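"Identical visual-token budgets" means both input types spend the same number of visual tokens. As a rough illustration of the frame-sampling side of that comparison, the sketch below picks evenly spaced frames that fit a budget. The function name, the fixed tokens-per-frame cost, and the numbers are assumptions for illustration; the paper's baseline sampler may differ.

```python
from typing import List


def uniform_frame_indices(num_frames: int, token_budget: int,
                          tokens_per_frame: int) -> List[int]:
    """Evenly spaced frame indices whose total token count fits within token_budget."""
    n = min(num_frames, token_budget // tokens_per_frame)
    if n <= 0:
        return []
    step = num_frames / n
    # Take the midpoint of each of the n equal-length segments.
    return [int(step * i + step / 2) for i in range(n)]


# Hypothetical setting: 100 frames, 1024 tokens, 256 tokens per frame.
indices = uniform_frame_indices(num_frames=100, token_budget=1024, tokens_per_frame=256)
```

In this setting only four frames fit, so any event falling between them is invisible to the baseline; a codec-stream input with the same 1024 tokens would be free to spend them elsewhere.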

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bit-cost allocation rule could be applied to audio or sensor streams that also admit cheap compression.
A single 3D coordinate frame for images, video canvases, and point clouds may simplify multi-task training loops that currently require separate positional encodings.
If the adaptive grouping proves robust, it offers a route to process hour-long videos without increasing total token count.

Load-bearing premise

The large reported gains on JumpScore and other benchmarks arise from the codec-stream tokenization and unified 3D RoPE rather than from differences in training data scale, model size, or post-training choices.

What would settle it

Train or fine-tune a comparable 8B vision-language model on the same 8M video plus 4M spatial data but replace codec-stream inputs with standard uniform frame sampling, then measure whether JumpScore mAP falls by roughly 9–10 points or more.
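The experiment above amounts to two run configurations that differ in exactly one field. A minimal sketch of that constraint follows; the `RunConfig` fields and their string values are hypothetical labels, not settings taken from the paper.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class RunConfig:
    model_size: str
    video_corpus: str
    spatial_corpus: str
    schedule: str
    visual_input: str  # the only field the ablation is allowed to change


def differing_fields(a: RunConfig, b: RunConfig) -> List[str]:
    """Names of the fields on which two run configurations disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


codec_arm = RunConfig("8B", "8M re-captioned videos", "4M spatial samples",
                      "same-schedule", "codec_stream")
frame_arm = replace(codec_arm, visual_input="uniform_frames")
changed = differing_fields(codec_arm, frame_arm)
```

Checking that `changed` contains only `visual_input` before launching both arms is a cheap guard against data or schedule drift confounding the comparison.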

Original abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining, a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
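The abstract does not say how JumpScore mAP is computed. Temporal grounding metrics are commonly built on the temporal intersection-over-union between a predicted and a ground-truth interval, so the sketch below shows that building block only. It is a generic illustration, not JumpScore's scoring rule, and the function name is an assumption.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, truth: Interval) -> float:
    """Intersection over union of two time intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


overlap = temporal_iou((0.0, 2.0), (1.0, 3.0))   # 1 s shared out of a 3 s union
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))  # no shared time
```

For short, densely repeated motions the intervals are brief, so a small timing error removes a large share of the overlap, which is one reason this regime is hard.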

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Codec-stream tokenization is a practical idea for video token allocation, but the large reported gains rest on unmatched training data and lack isolating ablations.

LLaVA-OV-2 introduces codec-stream tokenization, which uses bit-cost dynamics and motion residuals to adaptively group and select visual tokens from compressed video, along with a new JumpScore benchmark for high-frequency motion grounding. The unified 3D RoPE for mixing canvases, frames, and images is a straightforward extension. These are legitimate engineering moves on top of the prior LLaVA-OneVision line.

The paper does a reasonable job explaining how the approach concentrates limited tokens on event content for longer videos, and the matched-token-budget comparison (+9.7 on temporal grounding) gives the cleanest signal that the tokenization itself helps rather than just scale. Reporting gains across video, spatial, and tracking tasks under the same model size is also useful context.

The soft spot is attribution. The headline 74.9 mAP on JumpScore versus 30.1 for Qwen3-VL-8B, and the overall +4.3 to +15.6 deltas, come with an 8M video pre-training set and 4M spatial fine-tuning set that are not matched in the baselines. No ablation holds data, schedule, and capacity fixed while swapping only the visual encoder or token allocator, so it is difficult to assign the deltas to codec-stream rather than curation or post-training choices. The abstract supplies no error bars or dataset construction rules either.

This is for people working on token-efficient video VLMs who want concrete implementation details on adaptive grouping. A reader already following the LLaVA series or compression-based vision models will extract the most value. It deserves peer review because the core mechanism is well-motivated and the benchmark targets an underrepresented regime, even though the empirical claims will require the missing controls to stand up.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLaVA-OneVision-2 (LLaVA-OV-2), a vision-language model extending the LLaVA-OneVision series. It features a native OneVision-Encoder with Windowed Attention for native-resolution processing, codec-stream tokenization that treats compressed video as a bit-cost stream with adaptive temporal grouping and motion-residual selection into compact canvases, and a shared 3D RoPE for unified spatiotemporal coordinates across codec canvases, frames, and images. The model is trained on ~8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning, introduces the JumpScore benchmark for fine-grained temporal localization in high-frequency repeated motions, and reports large gains including 74.9 mAP on JumpScore (vs. 30.1 for Qwen3-VL-8B), +9.7 points from codec-stream vs. frame sampling under matched token budgets, +4.3 on video tasks, +5.3 on spatial tasks, and +15.6 J&F on tracking.

Significance. If the reported gains can be isolated to the codec-stream tokenization and 3D RoPE rather than data-scale or curation differences, the work would meaningfully advance efficient long-video token allocation and unified spatiotemporal reasoning in VLMs. The JumpScore benchmark targets an underrepresented evaluation regime, and the emphasis on open-supervision data is a positive contribution if the ablations and controls are provided.

major comments (2)

[Abstract] The +9.7 point codec-stream advantage over frame sampling (and the overall +44.8 point JumpScore gap vs. Qwen3-VL-8B) is presented without any ablation that holds training data volume, curation, optimizer schedule, and post-training fixed while swapping only the visual token allocator; the 8M video pre-training corpus is described only at aggregate scale, so the deltas cannot be attributed to the claimed mechanisms.
[Abstract] No error bars, variance estimates, or statistical tests accompany the headline numbers (74.9 mAP, +9.7, +4.3, +5.3, +15.6); without them the cross-model and cross-input comparisons lack the precision needed to support the central claim of architectural superiority.
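As a rough illustration of what the second comment asks for, the sketch below reports a mean and sample standard deviation over repeated evaluation runs and applies a deliberately crude separation check. The scores are placeholders, not results from the paper, and the `k`-times-spread rule is an assumed heuristic rather than a proper statistical test.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated evaluation runs."""
    return mean(scores), stdev(scores)


def gap_exceeds_spread(a: List[float], b: List[float], k: float = 2.0) -> bool:
    """Crude check: is the mean gap larger than k times the summed standard deviations?"""
    (mean_a, std_a), (mean_b, std_b) = summarize(a), summarize(b)
    return abs(mean_a - mean_b) > k * (std_a + std_b)


# Placeholder scores for two hypothetical systems, NOT numbers from the paper.
runs_a = [70.0, 71.0, 72.0]
runs_b = [60.0, 61.0, 62.0]
summary_a = summarize(runs_a)
separated = gap_exceeds_spread(runs_a, runs_b)
```

A real revision would more likely report confidence intervals or a paired test over benchmark items; the point here is only that a bare delta carries no information about run-to-run variation.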

minor comments (2)

[Abstract] The phrase 'the most capable vision-language model in the LLaVA-OneVision series to date' is promotional and should be replaced by a concrete ranking or metric summary.
[Abstract] 'J&F' on tracking tasks is undefined; a brief parenthetical or reference is needed for readers outside the immediate subfield.
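For readers outside the subfield: in video object segmentation benchmarks such as DAVIS, J&F is the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below computes J from pixel sets and averages it with an F value supplied by the caller; the boundary matching needed for F is omitted. Whether the paper uses exactly this definition is an assumption, since only the abstract is available, and the function names and example values are made up.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, column) of a mask pixel


def region_similarity(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """J: intersection over union of predicted and ground-truth mask pixels."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


pred_mask = {(0, 0), (0, 1), (1, 0)}
truth_mask = {(0, 1), (1, 0), (1, 1)}
j_score = region_similarity(pred_mask, truth_mask)  # 2 shared pixels, 4 in the union
combined = j_and_f(j_score, 0.7)                    # 0.7 is a made-up F value
```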

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the need for stronger isolation of contributions and statistical rigor. We respond point-by-point to the major comments below.

Referee: [Abstract] The +9.7 point codec-stream advantage over frame sampling (and the overall +44.8 point JumpScore gap vs. Qwen3-VL-8B) is presented without any ablation that holds training data volume, curation, optimizer schedule, and post-training fixed while swapping only the visual token allocator; the 8M video pre-training corpus is described only at aggregate scale, so the deltas cannot be attributed to the claimed mechanisms.

Authors: We agree that the strongest attribution would require a fully controlled ablation holding data volume, curation, optimizer, and post-training fixed while varying only the token allocator. The reported +9.7 point gain is obtained under matched visual-token budgets on JumpScore, which isolates the allocator effect at inference time but does not control the full upstream training pipeline. The 8M corpus is described at aggregate scale in the current text. In revision we will expand the data section with additional curation details and add a controlled ablation comparing codec-stream versus frame sampling under as many fixed variables as feasible.

revision: partial

Referee: [Abstract] No error bars, variance estimates, or statistical tests accompany the headline numbers (74.9 mAP, +9.7, +4.3, +5.3, +15.6); without them the cross-model and cross-input comparisons lack the precision needed to support the central claim of architectural superiority.

Authors: We concur that error bars and statistical tests would improve the precision of the reported comparisons. The current manuscript does not include them. We will add standard deviations from repeated evaluations and appropriate statistical notes for the headline metrics in the revised version.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks and a reported matched-budget comparison

The paper reports performance numbers on JumpScore and other benchmarks, attributes gains to codec-stream tokenization and 3D RoPE under matched token budgets, and cites an 8M/4M data corpus. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims are falsifiable against external models and benchmarks rather than reducing to the paper's own inputs by construction. This is the expected non-finding for an empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, parameter lists, or assumption statements are provided, so the ledger cannot be populated with specific free parameters, axioms, or invented entities.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Benchmarking Visual State Tracking in Multimodal Video Understanding
cs.CV 2026-06 unverdicted novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
cs.CV 2026-06 unverdicted novelty 6.0

MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 2 Pith papers · 23 internal anchors

[1]

SPARROW: Learning spatial precision and temporal referential consistency in pixel-grounded video MLLMs.arXiv:2603.12382,

Mohamad Alansari et al. SPARROW: Learning spatial precision and temporal referential consistency in pixel-grounded video MLLMs.arXiv:2603.12382,

work page arXiv
[2]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv:2509.23661,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

Seunghwan Bang and Hwanjun Song. Reasoning over video: Evaluating how MLLMs extract, integrate, and reconstruct spatiotemporal evidence.arXiv:2603.13091,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv:2210.09461,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Spatialbot: Precise spatial understanding with vision language models

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. SpatialBot: Precise spatial understanding with vision language models.arXiv:2406.13642,

work page arXiv
[7]

Think with grounding: Curriculum reinforced reasoning with video grounding for long video understanding.arXiv:2602.18702, 2026a

Houlun Chen et al. Think with grounding: Curriculum reinforced reasoning with video grounding for long video understanding.arXiv:2602.18702, 2026a. Jieneng Chen et al. Thinking with spatial code for physical-world video reasoning.arXiv:2603.05591, 2026b. Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Li...

work page arXiv
[8]

GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

26 Zixu Cheng et al. GraphThinker: Reinforcing temporally grounded video reasoning with event graph thinking. arXiv:2602.17555,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv:2601.10611,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvon...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Junjie Fei et al. Small vision-language models are smart compressors for long video understanding.arXiv:2604.08120,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs.arXiv:2503.21776,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Haozhi Yuan, et al. Video-MME-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv:2604.05015,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Video streaming thinking: VideoLLMs can watch and think simultaneously.arXiv:2603.12262,

Yiran Guan et al. Video streaming thinking: VideoLLMs can watch and think simultaneously.arXiv:2603.12262,

work page arXiv
[15]

Trace: Temporal grounding video llm via causal event modeling,

Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Xu Chen, and Bo Zhao. TRACE: Temporal grounding video LLM via causal event modeling.arXiv:2410.05643, 2024a. Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Xi Liu, and Xu Chen. VTG-LLM: Integrating timestamp knowledge into video LLMs for enhanced video temporal grounding.arXiv:2405.13382, 2024b. Xiaoc...

work page arXiv
[16]

VTimeLLM: Empower LLM to grasp video moments

Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. VTimeLLM: Empower LLM to grasp video moments. InCVPR, 2024a. De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. LITA: Language instructed temporal-localization assistant. InECCV, 2024b. Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui,...

work page arXiv
[17]

STORM: Token-efficient long video understanding for multimodal LLMs.arXiv:2503.04130,

Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Ding, Yunsheng Yang, Yu Zhu, Yu Bao, Hongxu Yin, Yao Lu, Song Han, et al. STORM: Token-efficient long video understanding for multimodal LLMs.arXiv:2503.04130,

work page arXiv
[18]

AgentRVOS: Reasoning over object tracks for zero-shot referring video object segmentation

Woojeong Jin et al. AgentRVOS: Reasoning over object tracks for zero-shot referring video object segmentation. arXiv:2603.23489, 2026a. Xin Jin et al. Compression tells intelligence: Visual coding, visual token technology, and the unification. arXiv:2601.20742, 2026b. 27 Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yul...

work page arXiv
[19]

LLaVA-OneVision: Easy visual task transfer.TMLR, 2025a

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer.TMLR, 2025a. Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of-experts model.arXiv:2...

work page arXiv
[20]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv:2305.06355,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling.arXiv:2501.00574, 2025b. Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. VideoChat-R1: Enha...

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Video-XL-Pro: Reconstructive token compression for extremely long video understanding.arXiv:2503.18478,

Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, and Bo Zhao. Video-XL-Pro: Reconstructive token compression for extremely long video understanding.arXiv:2503.18478,

work page arXiv
[23]

Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024c

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx MLLM: On-demand spatial- temporal understanding at arbitrary resolution.arXiv:2409.12961, 2024a. Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cai, Yuxian Han, Xiuyu Xu, et al. NVILA: Efficient frontier visual language models.a...

work page arXiv
[24]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Bai, Yuxin Hu, Lu Hou, Mingxiao Zhou, and Maosong Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning.arXiv:2504.01805,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Videomolmo: Spatio-temporal grounding meets pointing.arXiv preprint arXiv:2506.05336, 2025

Hanoona Rasheed, Abdelrahman Shaker, Mahmoud Wajahat, Muhammad Maaz, Tianzhu Hu, Hisham Cholakkal, Salman Khan, and Fahad Shahbaz Khan. VideoMolmo: Spatio-temporal grounding meets pointing.arXiv:2506.05336,

work page arXiv
[26]

Available: https://arxiv.org/abs/2312.02051

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. TimeChat: A time-sensitive multimodal large language model for long video understanding.arXiv:2312.02051,

work page arXiv
[27]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

28 Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding.arXiv:2410.17434,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

A simple baseline for streaming video understanding.arXiv:2604.02317,

Yujiao Shen et al. A simple baseline for streaming video understanding.arXiv:2604.02317,

work page arXiv
[29]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025a. Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Lia...

work page arXiv
[30]

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

Jiafei Song et al. EvoComp: Learning visual token compression for multimodal large language models via semantic- guided evolutionary labeling.arXiv:2604.17087,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

OneVision-Encoder: Codec-aligned sparsity as a foundational principle for multimodal intelligence.arXiv:2602.08683,

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, and Jiankang Deng. OneVision-Encoder: Codec-aligned sparsity as a foundational principle for multimodal intelligence.arXiv:2602.08683,

work page arXiv
[32]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arX...

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models.arXiv preprint arXiv:2410.03290, 2024

Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, and Lifu Huang. Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models.arXiv:2410.03290, 2024a. Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, K...

work page arXiv
[34]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, and Jiangmiao Pang. EmbodiedScan: A holistic multi-modal 3D perception suite towards embodied AI. InCVPR, 2024c. Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Lingl...

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Slow-fast architecture for video multi-modal large language models.arXiv:2504.01328, 2025a

Mingze Xu, Mingfei Gao, Shiyu Zhou, Jiasen Xiao, Yinglu Niu, Joseph Garcia, Leonid Sigal, Yu Zhang, Bo Pang, Soufiane Belharbi, et al. Slow-fast architecture for video multi-modal large language models.arXiv:2504.01328, 2025a. Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, and Gaoang Wang. Auroralong: Bringing rnns back to efficient open-ended ...

work page arXiv
[36]

S-GRPO: Unified Post-Training for Large Vision-Language Models

Yuming Yan, Kai Tang, Sihong Chen, Ke Xu, Dan Hu, Qun Yu, and Pengfei Hu. S-GRPO: Unified post-training for large vision-language models.arXiv:2604.16557,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

arXiv preprint arXiv:2509.01563 (2025)

Biao Yang, Bin Wen, Boyang Ding, et al. Kwai keye-vl 1.5 technical report.arXiv:2509.01563, 2025a. Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces.arXiv:2412.14171,

work page arXiv
[38]

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. MMSI-Bench: A benchmark for multi-image spatial intelligence.arXiv:2505.23764, 2025b. Shusheng Yang, Jihan Yang, Ellis Brown, Shengbang Tong, Boyang Liang, Xichen Pan, Ziteng Wang, Adithya Iyer, Sai Charitha Akula, Penghao Wu, Ro...

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv:2501.04001, 2025a. Jiangye Yuan et al. Boosting MLLM spatial reasoning with geometrically referenced 3D scene representations. arXiv:2603.08592,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

arXiv preprint arXiv:2501.07888 (2025)

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding.arXiv:2501.07888, 2025b. Bowen Zeng et al. HybridKV: Hybrid KV cache compression for efficient multimodal large language model inference. arXiv:2604.05887,

work page arXiv
[41]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding.arXiv:2501.13106, 2025a. Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, et al. Penguin-vl: Exploring the...

work page internal anchor Pith review Pith/arXiv arXiv
[42]

ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Yiming Zhang et al. ReVSI: Rebuilding visual spatial intelligence evaluation for accurate assessment of VLM 3D reasoning.arXiv:2604.24300, 2026d. Zheyu Zhang et al. One token per highly selective frame: Towards extreme compression for long video understanding. arXiv:2604.14149, 2026e. Zijia Zhao, Yuqi Huo, Tongtian Yue, Longteng Guo, Haoyu Lu, Bingning Wa...

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Roborefer: Towards spatial referring with reasoning in vision-language models for robotics

Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Zheng, Tiejun Huang, Lu Sheng, and Shanghang Zhang. RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv:2506.04308,

work page arXiv
[44]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenw...

[45]

Apollo: An exploration of video understanding in large multimodal models

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, et al. Apollo: An exploration of video understanding in large multimodal models. arXiv:2412.10360,

