LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
Pith reviewed 2026-06-29 22:10 UTC · model grok-4.3
The pith
LLaVA-OneVision-2 uses codec-stream tokenization to allocate limited visual tokens to motion-bearing video content via bit-cost dynamics and motion residuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Codec-stream tokenization treats compressed video as a continuous bit-cost stream in which bit-cost dynamics set adaptive temporal groups and motion-residual cues select salient spatial evidence into compact visual canvases; combined with a shared 3D RoPE that places codec canvases, sampled frames, and images inside one spatiotemporal coordinate system, the approach yields more stable long-video compression than fixed groups of pictures.
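One way to picture the shared coordinate system is to give every visual patch a (t, y, x) index regardless of its source. The sketch below is illustrative only: the function name `grid_coords`, the grid sizes, and the time indices are assumptions rather than details from the paper, and the rotary rotation itself is omitted.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]  # (t, y, x): time index, patch row, patch column


def grid_coords(t: int, height: int, width: int) -> List[Coord]:
    """Return one (t, y, x) coordinate per patch of a height x width patch grid."""
    return [(t, y, x) for y in range(height) for x in range(width)]


# Hypothetical inputs: all three sources share the same coordinate axes.
image = grid_coords(t=0, height=2, width=2)    # a still image sits at t = 0
frame = grid_coords(t=5, height=2, width=2)    # a sampled frame at time index 5
canvas = grid_coords(t=7, height=1, width=3)   # a compact codec canvas at time index 7

# Tokens from every source can be concatenated into one position list.
positions = image + frame + canvas
```

Because every token carries the same three-part index, a single positional encoding could in principle consume `positions` without source-specific branches; how the paper actually assigns time indices to canvases is not stated in the text above.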
What carries the argument
Codec-stream tokenization: a bit-cost-driven allocation that forms adaptive temporal groups and motion-residual spatial canvases from compressed video to concentrate a fixed token budget on event content.
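The idea of bit-cost-driven temporal grouping can be sketched with a simple cumulative-threshold rule. This is not the paper's algorithm: the function `adaptive_groups`, the `cost_per_group` parameter, and the per-frame costs are all hypothetical, chosen only to show how high-cost (motion-heavy) frames end up in short groups while low-cost (static) frames are absorbed into longer ones.

```python
from typing import List


def adaptive_groups(bit_costs: List[int], cost_per_group: int) -> List[List[int]]:
    """Split frame indices into consecutive groups.

    A group is closed as soon as its summed bit cost reaches cost_per_group;
    the final group may fall short of the threshold.
    """
    if cost_per_group <= 0:
        raise ValueError("cost_per_group must be positive")
    groups: List[List[int]] = []
    current: List[int] = []
    running = 0
    for idx, cost in enumerate(bit_costs):
        current.append(idx)
        running += cost
        if running >= cost_per_group:
            groups.append(current)
            current, running = [], 0
    if current:  # keep any trailing frames that never reached the threshold
        groups.append(current)
    return groups


# Made-up per-frame bit costs: frames 0, 5, 6, 7 are expensive, the rest are cheap.
costs = [120, 4, 3, 5, 4, 90, 110, 95, 3, 2]
groups = adaptive_groups(costs, cost_per_group=100)
print(groups)  # [[0], [1, 2, 3, 4, 5], [6], [7, 8, 9]]
```

With a fixed group of pictures the ten frames would be cut at regular intervals regardless of content; under this assumed rule the group boundaries follow the bit cost instead.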
If this is right
- Under identical visual-token budgets on JumpScore, the codec-stream method improves temporal grounding by 9.7 points over frame sampling.
- The 8B model reaches 74.9 mAP on JumpScore, 44.8 points above Qwen3-VL-8B (30.1).
- Relative to Qwen3-VL-8B, the same model records average gains of 4.3 points on video tasks, 5.3 points on spatial tasks, and 15.6 J&F on tracking tasks.
Where Pith is reading between the lines
- The bit-cost allocation rule could be applied to audio or sensor streams that also admit cheap compression.
- A single 3D coordinate frame for images, video canvases, and point clouds may simplify multi-task training loops that currently require separate positional encodings.
- If the adaptive grouping proves robust, it offers a route to process hour-long videos without increasing total token count.
Load-bearing premise
The large reported gains on JumpScore and other benchmarks arise from the codec-stream tokenization and unified 3D RoPE rather than from differences in training data scale, model size, or post-training choices.
What would settle it
Train or fine-tune a comparable 8B vision-language model on the same 8M video plus 4M spatial data but replace codec-stream inputs with standard uniform frame sampling, then measure whether JumpScore mAP falls by roughly 9–10 points or more.
Original abstract
We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining, a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaVA-OneVision-2 (LLaVA-OV-2), a vision-language model extending the LLaVA-OneVision series. It features a native OneVision-Encoder with Windowed Attention for native-resolution processing, codec-stream tokenization that treats compressed video as a bit-cost stream with adaptive temporal grouping and motion-residual selection into compact canvases, and a shared 3D RoPE for unified spatiotemporal coordinates across codec canvases, frames, and images. The model is trained on ~8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning, introduces the JumpScore benchmark for fine-grained temporal localization in high-frequency repeated motions, and reports large gains including 74.9 mAP on JumpScore (vs. 30.1 for Qwen3-VL-8B), +9.7 points from codec-stream vs. frame sampling under matched token budgets, +4.3 on video tasks, +5.3 on spatial tasks, and +15.6 J&F on tracking.
Significance. If the reported gains can be isolated to the codec-stream tokenization and 3D RoPE rather than data-scale or curation differences, the work would meaningfully advance efficient long-video token allocation and unified spatiotemporal reasoning in VLMs. The JumpScore benchmark targets an underrepresented evaluation regime, and the emphasis on open-supervision data is a positive contribution if the ablations and controls are provided.
Major comments (2)
- [Abstract] The +9.7 point codec-stream advantage over frame sampling (and the overall +44.8 point JumpScore gap vs. Qwen3-VL-8B) is presented without any ablation that holds training data volume, curation, optimizer schedule, and post-training fixed while swapping only the visual token allocator; the 8M video pre-training corpus is described only at aggregate scale, so the deltas cannot be attributed to the claimed mechanisms.
- [Abstract] No error bars, variance estimates, or statistical tests accompany the headline numbers (74.9 mAP, +9.7, +4.3, +5.3, +15.6); without them the cross-model and cross-input comparisons lack the precision needed to support the central claim of architectural superiority.
Minor comments (2)
- [Abstract] The phrase 'the most capable vision-language model in the LLaVA-OneVision series to date' is promotional and should be replaced by a concrete ranking or metric summary.
- [Abstract] 'J&F' on tracking tasks is undefined; a brief parenthetical or reference is needed for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the need for stronger isolation of contributions and statistical rigor. We respond point-by-point to the major comments below.
- Referee: [Abstract] The +9.7 point codec-stream advantage over frame sampling (and the overall +44.8 point JumpScore gap vs. Qwen3-VL-8B) is presented without any ablation that holds training data volume, curation, optimizer schedule, and post-training fixed while swapping only the visual token allocator; the 8M video pre-training corpus is described only at aggregate scale, so the deltas cannot be attributed to the claimed mechanisms.
  Authors: We agree that the strongest attribution would require a fully controlled ablation holding data volume, curation, optimizer, and post-training fixed while varying only the token allocator. The reported +9.7 point gain is obtained under matched visual-token budgets on JumpScore, which isolates the allocator effect at inference time but does not control the full upstream training pipeline. The 8M corpus is described at aggregate scale in the current text. In revision we will expand the data section with additional curation details and add a controlled ablation comparing codec-stream versus frame sampling under as many fixed variables as feasible. revision: partial
- Referee: [Abstract] No error bars, variance estimates, or statistical tests accompany the headline numbers (74.9 mAP, +9.7, +4.3, +5.3, +15.6); without them the cross-model and cross-input comparisons lack the precision needed to support the central claim of architectural superiority.
  Authors: We concur that error bars and statistical tests would improve the precision of the reported comparisons. The current manuscript does not include them. We will add standard deviations from repeated evaluations and appropriate statistical notes for the headline metrics in the revised version. revision: yes
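The kind of error bar discussed in the exchange above can be computed with nothing more than the standard library. The sketch below is a minimal illustration: the helper `summarize` is hypothetical, and the run scores are placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated evaluation runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder scores from three imagined evaluation runs (NOT from the paper).
runs = [70.0, 72.0, 74.0]
mean, std = summarize(runs)
print(f"{mean:.1f} +/- {std:.1f}")  # 72.0 +/- 2.0
```

`statistics.stdev` uses the sample (n - 1) estimator, which is the usual choice when only a handful of repeated runs is available.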
Circularity Check
No circularity: empirical results rest on external benchmarks and reported ablations
The paper reports performance numbers on JumpScore and other benchmarks, attributes gains to codec-stream tokenization and 3D RoPE under matched token budgets, and cites an 8M/4M data corpus. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims are falsifiable against external models and benchmarks rather than reducing to the paper's own inputs by construction. This is the expected non-finding for an empirical architecture paper.
Forward citations
Cited by 2 Pith papers
- Benchmarking Visual State Tracking in Multimodal Video Understanding
  VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
- MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
  MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.