arxiv: 2606.07639 · v1 · pith:FZIN6D5Inew · submitted 2026-06-01 · 💻 cs.CV · cs.AI

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

Pith reviewed 2026-06-28 15:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords real-time video understanding · cross-attention · vision-language fusion · multimodal models · data synthesis pipeline · answer revision · non-blocking perception

The pith

A cross-attention backbone lets visual features enter through a side channel so perception and generation run on separate non-blocking pathways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that real-time video understanding requires perception to proceed without waiting for text generation to finish. Its proposed solution routes visual features into a cross-attention side channel instead of the main autoregressive token sequence. This separation lowers how often visual tokens must be processed and creates an explicit interface for compressing the vision stream on its own. The authors also introduce a data pipeline that rewrites dense captions into question-answer pairs whose answers update only when new frames arrive, then fine-tune an existing model on those pairs to produce continuous perception, answer revision, and timely silence. On one GPU the resulting model delivers a 5x reduction in time-to-first-token and 2.7x higher decoding throughput with little loss on offline benchmarks.

Core claim

Perception must not be blocked by generation; its natural realization is a two-channel architecture in which visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways, reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression.

What carries the argument

Cross-attention backbone that routes visual features through a side channel separate from the autoregressive text sequence.
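
As a rough illustration of the side-channel idea, the sketch below lets each text state query a separate list of visual features and adds the result back as a residual. It is a minimal sketch and not the paper's implementation: the function names `softmax` and `cross_attend` are hypothetical, the learned query, key, and value projections are replaced by the identity, and there is a single head with no gating.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_states, visual_feats):
    """Fuse visual features into text states through a side channel.

    Assumption: identity projections and one head, purely for illustration.
    The visual features are read as keys and values; they are never appended
    to the text sequence, so the output has one row per text state.
    """
    if not visual_feats:
        # Nothing perceived yet: the text states pass through unchanged.
        return [list(q) for q in text_states]
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in text_states:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in visual_feats]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, visual_feats)) for i in range(dim)]
        fused.append([qi + ci for qi, ci in zip(q, context)])  # residual add
    return fused


text_states = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = cross_attend(text_states, visual_feats)
print(len(fused))  # the text sequence still has 2 positions
```

The point of the sketch is only that adding more visual features changes what the text states attend to, not how long the autoregressive sequence is.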

If this is right

  • Visual processing frequency drops because frames no longer enter the main sequence.
  • A clean channel-wise interface appears that supports independent compression of the vision stream.
  • The model acquires behaviors absent from offline models: continuous perception, answer revision on new evidence, and timely silence.
  • Time to first token drops by roughly 5x and decoding throughput rises by 2.7x on a single H200 with 256-frame inputs.
  • Offline video and multimodal understanding remain competitive with strong decoder-only baselines.
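
For readers who want to check figures of this kind, the sketch below shows one common way to derive time to first token and decoding throughput from per-token timestamps. The helper names `latency_metrics` and `speedup` are hypothetical, and the definitions used here (throughput counted over tokens after the first) are an assumption; the paper may measure these quantities differently.

```python
def latency_metrics(request_time, token_times):
    """Return (time_to_first_token, decoding_throughput).

    request_time: when the request was issued, in seconds.
    token_times:  emission time of each generated token, in seconds.
    Throughput is tokens per second over the decoding phase only, that is,
    the tokens produced after the first one (an assumed convention).
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_time
    decode_span = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    throughput = decoded / decode_span if decode_span > 0 else float("nan")
    return ttft, throughput


def speedup(baseline_seconds, candidate_seconds):
    """Ratio by which the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


ttft, tokens_per_second = latency_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
print(ttft, tokens_per_second)
```

The timestamps in the example are made up for illustration. In practice they would come from a monotonic clock such as `time.perf_counter`, recorded on the same hardware and with the same frame count for both models being compared.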

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same side-channel design could be applied to audio or other streaming modalities without retraining the language backbone.
  • Independent compression opens the possibility of running the vision encoder at a lower rate or on a separate device.
  • Real-time revision behavior may transfer to live camera feeds or interactive agents once the data pipeline is adapted.

Load-bearing premise

Converting dense captions into real-time QA pairs whose answers are revised to match only what the model has perceived so far will produce genuine real-time behavior when an offline model is specialized on them.
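
One way to picture this premise is the sketch below, which turns timestamped caption segments into per-frame training targets: an answer is built only from segments that have fully elapsed, it is emitted only when it changes, and every other step is a silence target. This is a sketch under stated assumptions and not the authors' pipeline. The `Caption` type, the `streaming_targets` function, the `answer_fn` callback, and the use of `None` for silence are all hypothetical.

```python
from dataclasses import dataclass

SILENCE = None  # hypothetical marker for "say nothing at this step"


@dataclass(frozen=True)
class Caption:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str


def streaming_targets(captions, frame_times, answer_fn):
    """Build (time, target) pairs that never use future captions.

    answer_fn is a stand-in for whatever writes an answer from the captions
    seen so far; here it is any callable that takes a list of Caption.
    """
    targets = []
    last_answer = SILENCE
    for t in frame_times:
        seen = [c for c in captions if c.end <= t]  # only fully elapsed segments
        answer = answer_fn(seen) if seen else SILENCE
        if answer == last_answer:
            targets.append((t, SILENCE))   # no new evidence: stay silent
        else:
            targets.append((t, answer))    # new evidence: revise the answer
            last_answer = answer
    return targets


captions = [
    Caption(0.0, 2.0, "a man picks up a cup"),
    Caption(2.0, 4.0, "he drinks from it"),
]
targets = streaming_targets(
    captions,
    frame_times=[1.0, 2.0, 3.0, 4.0],
    answer_fn=lambda seen: "; ".join(c.text for c in seen),
)
for t, target in targets:
    print(t, target)
```

In this toy run the targets at 1.0 s and 3.0 s are silence, the target at 2.0 s is the first answer, and the target at 4.0 s is a revision. The referee's concern about leakage corresponds to the filter on `c.end <= t`: if captions written with knowledge of later frames pass that filter, the targets can still encode future information.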

What would settle it

A controlled run in which the fine-tuned model either processes every new frame at the same rate as token generation or fails to revise an answer once contradictory frames arrive.

Original abstract

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents MOSS-Video-Preview, a two-channel cross-attention architecture for real-time video understanding. It argues that visual features should enter via a side channel rather than the autoregressive sequence, enabling non-blocking perception and generation pathways that reduce visual processing frequency and allow independent compression. A data synthesis pipeline converts dense captions into real-time QA (with answers revised to match perceived frames so far) to specialize an offline model and elicit behaviors such as continuous perception, answer revision, and timely silence. The model trails Qwen2.5-VL-7B overall (gap attributed to data/scale) but achieves competitive offline performance, remains robust on spatial and fine-grained temporal reasoning, and delivers ~5x TTFT speedup and 2.7x decoding throughput on a single H200 with 256 frames per video.

Significance. If the architecture and synthesis approach hold, the work outlines a concrete path to efficient real-time vision-language models by separating perception and generation channels, with reported speedups and modularity benefits for compression. The explicit two-channel design and the attempt to induce streaming behaviors from offline pretraining are notable contributions, though the absence of detailed quantitative results, error analysis, or ablations limits the strength of the evidence for the central paradigm claim.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (data synthesis pipeline): the behavioral claims (continuous perception, answer revision, timely silence) rest on the synthesis step that revises answers to match perceived frames so far, yet no ablation or comparison to standard fine-tuning is reported to show that this elicits genuine incremental evidence handling under streaming uncertainty rather than pattern-matching from dense captions that contain future information.
  2. [Abstract] Abstract: the claim that the cross-attention design is 'better suited to real-time vision-language fusion' is undercut by the model trailing the Qwen2.5-VL-7B baseline overall, with the performance gap attributed to data and scale without isolating the architecture's contribution via controlled experiments.
  3. [Abstract] Abstract: while 5x TTFT and 2.7x throughput gains are reported, no detailed quantitative results, error analysis, or per-task breakdowns are supplied to support that the model 'remains robust on the spatial and fine-grained temporal reasoning central to real-time use.'
minor comments (2)
  1. [Methods] Notation for the two-channel cross-attention (visual side channel vs. language autoregressive path) should be formalized with equations in the methods section for reproducibility.
  2. [Experiments] The manuscript would benefit from a table comparing real-time behaviors (revision frequency, silence rate) against decoder-only baselines on the same synthetic QA.
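
The behavior table requested in minor comment 2 needs concrete metric definitions. The sketch below shows one possible pair, computed from a model's per-step outputs. The function name `behavior_stats` and both definitions are assumptions made for illustration; neither the paper nor the report defines revision frequency or silence rate.

```python
def behavior_stats(outputs):
    """Summarize streaming behavior from per-step outputs.

    outputs: one entry per perception step; None means the model stayed silent.
    silence_rate:  fraction of steps with no output.
    revision_rate: fraction of steps whose output differs from the previous
                   non-silent output (the first answer is not a revision).
    """
    steps = len(outputs)
    if steps == 0:
        raise ValueError("need at least one step")
    silent = sum(1 for o in outputs if o is None)
    revisions = 0
    last = None
    for o in outputs:
        if o is None:
            continue
        if last is not None and o != last:
            revisions += 1
        last = o
    return {"silence_rate": silent / steps, "revision_rate": revisions / steps}


stats = behavior_stats([None, "a cup", None, "a cup, now empty"])
print(stats)
```

Running the same function over a decoder-only baseline and the cross-attention model on identical synthetic QA would give the side-by-side comparison the comment asks for, provided both models are stepped at the same frame times.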

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below with clarifications on our contributions and indicate where revisions to the manuscript are planned.

  1. Referee: [Abstract and §3] Abstract and §3 (data synthesis pipeline): the behavioral claims (continuous perception, answer revision, timely silence) rest on the synthesis step that revises answers to match perceived frames so far, yet no ablation or comparison to standard fine-tuning is reported to show that this elicits genuine incremental evidence handling under streaming uncertainty rather than pattern-matching from dense captions that contain future information.

    Authors: The synthesis pipeline explicitly constructs QA pairs by revising answers to align only with frames perceived up to the current timestep, a step that standard fine-tuning on full dense captions does not perform. This design targets incremental reasoning under partial information. We agree that an explicit ablation against unmodified fine-tuning would provide additional support and will add a discussion of this distinction in the revised manuscript, along with any feasible comparative results. revision: partial

  2. Referee: [Abstract] Abstract: the claim that the cross-attention design is 'better suited to real-time vision-language fusion' is undercut by the model trailing the Qwen2.5-VL-7B baseline overall, with the performance gap attributed to data and scale without isolating the architecture's contribution via controlled experiments.

    Authors: The suitability claim rests on the architectural separation of perception and generation channels, which directly enables the non-blocking pathways and the measured efficiency gains (5x TTFT, 2.7x throughput). The overall accuracy comparison is to a larger-scale model trained under different data regimes; we attribute the gap primarily to those factors rather than the backbone. A controlled same-data, same-scale isolation experiment is computationally prohibitive at this stage and is noted as future work, but the paper supplies the design rationale and concrete efficiency evidence. revision: no

  3. Referee: [Abstract] Abstract: while 5x TTFT and 2.7x throughput gains are reported, no detailed quantitative results, error analysis, or per-task breakdowns are supplied to support that the model 'remains robust on the spatial and fine-grained temporal reasoning central to real-time use.'

    Authors: The abstract condenses the offline evaluation results reported in the main body, which include competitive performance on spatial and temporal tasks. We concur that expanded per-task breakdowns and error analysis would strengthen the robustness statement and will incorporate additional quantitative details and breakdowns from our existing experiments in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture motivated by independent design arguments; data synthesis is a separate training method


The paper's central derivation is a design argument that cross-attention enables non-blocking perception-generation pathways via a side channel, stated directly in the abstract and introduction without reference to fitted quantities or self-referential equations. The data synthesis pipeline that converts dense captions into revised QA is presented as an empirical complement to elicit behaviors, not as a prediction that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. Performance claims are benchmarked externally against Qwen2.5-VL-7B and attributed to data/scale differences, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claims rest on the effectiveness of separating perception and generation via cross-attention and on the data synthesis method producing genuine real-time capabilities; no free parameters, additional axioms, or invented entities beyond the architecture itself are detailed.

axioms (1)
  • domain assumption: Cross-attention can effectively fuse vision and language in a non-blocking manner for real-time tasks.
    Invoked as the basis for preferring cross-attention over decoder-only designs.
invented entities (1)
  • Two-channel cross-attention architecture for real-time video (no independent evidence)
    purpose: To enable separate non-blocking pathways for perception and generation.
    Newly proposed as the core architectural solution.

pith-pipeline@v0.9.1-grok · 5927 in / 1387 out tokens · 38532 ms · 2026-06-28T15:20:33.274335+00:00 · methodology

