MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

arxiv: 2606.07639 · v1 · pith:FZIN6D5Inew · submitted 2026-06-01 · 💻 cs.CV · cs.AI

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

Pith reviewed 2026-06-28 15:20 UTC · model grok-4.3

keywords real-time video understanding · cross-attention · vision-language fusion · multimodal models · data synthesis pipeline · answer revision · non-blocking perception

The pith

A cross-attention backbone lets visual features enter through a side channel so perception and generation run on separate non-blocking pathways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that real-time video understanding requires perception to proceed without waiting for text generation to finish. Its proposed solution routes visual features into a cross-attention side channel instead of the main autoregressive token sequence. This separation lowers how often visual tokens must be processed and creates an explicit interface for compressing the vision stream on its own. The authors also introduce a data pipeline that rewrites dense captions into question-answer pairs whose answers are revised to match only what the model has perceived so far, then specialize an existing offline model on those pairs to elicit continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, the resulting model delivers about a 5x reduction in time to first token and 2.7x higher decoding throughput, with negligible degradation on offline benchmarks.
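The non-blocking idea can be pictured as two cooperating loops that share a visual memory. The sketch below is a scheduling analogy only, not the paper's implementation: `perceive`, `generate`, and the list-based `memory` are hypothetical names, and string placeholders stand in for frame features.

```python
import asyncio

async def perceive(memory, n_frames):
    """Perception pathway: add one placeholder frame feature per scheduling turn."""
    for i in range(n_frames):
        memory.append(f"frame-{i}")
        await asyncio.sleep(0)  # yield control; never waits for generation to finish

async def generate(memory, n_tokens, seen_log):
    """Generation pathway: each decoding step reads whatever has been perceived so far."""
    for _ in range(n_tokens):
        seen_log.append(len(memory))  # how many frames this step could attend to
        await asyncio.sleep(0)

async def run(n_frames=5, n_tokens=5):
    memory, seen_log = [], []
    # Both pathways run concurrently; neither blocks on the other completing.
    await asyncio.gather(perceive(memory, n_frames),
                         generate(memory, n_tokens, seen_log))
    return memory, seen_log

memory, seen_log = asyncio.run(run())
print(len(memory), seen_log)
```

Later decoding steps see more frames than earlier ones, which is the behavior an offline model, fed the whole video before it answers, does not exhibit.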

Core claim

Perception must not be blocked by generation; its natural realization is a two-channel architecture in which visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways, reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression.

What carries the argument

Cross-attention backbone that routes visual features through a side channel separate from the autoregressive text sequence.
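A minimal NumPy sketch of what such a side channel looks like, under strong simplifying assumptions: a single head, identity projections, no normalization or feed-forward sublayer. The function names are hypothetical and this is a generic cross-attention layer, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attention is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def two_channel_layer(text, visual):
    """text: (T, d) token states; visual: (N, d) features held in the side channel."""
    T = text.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    h = text + attention(text, text, text, causal)  # autoregressive text pathway
    h = h + attention(h, visual, visual)            # visual features enter as keys/values only
    return h

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
visual = rng.normal(size=(6, 8))
out = two_channel_layer(text, visual)

# Adding frames grows the side-channel memory, not the text sequence.
more_visual = np.concatenate([visual, rng.normal(size=(10, 8))])
out_more = two_channel_layer(text, more_visual)
print(out.shape, out_more.shape)
```

The output keeps the text sequence length regardless of how many visual features are supplied, which is the property the side-channel argument relies on.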

If this is right

Visual processing frequency drops because frames no longer enter the main sequence.
A clean channel-wise interface appears that supports independent compression of the vision stream.
The model acquires behaviors absent from offline models: continuous perception, answer revision on new evidence, and timely silence.
Time to first token drops by roughly 5x and decoding throughput rises by 2.7x on a single H200 with 256-frame inputs.
Offline video and multimodal understanding remain competitive with strong decoder-only baselines.
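The first implication can be made concrete with toy arithmetic on sequence lengths. Only the 256-frame count comes from the source; the tokens-per-frame and text-length values below are assumptions chosen for illustration, and the helper name is hypothetical. The sketch counts tokens in the autoregressive sequence and does not reproduce or explain the reported 5x and 2.7x figures.

```python
def autoregressive_length(text_tokens, frames, tokens_per_frame, side_channel):
    """Number of tokens that sit in the main autoregressive sequence."""
    visual_tokens = frames * tokens_per_frame
    if side_channel:
        # Visual tokens live in a separate cross-attention memory.
        return text_tokens
    # Decoder-only: visual tokens are interleaved into the same sequence.
    return text_tokens + visual_tokens

FRAMES = 256            # frame count quoted in the source
TOKENS_PER_FRAME = 64   # assumed for illustration
TEXT_TOKENS = 128       # assumed for illustration

decoder_only = autoregressive_length(TEXT_TOKENS, FRAMES, TOKENS_PER_FRAME, side_channel=False)
two_channel = autoregressive_length(TEXT_TOKENS, FRAMES, TOKENS_PER_FRAME, side_channel=True)
print(decoder_only, two_channel)  # 16512 128
```

Cross-attention still spends compute reading the visual memory, so this is a statement about where the tokens sit, not a speed estimate.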

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same side-channel design could be applied to audio or other streaming modalities without retraining the language backbone.
Independent compression opens the possibility of running the vision encoder at a lower rate or on a separate device.
Real-time revision behavior may transfer to live camera feeds or interactive agents once the data pipeline is adapted.

Load-bearing premise

Converting dense captions into real-time QA pairs whose answers are revised to match only what the model has perceived so far will produce genuine real-time behavior when an offline model is specialized on them.
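The constraint in this premise, that an answer at time t may use only what was visible up to t, can be sketched as follows. This is a hypothetical illustration: the `Caption` type, the `<silence>` marker, and the concatenation that stands in for answer rewriting are all assumptions, and the paper's actual pipeline is not described at this level of detail in the text above.

```python
from dataclasses import dataclass

SILENCE = "<silence>"  # hypothetical marker for "nothing new to say"

@dataclass
class Caption:
    t: float    # timestamp of the described event, in seconds
    text: str

def streaming_targets(captions, ask_time, step_times):
    """Build one training target per step using only captions with t <= now."""
    targets, seen = [], 0
    for now in step_times:
        if now < ask_time:
            continue
        visible = [c for c in captions if c.t <= now]
        if len(visible) > seen:
            # New evidence arrived: emit a revised answer (placeholder: concatenation).
            targets.append((now, " ".join(c.text for c in visible)))
            seen = len(visible)
        else:
            targets.append((now, SILENCE))
    return targets

captions = [Caption(1.0, "A man picks up a cup."),
            Caption(3.0, "He pours coffee into it.")]
targets = streaming_targets(captions, ask_time=0.0, step_times=[0, 1, 2, 3, 4])
for step in targets:
    print(step)
```

Filtering on `c.t <= now` is what keeps future information out of each target. Whether a model trained on such targets learns genuine incremental reasoning is the open question this premise carries.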

What would settle it

A controlled run would count against the claim if the fine-tuned model either has to process every new frame at the same rate as token generation or fails to revise an answer once contradictory frames arrive.

Original abstract

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's two-channel cross-attention design for non-blocking real-time video fusion is the actual new piece, but the data synthesis pipeline that is supposed to produce revision and silence behaviors rests on thin evidence.

The punchline is that this work tries to move video-language models from offline single-answer mode to something that can keep perceiving while generating, revise answers on new frames, and stay quiet when nothing new arrives. They do this with a cross-attention backbone where visual tokens come in a separate side channel rather than being appended to the text sequence. That separation is the concrete architectural move.

What stands out is the explicit argument for why decoder-only models are a bad fit here: perception gets blocked by generation, and the model has to re-process visuals at every step. The side-channel approach reduces visual processing frequency and gives a clean interface for compression. They also describe a synthesis method that turns dense captions into QA pairs whose answers are rewritten to reflect only the frames seen so far, then fine-tune an offline model on that data. On a single H200 they get roughly 5x lower time-to-first-token and 2.7x higher throughput at 256 frames while keeping offline video and spatial reasoning scores competitive with Qwen2.5-VL-7B.

The speed numbers and the acquisition of revision/silence behaviors are the parts that feel like measurable progress. The architecture description is straightforward and the motivation about non-blocking pathways is easy to follow.

The soft spot is the data step. Because the source captions contain the entire video, the revised QA pairs may simply teach pattern matching on complete information rather than forcing the model to handle genuine streaming uncertainty. The paper itself notes that the model still trails the baseline and attributes the gap to data and scale, but there are no ablations that isolate whether the synthesis method actually produces incremental evidence handling or just surface-level revision. Without error analysis on when the model revises correctly versus incorrectly, or comparisons against simpler fine-tuning on real streaming data, the behavioral claims rest more on the pipeline than on the architecture.

This is for people already working on real-time multimodal systems who want an alternative to decoder-only scaling. It is worth sending to peer review because the paradigm question is live and the speed results are concrete enough to test further, even if the current evidence for the central behavioral advantage is preliminary.

Referee Report

3 major / 2 minor

Summary. The paper presents MOSS-Video-Preview, a two-channel cross-attention architecture for real-time video understanding. It argues that visual features should enter via a side channel rather than the autoregressive sequence, enabling non-blocking perception and generation pathways that reduce visual processing frequency and allow independent compression. A data synthesis pipeline converts dense captions into real-time QA (with answers revised to match perceived frames so far) to specialize an offline model and elicit behaviors such as continuous perception, answer revision, and timely silence. The model trails Qwen2.5-VL-7B overall (gap attributed to data/scale) but achieves competitive offline performance, remains robust on spatial and fine-grained temporal reasoning, and delivers ~5x TTFT speedup and 2.7x decoding throughput on a single H200 with 256 frames per video.

Significance. If the architecture and synthesis approach hold, the work outlines a concrete path to efficient real-time vision-language models by separating perception and generation channels, with reported speedups and modularity benefits for compression. The explicit two-channel design and the attempt to induce streaming behaviors from offline pretraining are notable contributions, though the absence of detailed quantitative results, error analysis, or ablations limits the strength of the evidence for the central paradigm claim.

major comments (3)

[Abstract and §3, data synthesis pipeline] The behavioral claims (continuous perception, answer revision, timely silence) rest on the synthesis step that revises answers to match perceived frames so far, yet no ablation or comparison to standard fine-tuning is reported to show that this elicits genuine incremental evidence handling under streaming uncertainty rather than pattern-matching from dense captions that contain future information.
[Abstract] The claim that the cross-attention design is 'better suited to real-time vision-language fusion' is undercut by the model trailing the Qwen2.5-VL-7B baseline overall, with the performance gap attributed to data and scale without isolating the architecture's contribution via controlled experiments.
[Abstract] While 5x TTFT and 2.7x throughput gains are reported, no detailed quantitative results, error analysis, or per-task breakdowns are supplied to support that the model 'remains robust on the spatial and fine-grained temporal reasoning central to real-time use.'

minor comments (2)

[Methods] Notation for the two-channel cross-attention (visual side channel vs. language autoregressive path) should be formalized with equations in the methods section for reproducibility.
[Experiments] The manuscript would benefit from a table comparing real-time behaviors (revision frequency, silence rate) against decoder-only baselines on the same synthetic QA.
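The table requested in the last comment needs operational definitions. One possible set is sketched below. These definitions are assumptions, not the paper's: silence rate is taken as the fraction of steps with no emission, and revision frequency as the fraction of steps whose emission differs from the previous non-silent emission.

```python
def behavior_rates(outputs):
    """outputs: per-step emissions in time order; None means the model stayed silent."""
    steps = len(outputs)
    if steps == 0:
        return {"silence_rate": 0.0, "revision_frequency": 0.0}
    silent = sum(1 for o in outputs if o is None)
    revisions, last = 0, None
    for o in outputs:
        if o is None:
            continue
        if last is not None and o != last:
            revisions += 1  # a non-silent emission that changes the previous answer
        last = o
    return {"silence_rate": silent / steps,
            "revision_frequency": revisions / steps}

stream = ["cup", None, "cup of coffee", None, None, "cup of tea"]
rates = behavior_rates(stream)
print(rates)
```

Running the same function over the outputs of a decoder-only baseline on the same synthetic QA would give the side-by-side comparison the referee asks for. It measures how often answers change, not whether the changes are correct.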

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below with clarifications on our contributions and indicate where revisions to the manuscript are planned.

Referee: [Abstract and §3, data synthesis pipeline] The behavioral claims (continuous perception, answer revision, timely silence) rest on the synthesis step that revises answers to match perceived frames so far, yet no ablation or comparison to standard fine-tuning is reported to show that this elicits genuine incremental evidence handling under streaming uncertainty rather than pattern-matching from dense captions that contain future information.

Authors: The synthesis pipeline explicitly constructs QA pairs by revising answers to align only with frames perceived up to the current timestep, a step that standard fine-tuning on full dense captions does not perform. This design targets incremental reasoning under partial information. We agree that an explicit ablation against unmodified fine-tuning would provide additional support and will add a discussion of this distinction in the revised manuscript, along with any feasible comparative results.

revision: partial

Referee: [Abstract] The claim that the cross-attention design is 'better suited to real-time vision-language fusion' is undercut by the model trailing the Qwen2.5-VL-7B baseline overall, with the performance gap attributed to data and scale without isolating the architecture's contribution via controlled experiments.

Authors: The suitability claim rests on the architectural separation of perception and generation channels, which directly enables the non-blocking pathways and the measured efficiency gains (5x TTFT, 2.7x throughput). The overall accuracy comparison is to a larger-scale model trained under different data regimes; we attribute the gap primarily to those factors rather than the backbone. A controlled same-data, same-scale isolation experiment is computationally prohibitive at this stage and is noted as future work, but the paper supplies the design rationale and concrete efficiency evidence.

revision: no

Referee: [Abstract] While 5x TTFT and 2.7x throughput gains are reported, no detailed quantitative results, error analysis, or per-task breakdowns are supplied to support that the model 'remains robust on the spatial and fine-grained temporal reasoning central to real-time use.'

Authors: The abstract condenses the offline evaluation results reported in the main body, which include competitive performance on spatial and temporal tasks. We concur that expanded per-task breakdowns and error analysis would strengthen the robustness statement and will incorporate additional quantitative details and breakdowns from our existing experiments in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No circularity: architecture motivated by independent design arguments; data synthesis is a separate training method

The paper's central derivation is a design argument that cross-attention enables non-blocking perception-generation pathways via a side channel, stated directly in the abstract and introduction without reference to fitted quantities or self-referential equations. The data synthesis pipeline that converts dense captions into revised QA is presented as an empirical complement to elicit behaviors, not as a prediction that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. Performance claims are benchmarked externally against Qwen2.5-VL-7B and attributed to data/scale differences, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claims rest on the effectiveness of separating perception and generation via cross-attention and on the data synthesis method producing genuine real-time capabilities; no free parameters, additional axioms, or invented entities beyond the architecture itself are detailed.

axioms (1)

Domain assumption: Cross-attention can effectively fuse vision and language in a non-blocking manner for real-time tasks.
Invoked as the basis for preferring cross-attention over decoder-only designs.

invented entities (1)

Two-channel cross-attention architecture for real-time video · no independent evidence
Purpose: To enable separate non-blocking pathways for perception and generation.
Newly proposed as the core architectural solution.


Pith/arXiv arXiv 2025
[46]

FlashAttention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023.https://arxiv.org/abs/2307.08691

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023.https://arxiv.org/abs/2307.08691

Pith/arXiv arXiv 2023
[47]

Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017.https://arxiv.org/abs/1707.06347

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017.https://arxiv.org/abs/1707.06347

Pith/arXiv arXiv 2017
[48]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025.https://arxiv.org/abs/ 2501.12948

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025.https://arxiv.org/abs/ 2501.12948

Pith/arXiv arXiv 2025
[49]

Megatron- LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019.https://arxiv.org/abs/1909.08053

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019.https://arxiv.org/abs/1909.08053

Pith/arXiv arXiv 1909
[50]

Efficient large-scale language model training on GPU clusters using Megatron-LM

DeepakNarayanan,MohammadShoeybi,JaredCasper,PatrickLeGresley,MostofaPatwary,VĳayAnandKorthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. InProceedings of the International Conference for High Performance Computing, Networking, Storage a...

arXiv 2021

[1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2304.08485, https://arxiv.org/abs/2304.08485

[2] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video Instruction Tuning With Synthetic Data. Transactions on Machine Learning Research (TMLR), 2024. arXiv:2410.02713, https://arxiv.org/abs/2410.02713

[3] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024. arXiv:2407.15754, https://arxiv.org/abs/2407.15754

[4] Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. VideoLLM-online: Online video large language model for streaming video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2406.11816, https://arxiv.org/abs/2406.11816

[5] Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, and Dongyan Zhao. VideoLLM knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format. In Findings of the Association for Computational Linguistics (EMNLP), 2025. arXiv:2411.17991, https://arxiv.org/abs/2411.17991

[6] Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video LLMs with active real-time interaction via disentangled perception, decision, and reaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2501.03218, https://arxiv.org/abs/...

[7] Shuai Bai, Qwen Team, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025. https://arxiv.org/abs/2511.21631

[8] Xiang An, Yin Xie, Feilong Tang, et al. LLaVA-OneVision-2: Towards next-generation perceptual intelligence. arXiv preprint arXiv:2605.25979, 2026. https://arxiv.org/abs/2605.25979

[9] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2204.14198, https://arxiv.org/abs/2204.14198

[10] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. https://arxiv.org/abs/2407.21783. The section on multi-modal extensions describes the cross-attention design later released as Llama 3.2-Vision (11B/90B).

[11] Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. StreamingBench: Assessing the gap for MLLMs to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024. https://arxiv.org/abs/2411.03628

[12] Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, et al. OVO-Bench: How far is your video-LLMs from real-world online video understanding? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2501.05510, https://arxiv.org/abs/2501.05510

[13] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. https://arxiv.org/abs/2502.13923

[14] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. arXiv:2104.09864, https://arxiv.org/abs/2104.09864

[15] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. arXiv:2304.06939, https://arxi...

[16] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. arXiv:2306...

[17] Pengyu Wang, Shaojun Zhou, Chenkun Tan, et al. UnifiedVisual: A framework for constructing unified vision-language datasets. arXiv preprint arXiv:2509.14738, 2025. https://arxiv.org/abs/2509.14738

[18] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. In European Conference on Computer Vision (ECCV), 2024. arXiv:2311.12793, https://arxiv.org/abs/2311.12793

[19] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024. arXiv:2406.04325, https://arxiv.org/abs/2406.04325

[20] Chenkun Tan, Pengyu Wang, Shaojun Zhou, et al. Decoupled Proxy Alignment: Mitigating language prior conflict for multimodal alignment in MLLM. arXiv preprint arXiv:2509.14735, 2025. https://arxiv.org/abs/2509.14735

[21] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020. arXiv:1910.02054, https://arxiv.org/abs/1910.02054

[22] Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zhengxue Cheng, et al. LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025. https://arxiv.org/abs/2509.23661

[23] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 2024. arXiv:2305.07895, https://arxiv.org/abs/2305.07895

[24] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2403.20330, https://arxiv.org/abs/2403.20330

[25] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision (ECCV), 2024. arXiv:2307.06281, https://arxiv.org/abs/2307.06281

[26] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.16502, https://arx...

[27] xAI. RealWorldQA: A new benchmark for real-world multimodal understanding. Hugging Face dataset, 2024. Released alongside the Grok-1.5 Vision announcement; no accompanying paper. https://huggingface.co/datasets/xai-org/RealworldQA

[28] Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. MuirBench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024. https://arxiv.org/abs/2406.09411

[29] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. https://arxiv.org/abs/2307.16125

[30] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? In International Conference on Learning Representations (ICLR), 2025. arXiv:2408.13257, https://arxiv.or...

[31] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. arXiv:2305.10355, https://arxiv.org/abs/2305.10355

[32] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.16860, https://arxiv.org/abs/2406.16860; CV-Bench is...

[33] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2312.14135, https://arxiv.org/abs/2312.14135

[34] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Min Joon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European Conference on Computer Vision (ECCV), 2016. arXiv:1603.07396, https://arxiv.org/abs/1603.07396

[35] Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, et al. VisuLogic: A benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279, 2025. https://arxiv.org/abs/2504.15279

[36] Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In Asian Conference on Computer Vision (ACCV), 2024. arXiv:2407.06581, https://arxiv.org/abs/2407.06581

[37] Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, et al. ZeroBench: An impossible visual benchmark for contemporary large multimodal models. arXiv preprint arXiv:2502.09696, 2025. https://arxiv.org/abs/2502.09696

[38] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2405.21075, https://arxiv.org/abs/2405.21075

[40] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. arXiv:2308.09126, https://arxiv.org/abs/2308.09126

[41] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2406.04264, https://arxiv.org/abs/2406.04264

[42] Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. LVBench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. arXiv:2406.08035, https://arxiv.org/abs/2406.08035

[43] Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In Findings of the Association for Computational Linguistics (ACL), 2024. arXiv:2403.00476, https://arxiv.org/abs/2403.00476

[44] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2412.14171, https://arxiv.org/abs/2412.14171. Introduces the VSI-Bench benchmark.

[45] Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-Holmes: Can MLLM think like Holmes for complex video reasoning? arXiv preprint arXiv:2505.21374, 2025. https://arxiv.org/abs/2505.21374

[46] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023. https://arxiv.org/abs/2307.08691

[47] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. https://arxiv.org/abs/1707.06347

[48] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. https://arxiv.org/abs/2501.12948

[49] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. https://arxiv.org/abs/1909.08053

[50] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage a...