TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
Pith reviewed 2026-05-13 06:46 UTC · model grok-4.3
The pith
Video large language models fail to maintain temporal object consistency across frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TOC-Bench consists of 2,323 high-quality, human-verified QA pairs over 1,951 videos, each grounded in object tracks and temporal event timelines. A three-layer temporal-necessity protocol removes 60.7% of candidate QA pairs before verification, and tests on representative Video-LLMs reveal major weaknesses in temporal object consistency.
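To make the track grounding concrete, the snippet below sketches what a single object-track grounded QA item could look like. All field names and values are hypothetical illustrations, not the released TOC-Bench schema (see the linked repository for the actual format).

```python
# Hypothetical shape of one TOC-Bench item. Field names and values are
# illustrative assumptions, not the released schema.
example_item = {
    "video_id": "v_000123",
    "subject": "the red backpack",
    # Per-frame trajectory of the queried subject: frame index -> [x, y, w, h];
    # None marks frames where the object is occluded or off-screen.
    "track": {0: [412, 188, 96, 140], 1: [418, 190, 95, 138], 2: None, 3: [505, 201, 90, 131]},
    # Structured temporal event timeline for the same subject.
    "timeline": [
        {"frame": 0, "event": "visible on the bench"},
        {"frame": 2, "event": "occluded by a passerby"},
        {"frame": 3, "event": "reappears, carried by a second person"},
    ],
    # A QA pair that is only answerable from the ordered evidence above.
    "qa": {
        "question": "After the backpack is occluded, who is holding it when it reappears?",
        "options": ["no one", "its original owner", "a second person", "it never reappears"],
        "answer": "a second person",
        "dimension": "identity-sensitive reasoning",
    },
}
```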
What carries the argument
The three-layer temporal-necessity filtering protocol applied to object-track grounded questions to retain only those requiring temporally ordered visual evidence.
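A minimal sketch of what such a three-layer filter could look like in code. The screening interface `answer(question, frames)` and the layer order are assumptions made for illustration; the paper's actual protocol may differ.

```python
import random

def temporal_necessity_filter(item, answer, num_shuffles=3):
    """Keep a QA candidate only if no shortcut route answers it correctly.

    `item` has 'question', 'answer', and temporally ordered 'frames';
    `answer(question, frames)` is any screening model (frames may be empty).
    This mirrors the spirit of a three-layer protocol, not its exact rules.
    """
    q, gold, frames = item["question"], item["answer"], item["frames"]

    # Layer 1: language-prior shortcut (text only, no visual evidence).
    if answer(q, []) == gold:
        return False

    # Layer 2: single-frame shortcut (any one frame suffices).
    if any(answer(q, [f]) == gold for f in frames):
        return False

    # Layer 3: unordered-frame shortcut (shuffled frames still suffice).
    for _ in range(num_shuffles):
        if answer(q, random.sample(frames, len(frames))) == gold:
            return False

    # Only candidates that need temporally ordered evidence survive.
    return True
```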
If this is right
- General video benchmarks miss key limitations in object consistency.
- Targeted improvements are needed in identity-sensitive and ordering tasks.
- TOC-Bench offers a way to diagnose and advance object-aware temporal reasoning.
Where Pith is reading between the lines
- Models might benefit from explicit object state tracking modules integrated into their architecture.
- The approach could be adapted to test consistency in other sequential data like audio or text narratives.
- Longer videos with more reappearances would likely show even lower performance.
Load-bearing premise
The three-layer temporal-necessity filtering protocol successfully eliminates all questions answerable without temporally ordered visual evidence from the video frames.
What would settle it
If Video-LLMs that top general benchmarks also scored highly on TOC-Bench without any training targeted at object tracking, the claim that temporal object consistency is a major unsolved challenge would be undermined.
Original abstract
Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. Existing benchmarks often emphasize event recognition, action understanding, or coarse temporal reasoning, while rarely testing whether models can preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs. TOC-Bench is object-track grounded: each queried subject is linked to a per-frame trajectory and a structured temporal event timeline. To ensure that questions require temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we design a three-layer temporal-necessity filtering protocol, which removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items across 10 diagnostic dimensions. From this pool, we construct a human-verified benchmark with 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge, with notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, even when models perform well on general video understanding benchmarks. These results suggest that object-centric temporal coherence is a key bottleneck for current Video-LLMs, and that TOC-Bench provides a focused platform for diagnosing and improving object-aware temporal reasoning. The resource is available at https://github.com/cjzcjz666/toc_bench.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TOC-Bench, a diagnostic benchmark of 2,323 human-verified QA pairs over 1,951 videos for evaluating temporal object consistency in Video-LLMs. Each item is object-track grounded with per-frame trajectories and event timelines; a three-layer temporal-necessity filtering protocol removes 60.7% of candidates to retain only questions requiring temporally ordered visual evidence of object identity, state, and continuity across 10 diagnostic dimensions. Experiments on representative Video-LLMs demonstrate persistent weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, even when models perform well on general video understanding tasks.
Significance. If the filtering protocol is shown to be effective, TOC-Bench would supply a focused, object-centric diagnostic that isolates a previously underexplored bottleneck in Video-LLMs. The track-grounded construction and explicit separation from language priors or single-frame shortcuts would make the benchmark a useful complement to existing video QA resources and a concrete target for model improvement.
major comments (3)
- [Methods (three-layer filtering protocol)] The three-layer temporal-necessity filtering protocol (described in the methods) removes 60.7% of candidates, retaining 17,900 items from which the 2,323 human-verified pairs are drawn, yet the manuscript reports no quantitative validation (such as model accuracy on frame-shuffled inputs, single-frame inputs, or text-only baselines) for the retained set. Without these controls, it remains possible that a non-negligible fraction of items can still be solved via language priors or unordered cues, undermining the attribution of observed failures specifically to temporal object consistency.
- [Benchmark construction and human verification] The human-verification step that produces the final 2,323 QA pairs lacks reported inter-annotator agreement rates, exact annotation guidelines, or disagreement-resolution procedures. These details are required to establish that the retained items genuinely demand temporally ordered visual evidence rather than subjective interpretation.
- [Experiments and results] The experimental results section states that models exhibit notable weaknesses in event counting, ordering, and identity reasoning, but provides no per-dimension accuracy tables, comparison against strong text-only or image-only baselines, or statistical significance tests. These omissions make it difficult to quantify how much the reported gaps exceed general video-LLM limitations.
minor comments (2)
- [Abstract and §3] The abstract and methods would benefit from an explicit enumeration of the ten diagnostic dimensions and one or two concrete QA examples per dimension to illustrate the temporal requirements.
- [Results figures] Figure captions and axis labels in the results figures should include exact numerical values rather than relying solely on bar heights for readability.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment point by point below. We agree that additional validations, details, and tables will strengthen the paper and will incorporate them in the revised version.
Point-by-point responses
-
Referee: The three-layer temporal-necessity filtering protocol (described in the methods) removes 60.7% of candidates, retaining 17,900 items from which the 2,323 human-verified pairs are drawn, yet the manuscript reports no quantitative validation (such as model accuracy on frame-shuffled inputs, single-frame inputs, or text-only baselines) for the retained set. Without these controls, it remains possible that a non-negligible fraction of items can still be solved via language priors or unordered cues, undermining the attribution of observed failures specifically to temporal object consistency.
Authors: We thank the referee for this observation. The three-layer protocol was explicitly designed to eliminate candidates solvable without temporally ordered visual evidence of object identity and continuity, but we did not report quantitative controls such as frame-shuffled, single-frame, or text-only baselines on the final set. We will add these experiments to the revised methods and results sections, including accuracy tables showing performance drops under these conditions to confirm that retained items require temporal object consistency. revision: yes
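As a rough sketch of the controls promised above, the helper below measures one model's accuracy under ordered, shuffled, single-frame, and text-only input. The `answer(question, frames)` wrapper and field names are placeholders, not the authors' evaluation code; a large gap between the ordered condition and the rest would support the temporal-necessity claim.

```python
import random

def evaluate_controls(items, answer):
    """Accuracy under four input conditions for the same QA items.

    `items` are dicts with 'question', 'answer', and ordered 'frames';
    `answer(question, frames)` wraps the Video-LLM under test.
    All names are illustrative assumptions.
    """
    conditions = {
        "ordered": lambda fs: fs,
        "shuffled": lambda fs: random.sample(fs, len(fs)),
        "single_frame": lambda fs: [fs[len(fs) // 2]],  # middle frame only
        "text_only": lambda fs: [],
    }
    accuracy = {}
    for name, transform in conditions.items():
        correct = sum(
            answer(it["question"], transform(it["frames"])) == it["answer"]
            for it in items
        )
        accuracy[name] = correct / len(items)
    return accuracy
```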
-
Referee: The human-verification step that produces the final 2,323 QA pairs lacks reported inter-annotator agreement rates, exact annotation guidelines, or disagreement-resolution procedures. These details are required to establish that the retained items genuinely demand temporally ordered visual evidence rather than subjective interpretation.
Authors: We agree that these details are necessary for rigor and reproducibility. The verification process used multiple annotators with guidelines focused on confirming temporal dependency, object-track grounding, and rejection of language-prior shortcuts, with disagreements resolved via discussion and majority vote. In the revision we will include the full annotation guidelines, inter-annotator agreement rates (e.g., Fleiss' kappa), and the resolution procedure. revision: yes
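One conventional way to report the promised agreement numbers is Fleiss' kappa over per-item annotator verdicts. The sketch below uses statsmodels' implementation; the verdict labels and matrix are made-up illustrations, not the paper's data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical verdicts: rows are candidate QA items, columns are annotators,
# labels are 0 = reject, 1 = keep, 2 = revise. Numbers are illustrative only.
verdicts = np.array([
    [1, 1, 1],
    [1, 1, 2],
    [0, 0, 0],
    [1, 2, 1],
    [0, 1, 0],
])

# Convert per-rater labels into an items-by-categories count table, then
# compute chance-corrected agreement across the three annotators.
counts, _ = aggregate_raters(verdicts)
print(f"Fleiss' kappa = {fleiss_kappa(counts, method='fleiss'):.3f}")
```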
-
Referee: The experimental results section states that models exhibit notable weaknesses in event counting, ordering, and identity reasoning, but provides no per-dimension accuracy tables, comparison against strong text-only or image-only baselines, or statistical significance tests. These omissions make it difficult to quantify how much the reported gaps exceed general video-LLM limitations.
Authors: We acknowledge that the current results section would benefit from greater granularity. The revised manuscript will add per-dimension accuracy tables for all 10 diagnostic dimensions, direct comparisons against text-only and image-only baselines, and statistical significance tests (e.g., paired t-tests) to quantify how the observed weaknesses in temporal object consistency exceed general video understanding performance. revision: yes
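A small sketch of the per-dimension breakdown and significance test described in this response. The record format and the use of a paired t-test over per-item correctness are assumptions for illustration, not the authors' analysis code.

```python
from collections import defaultdict
from scipy.stats import ttest_rel

def per_dimension_accuracy(records):
    """records: dicts with a 'dimension' label and a boolean 'correct' flag
    per evaluated item (field names are illustrative). Returns accuracy
    for each diagnostic dimension present in the records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += int(r["correct"])
    return {dim: hits[dim] / totals[dim] for dim in totals}

def paired_significance(video_correct, baseline_correct):
    """Paired t-test over per-item correctness (0/1) of the same items under
    full-video input versus a text-only or image-only baseline."""
    return ttest_rel(video_correct, baseline_correct)
```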
Circularity Check
No circularity: benchmark construction is self-contained with no derivations or self-referential steps
full rationale
The manuscript constructs TOC-Bench via data sourcing, a three-layer filtering protocol, human verification, and empirical testing on external Video-LLMs. No equations, fitted parameters, predictions, or mathematical derivations appear. The filtering step is described as a methodological design choice to retain temporally dependent QA pairs (removing 60.7% of candidates), but the paper does not reduce this claim to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain. Evaluation results are reported against independent models on general benchmarks, providing external comparison. This is a standard benchmark paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The three-layer filtering protocol removes questions answerable without temporally ordered visual evidence.