When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

Hai "Helen" Li; Lichen Zhu; Yiheng Wang; Yiran Chen; Yudong Liu; Yueqian Lin

arxiv: 2606.08239 · v1 · pith:ROWV2SZCnew · submitted 2026-06-06 · 💻 cs.AI · cs.CL · cs.CV

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen

Pith reviewed 2026-06-27 19:39 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV

keywords MLLMs · video understanding · absent answer detection · multiple-choice evaluation · temporal reasoning · chain-of-thought prompting

The pith

MLLMs in video understanding select distractors instead of detecting when no answer option is correct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests multimodal large language models on video benchmarks where the correct answer has been removed from the choices. It evaluates behavior in three setups: multiple-choice questions that add a none-of-the-above option, open-ended responses prompted to detect absence, and standard evaluation without extra guidance. Across many models and tasks, the models still pick plausible wrong answers rather than recognizing that nothing fits. The problem grows worse on temporal reasoning questions and when more video frames are supplied. Chain-of-thought prompting raises detection rates substantially, yet performance remains unsatisfactory, so the core limitation persists.

Core claim

Across a diverse set of models and benchmarks, MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure appears in all three tested settings, is more pronounced in temporal reasoning tasks, and worsens with denser frame sampling. Chain-of-thought prompting improves detection but does not bring performance to a satisfactory level, indicating that prompting alone cannot solve the issue.
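The three settings can be illustrated with a minimal sketch of how an absent-answer item might be built from an ordinary multiple-choice item. Everything named here (`MCQItem`, `remove_answer`, `build_prompt`, the setting labels, and the prompt wording) is hypothetical and chosen for illustration; the paper's actual templates are not reproduced on this page.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import List

NOTA = "None of the Above"  # label assumed for the added option


@dataclass
class MCQItem:
    question: str
    options: List[str]
    answer_index: int  # position of the ground-truth option


def remove_answer(item: MCQItem) -> List[str]:
    """Drop the ground-truth option and keep the distractors in order."""
    return [opt for i, opt in enumerate(item.options) if i != item.answer_index]


def format_options(options: List[str]) -> str:
    """Render options as 'A. ...', 'B. ...' lines."""
    return "\n".join(f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options))


def build_prompt(item: MCQItem, setting: str) -> str:
    """Build a text prompt for one of three hypothetical setting labels."""
    distractors = remove_answer(item)
    if setting == "mcq_nota":
        # Multiple choice with an explicit escape option appended last.
        body = format_options(distractors + [NOTA])
        instruction = "Answer with the option letter."
    elif setting == "open_ended":
        # Free-form response with an explicit detection instruction (wording assumed).
        body = format_options(distractors)
        instruction = "If none of the options is correct, say so explicitly."
    elif setting == "standard":
        # No guidance: the model only sees the distractors.
        body = format_options(distractors)
        instruction = "Answer with the option letter."
    else:
        raise ValueError(f"unknown setting: {setting}")
    return f"{item.question}\n{body}\n{instruction}"
```

Keeping the removal step separate from the prompt formatting means the same distractor set is reused across all three settings, so differences in model behavior can be attributed to the prompt rather than to the item.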

What carries the argument

Absent answer detection tested through three evaluation settings on video understanding benchmarks.

If this is right

Explicit detection mechanisms must be added to multimodal systems beyond current prompting techniques.
Temporal reasoning tasks expose the limitation more than other video understanding tasks.
Increasing the number of sampled frames increases the rate of incorrect selections.
Prompt-based mitigation strategies improve results but leave performance unsatisfactory.
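To make the frame-density point concrete, here is a small sketch of how a frame budget could be varied. The page does not say how the paper samples frames, so uniform sampling and the helper name `uniform_frame_indices` are assumptions made only for illustration.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick the centre frame of each of num_samples equal-length segments."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Never request more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int((i + 0.5) * segment) for i in range(num_samples)]
```

Sweeping `num_samples` over a few values for the same video and question gives the sparse-to-dense comparison the bullet above refers to.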

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployed video QA systems could produce confidently incorrect answers whenever the true information is absent from the options.
Training procedures that reward abstention when no option matches may be needed to correct the behavior.
The same failure pattern may appear in other multimodal tasks that use multiple-choice formats.

Load-bearing premise

The benchmark questions are constructed so that none of the listed options is correct, and a capable model is expected to notice this absence rather than always choose the best available match.

What would settle it

A controlled test set of video questions where the correct option is verifiably missing from the choices, with models scored on whether they output an explicit refusal or none-of-the-above response at rates clearly above chance.
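A scoring routine for such a test could look like the sketch below. The matching patterns, the function names, and the uniform-guessing chance baseline are assumptions for illustration; a real evaluation would likely need a more careful response parser.

```python
import re
from typing import List, Optional

# Phrases treated as an explicit absence response (assumed, not exhaustive).
ABSENCE_PATTERNS = (
    r"\bnone of the above\b",
    r"\bnone of the (?:options|choices)\b",
    r"\bno (?:correct|valid) (?:answer|option)\b",
)


def detects_absence(response: str, nota_letter: Optional[str] = None) -> bool:
    """True if the response is the none-of-the-above letter or an absence phrase."""
    text = response.strip().lower()
    if nota_letter is not None:
        letter = re.escape(nota_letter.lower())
        # Accept bare letters such as 'D', '(d)', 'D.' or 'd:'.
        if re.fullmatch(rf"\(?{letter}\)?[.:]?", text):
            return True
    return any(re.search(pattern, text) for pattern in ABSENCE_PATTERNS)


def detection_rate(responses: List[str], nota_letter: Optional[str] = None) -> float:
    """Fraction of responses that flag the absent answer."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(detects_absence(r, nota_letter) for r in responses) / len(responses)


def chance_rate(num_options_including_nota: int) -> float:
    """Detection rate expected from uniform random guessing over the options."""
    return 1.0 / num_options_including_nota
```

Comparing `detection_rate` against `chance_rate` is only meaningful in the multiple-choice setting with an added option; for open-ended responses there is no natural chance level.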

Original abstract

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MLLMs pick distractors instead of flagging absent answers in video tasks, but the tests need checks that the modified questions truly have no valid option.

The main point is that MLLMs in video understanding keep choosing plausible wrong answers even when the correct one has been removed from the options. They do not reliably detect that nothing fits.

What stands out is the narrow focus on absent-answer detection across three concrete setups: multiple-choice with a None option added, open-ended with an explicit detection prompt, and plain evaluation. The patterns hold across models and benchmarks, and the paper notes the issue is sharper on temporal reasoning and gets worse with denser sampling. Testing chain-of-thought as a partial fix and showing it helps but falls short is a straightforward addition.

The main soft spot is the question construction itself. The abstract states the correct answer was excluded, yet there is no mention of human checks or error analysis confirming the remaining choices are never acceptable given the video. Temporal items are especially open to multiple readings, so some distractors could still be reasonable. If that happens even at modest rates, the reported failure rates mix real detection problems with cases where a choice is actually valid.
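The size of this confound can be bounded with simple arithmetic. Suppose a fraction c of modified items still contains an acceptable option, and the observed distractor-selection rate is f_obs. Writing s for the unknown selection rate on those contaminated items, f_obs = (1 - c) * f_true + c * s, so the failure rate on clean items lies between (f_obs - c) / (1 - c) and f_obs / (1 - c). The sketch below computes that interval; the function name and the example numbers are illustrative and not taken from the paper.

```python
from typing import Tuple


def true_failure_bounds(observed_failure: float, contamination: float) -> Tuple[float, float]:
    """Bound the failure rate on clean items given a contamination fraction.

    observed_failure: share of all items where the model picked a retained option.
    contamination: share of items where some retained option is actually acceptable.
    """
    if not 0.0 <= observed_failure <= 1.0:
        raise ValueError("observed_failure must be in [0, 1]")
    if not 0.0 <= contamination < 1.0:
        raise ValueError("contamination must be in [0, 1)")
    clean = 1.0 - contamination
    # Lower bound: every contaminated item was counted as a failure (s = 1).
    lower = max(0.0, (observed_failure - contamination) / clean)
    # Upper bound: no contaminated item was counted as a failure (s = 0).
    upper = min(1.0, observed_failure / clean)
    return lower, upper
```

With an illustrative observed rate of 0.9 and 10% contamination, the clean-item failure rate would sit between roughly 0.89 and 1.0, which shows how modest contamination shifts the estimate without erasing a large effect.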

This work is aimed at people building or evaluating multimodal video systems who care about reliability under uncertainty. It flags a practical gap without claiming to solve it.

The paper is coherent on its own terms and the empirical observation is worth referee time, so it should go to peer review for the methods and verification details to be examined.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a diagnostic empirical study of absent answer detection in MLLMs for video understanding. By modifying existing benchmarks to exclude the ground-truth answer from the option set, the authors evaluate models across three settings (MCQ with added 'None of the Above', open-ended generation with explicit detection instructions, and standard evaluation) and report that models overwhelmingly select plausible distractors rather than indicating absence. The failure is reported as more severe on temporal-reasoning tasks and with denser frame sampling; chain-of-thought prompting improves rates but leaves performance unsatisfactory.

Significance. If the central empirical patterns are confirmed, the work usefully documents a systematic reliability gap in current MLLMs when required to recognize insufficient information in video QA. The breadth of models and benchmarks tested yields consistent observations without reliance on fitted parameters or derivations. The inclusion of a mitigation experiment provides a concrete baseline, though the paper itself notes its limitations.

major comments (1)

§3 (Benchmark construction and question modification): The paper states that the correct answer is deliberately excluded while retaining distractors (or adding 'None of the Above'), yet reports no human verification, inter-annotator agreement, or error analysis confirming that none of the remaining options is valid given the video content. This is load-bearing for the central claim; without it, especially on temporal-reasoning items where multiple interpretations are possible, the reported 'failure to detect absence' rates could be confounded by cases where a selected distractor is actually acceptable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on benchmark construction. We address the concern point-by-point below and outline planned revisions.

Referee: §3 (Benchmark construction and question modification): The paper states that the correct answer is deliberately excluded while retaining distractors (or adding 'None of the Above'), yet reports no human verification, inter-annotator agreement, or error analysis confirming that none of the remaining options is valid given the video content. This is load-bearing for the central claim; without it, especially on temporal-reasoning items where multiple interpretations are possible, the reported 'failure to detect absence' rates could be confounded by cases where a selected distractor is actually acceptable.

Authors: We agree this is an important point. The benchmark modifications rely on the original human-annotated ground-truth answers provided by the source datasets (e.g., ActivityNet-QA, NExT-QA), removing only the designated correct option while retaining the distractors as-is. By construction of the original MCQ benchmarks, the distractors were not selected as correct by annotators. However, we acknowledge that temporal-reasoning questions can admit multiple plausible interpretations, and the absence of new human verification on the modified sets leaves open the possibility of confounding. We will revise §3 to explicitly state this reliance on original annotations, add a dedicated limitations paragraph discussing the potential for alternative valid interpretations in temporal tasks, and include a small-scale post-hoc error analysis (sampling 100 temporal items across two benchmarks for manual review by two authors) to quantify how often a retained distractor could reasonably be viewed as acceptable. These changes will be incorporated in the revision.

revision: partial
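A two-reviewer check of the kind described in the rebuttal is usually summarized with an agreement statistic. The sketch below computes Cohen's kappa from two label lists; the choice of kappa and the function name are assumptions, since the rebuttal does not name a statistic.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two reviewers labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labelling with each reviewer's marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both reviewers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, labels of 1 for "a retained distractor is acceptable" and 0 otherwise could be passed in for the 100 reviewed items; the share of 1s then estimates contamination, and kappa indicates how far the two reviewers' judgments agree beyond chance.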

Circularity Check

0 steps flagged

Empirical evaluation study with no derivations or fitted predictions

This is a purely empirical diagnostic study measuring MLLM behavior on modified video QA benchmarks under absent-answer conditions. No equations, parameter fitting, uniqueness theorems, or ansatzes are present; results are reported as direct observations against external models and benchmarks. The central claim does not reduce to any self-referential construction or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is a diagnostic empirical evaluation.

pith-pipeline@v0.9.1-grok · 5748 in / 963 out tokens · 18060 ms · 2026-06-27T19:39:23.611123+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 11 linked inside Pith

[1]

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jian Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024
[2]

Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report.ar...

Pith/arXiv arXiv 2025
[3]

Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215, 2025

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215, 2025

Pith/arXiv arXiv 2025
[4]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479, 2025

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479, 2025

Pith/arXiv arXiv 2025
[5]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265, 2025

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265, 2025

Pith/arXiv arXiv 2025
[6]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

Pith/arXiv arXiv 2025
[7]

MiMo-VL Technical Report. arXiv preprint arXiv:2506.03569, 2025

Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzh...

arXiv 2025
[8]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24108–24118, 2025

2025
[9]

EgoSchema: A Diagnostic Benchmark for very Long-Form Video Language Understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A Diagnostic Benchmark for very Long-Form Video Language Understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023

2023
[10]

LongVideoBench: A Benchmark for Long-Context Interleaved Video-Language Understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A Benchmark for Long-Context Interleaved Video-Language Understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

2024
[11]

LLMs May Perform MCQA by Selecting the Least Incorrect Option

Haochun Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, and Ting Liu. LLMs May Perform MCQA by Selecting the Least Incorrect Option. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5852–5862, 2025

2025
[12]

Wait, that’s not an option: LLMs Robustness with Incorrect Multiple-Choice Options

Gracjan Góral, Emilia Wiśnios, Piotr Sankowski, and Paweł Budzianowski. Wait, that’s not an option: LLMs Robustness with Incorrect Multiple-Choice Options. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1495–1515, 2025

2025
[13]

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models

Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Helen Li, Ziwei Liu, and Kiyoharu Aizawa. Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 6497–6540, 2025

2025
[14]

MVBench: A Comprehensive Multi-Modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A Comprehensive Multi-Modal Video Understanding Benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024

2024
[15]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv preprint arXiv:2501.13826, 2025

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv preprint arXiv:2501.13826, 2025

Pith/arXiv arXiv 2025
[16]

Mirage: The illusion of visual understanding. arXiv preprint arXiv:2603.21687, 2026

Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage: The illusion of visual understanding. arXiv preprint arXiv:2603.21687, 2026

arXiv 2026
[17]

VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding. arXiv preprint arXiv:2603.07071, 2026

Xueqing Yu, Bohan Li, Yan Li, and Zhenheng Yang. VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding. arXiv preprint arXiv:2603.07071, 2026

arXiv 2026
[18]

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Eunseop Yoon, Hee Suk Yoon, Mark A Hasegawa-Johnson, and Chang D Yoo. Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

2025
[19]

None of the Above, Less of the Right: Parallel Patterns in Human and LLM Performance on Multi-Choice Questions Answering

Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the Above, Less of the Right: Parallel Patterns in Human and LLM Performance on Multi-Choice Questions Answering. In Findings of the Association for Computational Linguistics, pages 20112–20134, 2025

2025
[20]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261, 2025

Pith/arXiv arXiv 2025
[21]

Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[22]

Qwen3-Omni Technical Report. arXiv preprint arXiv:2509.17765, 2025

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-Omni Technical Report. arXiv preprint arXiv:2509.17765, 2025

Pith/arXiv arXiv 2025
[23]

Qwen3.5-Omni Technical Report. arXiv preprint arXiv:2604.15804, 2026

Qwen Team. Qwen3.5-Omni Technical Report. arXiv preprint arXiv:2604.15804, 2026. A. Appendix A.1. Prompt Figure 3 illustrates the prompt templates used across all four evaluation settings. In the baseline setting (a), the model is presented with the original candidate set containing the ground-truth answer and asked to select an option directly. In the multi-c...

Pith/arXiv arXiv 2026
