When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding
Pith reviewed 2026-06-27 19:39 UTC · model grok-4.3
The pith
MLLMs in video understanding select distractors instead of detecting when no answer option is correct.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across a diverse set of models and benchmarks, MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure appears in all three tested settings, is more pronounced in temporal reasoning tasks, and worsens with denser frame sampling. Chain-of-thought prompting improves detection but does not bring performance to a satisfactory level, indicating that prompting alone cannot solve the issue.
What carries the argument
Absent answer detection tested through three evaluation settings on video understanding benchmarks.
If this is right
- Explicit detection mechanisms must be added to multimodal systems beyond current prompting techniques.
- Temporal reasoning tasks expose the limitation more than other video understanding tasks.
- Increasing the number of sampled frames increases the rate of incorrect selections.
- Prompt-based mitigation strategies improve results but leave performance unsatisfactory.
Where Pith is reading between the lines
- Deployed video QA systems could produce confidently incorrect answers whenever the true information is absent from the options.
- Training procedures that reward abstention when no option matches may be needed to correct the behavior.
- The same failure pattern may appear in other multimodal tasks that use multiple-choice formats.
Load-bearing premise
The benchmark questions are constructed so that none of the listed options is correct, and a capable model is expected to notice this absence rather than always choose the best available match.
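The construction this premise describes can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `MCQItem` type, the `make_absent_answer_item` helper, and the decision to shuffle distractors are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

NOTA = "None of the Above"


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice video question."""
    question: str
    options: List[str]
    answer_index: Optional[int]  # None means no listed option is correct


def make_absent_answer_item(item: MCQItem, add_nota: bool = True, seed: int = 0) -> MCQItem:
    """Remove the ground-truth option and optionally append a NOTA option."""
    if item.answer_index is None:
        raise ValueError("item has no ground-truth option to remove")
    # Keep every option except the annotated correct one.
    distractors = [opt for i, opt in enumerate(item.options) if i != item.answer_index]
    # Shuffling is an assumption; it avoids leaking the removed position.
    random.Random(seed).shuffle(distractors)
    if add_nota:
        options = distractors + [NOTA]
        # NOTA is now the only correct choice.
        return MCQItem(item.question, options, len(options) - 1)
    # Without NOTA, no index is correct.
    return MCQItem(item.question, distractors, None)
```

Note that this sketch trusts the original annotation: it cannot tell whether a retained distractor is also acceptable given the video, which is the concern the referee report raises.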
What would settle it
A controlled test set of video questions in which the correct option is verifiably missing, with models scored on whether they output an explicit refusal or none-of-the-above response at rates clearly above chance.
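One way such scoring could look is sketched below. The keyword patterns, the function names, and the choice of an exact one-sided binomial test are assumptions for illustration; the paper's actual scoring rules are not given here. For a multiple-choice item with k options including "None of the Above", uniform guessing would pick the absence option with probability 1/k, which can serve as the chance level.

```python
import re
from math import comb

# Hypothetical phrases counted as an explicit absence response.
ABSENCE_PATTERN = re.compile(
    r"none of the above|no (correct|valid) (option|answer)|cannot be determined",
    re.IGNORECASE,
)


def is_absence_response(text: str) -> bool:
    """Crude keyword check; a real evaluation would need stricter parsing."""
    return bool(ABSENCE_PATTERN.search(text))


def detection_rate(responses: list) -> float:
    """Fraction of responses that explicitly flag the missing answer."""
    hits = sum(is_absence_response(r) for r in responses)
    return hits / len(responses)


def binomial_p_value(hits: int, n: int, chance: float) -> float:
    """Exact one-sided P(X >= hits) for X ~ Binomial(n, chance)."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(hits, n + 1)
    )
```

A small p-value from `binomial_p_value` would indicate detection above the chosen chance level. Keyword matching can misfire on responses that mention an absence phrase while still choosing a distractor, so manual spot checks would still be needed.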
Original abstract
Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.
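The three settings named in the abstract could be assembled roughly as below. This is a hedged sketch: the instruction wording, the setting names, and the choice to show the distractors in the open-ended setting are assumptions, not the paper's prompt templates.

```python
from string import ascii_uppercase


def format_options(options: list) -> str:
    """Render options as lettered lines: 'A. ...', 'B. ...'."""
    return "\n".join(f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options))


def build_prompt(question: str, distractors: list, setting: str) -> str:
    """Build a prompt for one of three hypothetical absent-answer settings."""
    if setting == "mcq_nota":
        # Multiple choice with an explicit absence option appended.
        body = format_options(distractors + ["None of the Above"])
        instruction = "Answer with the letter of the correct option."
    elif setting == "open_ended":
        # Free-form generation with an explicit detection instruction.
        body = format_options(distractors)
        instruction = (
            "Answer the question. If none of the listed options is correct, "
            "say so explicitly."
        )
    elif setting == "standard":
        # No guidance: the model sees only distractors.
        body = format_options(distractors)
        instruction = "Answer with the letter of the correct option."
    else:
        raise ValueError(f"unknown setting: {setting}")
    return f"{question}\n{body}\n{instruction}"
```

Keeping the distractor list identical across settings means any difference in detection rate can be attributed to the instruction rather than to the options shown.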
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a diagnostic empirical study of absent answer detection in MLLMs for video understanding. By modifying existing benchmarks to exclude the ground-truth answer from the option set, the authors evaluate models across three settings (MCQ with added 'None of the Above', open-ended generation with explicit detection instructions, and standard evaluation) and report that models overwhelmingly select plausible distractors rather than indicating absence. The failure is reported as more severe on temporal-reasoning tasks and with denser frame sampling; chain-of-thought prompting improves rates but leaves performance unsatisfactory.
Significance. If the central empirical patterns are confirmed, the work usefully documents a systematic reliability gap in current MLLMs when required to recognize insufficient information in video QA. The breadth of models and benchmarks tested yields consistent observations without reliance on fitted parameters or derivations. The inclusion of a mitigation experiment provides a concrete baseline, though the paper itself notes its limitations.
major comments (1)
- §3 (Benchmark construction and question modification): The paper states that the correct answer is deliberately excluded while retaining distractors (or adding 'None of the Above'), yet reports no human verification, inter-annotator agreement, or error analysis confirming that none of the remaining options is valid given the video content. This is load-bearing for the central claim; without it, especially on temporal-reasoning items where multiple interpretations are possible, the reported 'failure to detect absence' rates could be confounded by cases where a selected distractor is actually acceptable.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on benchmark construction. We address the concern point-by-point below and outline planned revisions.
Point-by-point responses
- Referee: §3 (Benchmark construction and question modification): The paper states that the correct answer is deliberately excluded while retaining distractors (or adding 'None of the Above'), yet reports no human verification, inter-annotator agreement, or error analysis confirming that none of the remaining options is valid given the video content. This is load-bearing for the central claim; without it, especially on temporal-reasoning items where multiple interpretations are possible, the reported 'failure to detect absence' rates could be confounded by cases where a selected distractor is actually acceptable.
Authors: We agree this is an important point. The benchmark modifications rely on the original human-annotated ground-truth answers provided by the source datasets (e.g., ActivityNet-QA, NExT-QA), removing only the designated correct option while retaining the distractors as-is. By construction of the original MCQ benchmarks, the distractors were not selected as correct by annotators. However, we acknowledge that temporal-reasoning questions can admit multiple plausible interpretations, and the absence of new human verification on the modified sets leaves open the possibility of confounding. We will revise §3 to explicitly state this reliance on original annotations, add a dedicated limitations paragraph discussing the potential for alternative valid interpretations in temporal tasks, and include a small-scale post-hoc error analysis (sampling 100 temporal items across two benchmarks for manual review by two authors) to quantify how often a retained distractor could reasonably be viewed as acceptable. These changes will be incorporated in the revision.
Revision: partial
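For the proposed two-reviewer error analysis, agreement between the reviewers could be summarized with Cohen's kappa, the kind of inter-annotator statistic the referee asks for. The sketch below is an assumption about how that might be computed; the rebuttal does not say which statistic would be reported, and the label values are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two reviewers labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each reviewer labelled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = distractor judged acceptable, 0 = judged invalid.
reviewer_1 = [1, 1, 0, 0]
reviewer_2 = [1, 0, 0, 0]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # 0.5 for these labels
```

Reporting kappa alongside the raw share of distractors judged acceptable would show both how large the confound is and how reliably it was measured.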
Circularity Check
Empirical evaluation study with no derivations or fitted predictions
Full rationale
This is a purely empirical diagnostic study measuring MLLM behavior on modified video QA benchmarks under absent-answer conditions. No equations, parameter fitting, uniqueness theorems, or ansatzes are present; results are reported as direct observations against external models and benchmarks. The central claim does not reduce to any self-referential construction or self-citation chain.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
[1] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jian Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191, 2024.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025.
[3] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215, 2025.
[4] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479, 2025.
[5] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv preprint arXiv:2508.18265, 2025.
[6] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey... arXiv, 2025.
[7] Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzh... MiMo-VL Technical Report. arXiv preprint arXiv:2506.03569, 2025.
[8] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24108–24118, 2025.
[9] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A Diagnostic Benchmark for very Long-Form Video Language Understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023.
[10] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A Benchmark for Long-Context Interleaved Video-Language Understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024.
[11] Haochun Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, and Ting Liu. LLMs May Perform MCQA by Selecting the Least Incorrect Option. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5852–5862, 2025.
[12] Gracjan Góral, Emilia Wiśnios, Piotr Sankowski, and Paweł Budzianowski. Wait, that’s not an option: LLMs Robustness with Incorrect Multiple-Choice Options. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1495–1515, 2025.
[13] Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Helen Li, Ziwei Liu, and Kiyoharu Aizawa. Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 6497–6540, 2025.
[14] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A Comprehensive Multi-Modal Video Understanding Benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024.
[15] Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv preprint arXiv:2501.13826, 2025.
[16] Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage: The illusion of visual understanding. arXiv preprint arXiv:2603.21687, 2026.
[17] Xueqing Yu, Bohan Li, Yan Li, and Zhenheng Yang. VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding. arXiv preprint arXiv:2603.07071, 2026.
[18] Eunseop Yoon, Hee Suk Yoon, Mark A Hasegawa-Johnson, and Chang D Yoo. Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2025.
[19] Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the Above, Less of the Right: Parallel Patterns in Human and LLM Performance on Multi-Choice Questions Answering. In Findings of the Association for Computational Linguistics, pages 20112–20134, 2025.
[20] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261, 2025.
[21] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631, 2025.
[22] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-Omni Technical Report. arXiv preprint arXiv:2509.17765, 2025.
[23] Qwen Team. Qwen3.5-Omni Technical Report. arXiv preprint arXiv:2604.15804, 2026.
A. Appendix
A.1. Prompt
Figure 3 illustrates the prompt templates used across all four evaluation settings. In the baseline setting (a), the model is presented with the original candidate set containing the ground-truth answer and asked to select an option directly. In the multi-c...