OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination
Pith reviewed 2026-06-28 07:29 UTC · model grok-4.3
The pith
Long-video omni models misbind correct audio-visual evidence to wrong speakers or moments, as shown by strict-pair accuracy on paired counterfactual claims, and a frozen calibration using modality perturbations raises their scores without r
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the counterfactual event-binding protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06% and Qwen3-Omni-Instruct reaches 41.55%, versus 76.54% for a closed-source reference. Modality-Perturbation Reliability Calibration lifts Qwen2.5-Omni-7B to 36.22% and Qwen3 to 51.09% on the benchmark, and improves target-adapted MCQ accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51) with Qwen3.
What carries the argument
Counterfactual event-binding protocol that constructs matched supported and counterfactual claims from identical audio-visual event evidence and evaluates models by strict-pair accuracy to isolate misbinding.
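The difference between item-level scoring and strict-pair scoring can be illustrated with a minimal sketch. It assumes each pair yields two yes/no judgments (one for the supported claim, one for its counterfactual); the `PairResult` type and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Hypothetical record of a model's judgments on one claim pair."""
    accepts_supported: bool        # model accepted the supported claim
    accepts_counterfactual: bool   # model accepted the near-counterfactual


def item_accuracy(pairs):
    """Score each claim independently: accept supported, reject counterfactual."""
    correct = sum(p.accepts_supported + (not p.accepts_counterfactual) for p in pairs)
    return correct / (2 * len(pairs))


def strict_pair_accuracy(pairs):
    """A pair counts only if BOTH judgments are right."""
    correct = sum(p.accepts_supported and not p.accepts_counterfactual for p in pairs)
    return correct / len(pairs)


# A model that accepts every claim looks half-right per item
# but never binds a single pair correctly.
always_yes = [PairResult(True, True)] * 4
print(item_accuracy(always_yes))         # 0.5
print(strict_pair_accuracy(always_yes))  # 0.0
```

Under this reading, a model that accepts both members of a pair earns partial item-level credit but no strict-pair credit, which is the behavior the protocol is meant to expose.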
If this is right
- Strict-pair accuracy penalizes acceptance of both a claim and its near-counterfactual even when local evidence is correct.
- The calibration operates on a frozen backbone by selecting audio-negative probes inside video-level folds and combining response shifts with native confidence.
- Gains from the calibration appear on the main benchmark and transfer to adapted MCQ versions of OmniVideoBench and WorldSense.
- The method narrows but does not close the gap between open-weight and closed-source systems on binding tasks.
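The calibration idea in the bullets above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the summary does not give the fusion rule, so the logistic form, the weights `w_native`, `w_shift`, and `bias`, and both function names are hypothetical. It also assumes that a confidence drop under an audio-negative probe indicates the claim depended on the real audio.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def video_level_folds(video_ids, k):
    """Assign each distinct video to one of k folds.

    All claims from the same video share a fold, so probe selection
    done on one fold never sees claims from a held-out video.
    """
    unique = sorted(set(video_ids))
    return {vid: i % k for i, vid in enumerate(unique)}


def support_estimate(native_conf, probe_conf, w_native=1.0, w_shift=2.0, bias=0.0):
    """Fuse native confidence with the response shift under a probe.

    native_conf: model's acceptance probability with the real audio-visual input.
    probe_conf:  acceptance probability with an audio-negative probe swapped in.
    The weights are placeholders (assumption); the backbone itself stays frozen.
    """
    shift = native_conf - probe_conf
    return sigmoid(bias + w_native * logit(native_conf) + w_shift * shift)


folds = video_level_folds(["a", "b", "a", "c"], k=2)
unchanged = support_estimate(0.8, 0.8)   # probe does not move the model
sensitive = support_estimate(0.8, 0.2)   # confidence collapses without real audio
```

In this sketch a claim whose confidence is unaffected by the probe keeps its native score, while a claim that is sensitive to the audio swap receives a higher support estimate.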
Where Pith is reading between the lines
- The same perturbation-based reliability estimation could be tested on binding problems in other multimodal settings such as audio-only or image-text tasks.
- Extending the protocol to videos longer than the current average of 24 minutes might reveal additional binding failures not captured here.
- The persistent gap to closed-source performance suggests that explicit binding objectives during pretraining or fine-tuning may be needed beyond post-hoc calibration.
Load-bearing premise
Paired supported and counterfactual claims can be constructed from the same evidence so that strict-pair accuracy measures only binding mistakes rather than artifacts introduced during claim creation.
What would settle it
Two outcomes would settle it. A new model scoring above 70 percent strict-pair accuracy on the 3,600-item benchmark while showing comparable item-level performance would undercut the reported weakness. The calibration producing no measurable lift on the benchmark or on the two transfer tasks would undercut the reported improvement.
Original abstract
Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These almost-true errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from the same audio-visual event evidence and evaluates them by strict-pair accuracy. We instantiate it as OmniHalluc-L, a benchmark for long-video Omni hallucination, with 3,600 single-claim QA items from 638 long-form videos averaging 24.16 minutes and covering 256.87 hours. Under this protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06% and Qwen3-Omni-Instruct reaches 41.55%, versus 76.54% for a closed-source reference. To narrow this gap without updating the backbone, we propose Modality-Perturbation Reliability Calibration, a frozen-backbone framework that selects audio-negative probes within video-level folds and fuses their response shifts with native audio-visual confidence into per-claim support estimates. The calibration lifts Qwen2.5-Omni-7B to 36.22% and Qwen3 to 51.09% on OmniHalluc-L, and improves target-adapted MCQ accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51) with Qwen3.
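As a quick consistency check on the dataset figures quoted in the abstract, the stated total duration and video count reproduce the stated average length:

```latex
\[
\frac{256.87\ \text{h} \times 60\ \text{min/h}}{638\ \text{videos}}
  = \frac{15412.2\ \text{min}}{638}
  \approx 24.16\ \text{min per video}
\]
```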
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniHalluc-L, a benchmark for long-form omni hallucination. It uses a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from identical audio-visual evidence and scores them by strict-pair accuracy to isolate misbinding errors. On 3,600 items from 638 videos (avg. 24.16 min), it reports low pair-level performance for open-weight models (Qwen2.5-Omni-7B at 32.06%, Qwen3-Omni-Instruct at 41.55%) versus 76.54% for a closed-source reference. It also proposes Modality-Perturbation Reliability Calibration, which fuses audio-negative probe shifts with native confidence to lift scores to 36.22% and 51.09% without backbone updates, plus gains on OmniVideoBench (+2.20) and WorldSense (+1.51).
Significance. If the protocol successfully isolates binding errors, the benchmark fills a gap in multimodal hallucination evaluation by targeting almost-true misbindings that standard item-level QA misses; the calibration method offers a practical, frozen-backbone improvement with cross-benchmark transfer.
major comments (2)
- [Abstract] The claim that strict-pair accuracy isolates misbinding without confounding factors from the claim construction process (the weakest assumption from the reader's side) lacks any description of validation steps, inter-annotator agreement, or controls for pair construction; this is load-bearing for interpreting the reported accuracy gaps and lifts.
- [Abstract] No details are supplied on selection criteria for audio-negative probes within video-level folds or on error analysis of the 3,600 items, preventing assessment of whether the gains of +4.16 and +9.54 percentage points under the calibration are supported.
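The gains cited in the second comment follow directly from the scores in the abstract, and the same figures give the gap that remains to the closed-source reference (all values in percentage points of strict-pair accuracy):

```latex
\begin{align*}
\text{Qwen2.5-Omni-7B gain} &= 36.22 - 32.06 = 4.16 \\
\text{Qwen3-Omni-Instruct gain} &= 51.09 - 41.55 = 9.54 \\
\text{Remaining gap (Qwen2.5-Omni-7B)} &= 76.54 - 36.22 = 40.32 \\
\text{Remaining gap (Qwen3-Omni-Instruct)} &= 76.54 - 51.09 = 25.45
\end{align*}
```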
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency in our benchmark construction and calibration details. We address each major comment below.
- Referee: [Abstract] The claim that strict-pair accuracy isolates misbinding without confounding factors from the claim construction process (the weakest assumption from the reader's side) lacks any description of validation steps, inter-annotator agreement, or controls for pair construction; this is load-bearing for interpreting the reported accuracy gaps and lifts.
Authors: We agree that the abstract provides insufficient detail on validation steps, inter-annotator agreement, and controls for pair construction, which are necessary to substantiate that strict-pair accuracy isolates misbinding errors. The full manuscript (Section 3) outlines the counterfactual event-binding protocol but does not explicitly report these elements. In revision we will add a dedicated validation subsection reporting inter-annotator agreement statistics, controls for pair construction (e.g., evidence-matching checks and counterfactual plausibility filters), and any sensitivity analyses performed.
Revision: yes
- Referee: [Abstract] No details are supplied on selection criteria for audio-negative probes within video-level folds or on error analysis of the 3,600 items, preventing assessment of whether the gains of +4.16 and +9.54 percentage points under the calibration are supported.
Authors: We concur that the abstract omits selection criteria for audio-negative probes and error analysis of the 3,600 items, limiting evaluation of the reported gains under Modality-Perturbation Reliability Calibration. Section 4 describes probe selection within video-level folds and the fusion procedure, yet lacks explicit criteria and item-level error breakdowns. We will revise to include the precise selection criteria (e.g., temporal and semantic mismatch thresholds) and a summary error analysis of the dataset to better justify the observed improvements.
Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper's core contributions are an empirical benchmark protocol (counterfactual event-binding with strict-pair accuracy on paired supported/counterfactual claims) and a calibration method (Modality-Perturbation Reliability Calibration, which fuses response shifts from audio-negative probes with native confidence). Both are presented as constructed and evaluated on external video data, with performance gains reported as measured outcomes on OmniHalluc-L and the transfer benchmarks (OmniVideoBench, WorldSense). No equations, self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or high-level description that would reduce any claimed result to its inputs by construction. The protocol's isolation of binding errors is an assumption about data construction, not a circular derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Counterfactual claims constructed from the same audio-visual evidence correctly test misbinding without introducing extraneous errors.