Recognition: unknown
Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Pith reviewed 2026-05-14 18:03 UTC · model grok-4.3
The pith
Omnimodal LLMs encode premise-perception mismatches in hidden states but almost never reject the conflicting claims in their outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hidden states in omnimodal LLMs reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs, revealing a representation-action gap that is modality-asymmetric and prompt-resistant.
What carries the argument
The Representation-Action Gap: internal hidden-state encoding of premise-perception conflict versus behavioral failure to reject the false premise during answer generation.
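One way to make the gap concrete: score each benchmark item with a hidden-state probe and with the model's actual output, then compare detection against behavior. The sketch below is a minimal illustration under assumed inputs (per-item probe scores, mismatch labels, and rejection labels); the variable names and the AUROC-minus-rate formulation are illustrative, not the paper's exact metric.

```python
# Minimal sketch: quantify a representation-action gap on a set of benchmark items.
# Assumed (not taken from the paper): for each item we have
#   probe_score - linear-probe score from hidden states (higher = mismatch encoded)
#   mismatch    - 1 if the textual premise actually conflicts with the clip
#   rejected    - 1 if the model's output rejected the false premise
import numpy as np
from sklearn.metrics import roc_auc_score

def representation_action_gap(probe_score, mismatch, rejected):
    probe_score = np.asarray(probe_score, dtype=float)
    mismatch = np.asarray(mismatch, dtype=int)
    rejected = np.asarray(rejected, dtype=int)

    # How well the hidden states separate misleading from standard premises.
    detection_auroc = roc_auc_score(mismatch, probe_score)

    # How often the model behaviorally rejects when the premise is false.
    rejection_rate = rejected[mismatch == 1].mean()

    # Large positive values: mismatches are encoded but rarely acted on.
    return detection_auroc - rejection_rate
```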
If this is right
- A probe-guided logit adjustment that re-injects the mismatch signal improves rejection rates across models (see the sketch after this list).
- Audio grounding lags behind vision in both detection and rejection.
- Seven prompt variants fail to close the gap, indicating the issue is not easily fixed by prompting alone.
- Models split into under-rejection (accepting false premises) and over-rejection (rejecting valid questions too) patterns.
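On the first point, here is a minimal sketch of what a probe-guided logit adjustment could look like at decoding time, assuming a pre-trained linear mismatch probe over hidden states and a set of rejection-oriented token ids. The additive-bias form, the alpha scale, and all names are illustrative assumptions, not the paper's PGLA.

```python
# Minimal sketch of a probe-guided logit adjustment (assumed form, not the paper's exact PGLA):
# boost tokens that begin rejection phrasings in proportion to the probe's mismatch score.
import torch

def probe_guided_logits(logits, hidden_state, probe_w, probe_b, rejection_token_ids, alpha=2.0):
    """
    logits:              (vocab_size,) next-token logits at the current decoding step
    hidden_state:        (d_model,) hidden state fed to the linear mismatch probe
    probe_w, probe_b:    parameters of a pre-trained linear probe (assumed available)
    rejection_token_ids: token ids that begin rejection phrasings (illustrative choice)
    alpha:               adjustment strength (hypothetical hyperparameter)
    """
    # Probability, under the probe, that the premise conflicts with perception.
    mismatch_p = torch.sigmoid(hidden_state @ probe_w + probe_b)

    adjusted = logits.clone()
    # Re-inject the encoded mismatch signal into decoding.
    adjusted[rejection_token_ids] += alpha * mismatch_p
    return adjusted
```

The design choice here is deliberately light-touch: the hidden states are left untouched and only the output distribution is nudged, which is what makes the intervention diagnostic of a translation failure rather than a perception failure.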
Where Pith is reading between the lines
- The bottleneck for omnimodal grounding lies in how encoded mismatches are translated into output decisions, not in perception itself.
- Similar gaps may appear in other agentic settings where internal state must control external actions.
- Interventions that directly use hidden-state probes could serve as lightweight safety checks for multimodal systems.
Load-bearing premise
That the IMAVB clips and questions cleanly separate grounding failures from confounds in clip choice, phrasing, or training data, and that probe detection of hidden-state signals accurately tracks what the model functionally knows.
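The probing half of that premise can be stress-tested in the usual way: train a linear probe on hidden states and compare it against the same probe trained on shuffled labels, so that high accuracy is not credited to probe capacity alone. A minimal sketch under assumed inputs (hidden-state matrix X, mismatch labels y); nothing here reproduces the paper's probing setup.

```python
# Minimal sketch: linear probe for premise-perception mismatch, with a shuffled-label control.
# X has shape (n_items, d_model), y is 1 for misleading-premise items; shapes are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_with_control(X, y, seed=0):
    rng = np.random.default_rng(seed)
    probe = LogisticRegression(max_iter=1000)

    # Cross-validated accuracy of the probe on the real labels.
    real_acc = cross_val_score(probe, X, y, cv=5).mean()

    # Control task: same probe, permuted labels. If this is also high,
    # probe capacity rather than the representation is doing the work.
    control_acc = cross_val_score(probe, X, rng.permutation(y), cv=5).mean()

    return real_acc, control_acc
```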
What would settle it
A direct, per-item comparison showing that a linear probe's hidden-state mismatch signal predicts the model's actual rejection behavior on the same misleading-premise questions no better than random guessing.
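A minimal sketch of how that per-item test could be run, assuming probe scores and rejection outcomes are recorded for the same misleading-premise questions; the permutation baseline is one reasonable operationalization of "random guessing", not the only one, and it assumes both rejected and non-rejected items are present.

```python
# Minimal sketch: does the hidden-state probe score predict which misleading items
# the model actually rejects, beyond chance? All inputs are assumed, not from the paper.
import numpy as np
from sklearn.metrics import roc_auc_score

def probe_predicts_rejection(probe_score, rejected, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    probe_score = np.asarray(probe_score, dtype=float)
    rejected = np.asarray(rejected, dtype=int)

    # Observed discriminative power of the probe score for rejection behavior.
    observed = roc_auc_score(rejected, probe_score)

    # Permutation baseline standing in for random guessing.
    null = [roc_auc_score(rng.permutation(rejected), probe_score) for _ in range(n_perm)]
    p_value = (np.sum(np.asarray(null) >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```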
Original abstract
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.
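The 2x2 design described above maps directly onto two behavioral rates per modality. Below is a minimal sketch of how under- and over-rejection could be scored, assuming each item records its target modality, premise condition, and whether the model's answer rejected the premise; the field names are illustrative, not IMAVB's actual schema.

```python
# Minimal sketch of scoring the 2x2 design (target modality x premise condition).
# The item fields are illustrative stand-ins, not IMAVB's actual annotation format.
from dataclasses import dataclass

@dataclass
class Item:
    modality: str   # "vision" or "audio"
    premise: str    # "standard" or "misleading"
    rejected: bool  # did the model's answer reject the premise?
    correct: bool   # answer correctness (relevant for standard items)

def rejection_rates(items, modality):
    subset = [it for it in items if it.modality == modality]
    misleading = [it for it in subset if it.premise == "misleading"]
    standard = [it for it in subset if it.premise == "standard"]

    # Under-rejection: failing to reject when the premise is false.
    under = 1 - sum(it.rejected for it in misleading) / len(misleading)
    # Over-rejection: rejecting questions whose premise is fine.
    over = sum(it.rejected for it in standard) / len(standard)
    return under, over
```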
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
Empirical benchmark study with no circular derivation steps
Full rationale
The paper introduces the IMAVB benchmark and reports empirical measurements of hidden-state probe accuracy versus output rejection rates across models. No equations, fitted parameters, or self-referential definitions are used; the Representation-Action Gap is presented as a direct observational result from applying probes to existing models on curated clips. The 2x2 design and PGLA intervention are experimental interventions, not derivations that reduce to their own inputs by construction. This is a standard empirical analysis with no load-bearing self-citations or ansatz smuggling.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Probe-based detection of hidden-state mismatches accurately reflects functional encoding of premise-perception conflicts.