Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
Pith reviewed 2026-05-12 03:52 UTC · model grok-4.3
The pith
Large multimodal models accurately perceive video content but systematically fail to use it for causal reasoning, with stronger post-training increasing reliance on textual shortcuts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models faithfully perceive video content yet systematically underexploit it during causal reasoning. Stronger post-training amplifies rather than mitigates textual prior reliance, and higher baseline performance correlates with greater fragility under perturbation. ProCauEval decomposes these contributions through five controlled configurations that manipulate visual and textual modalities independently while keeping the underlying causal structure fixed. ADPO augments reinforcement learning by maximizing divergence between policy distributions on original inputs and on visually corrupted counterfactuals, forcing greater grounding in visual evidence.
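The divergence objective as summarized admits a simple formalization. The sketch below is one plausible reading rather than the paper's exact formula: it assumes a KL divergence, a scalar trade-off weight lambda, and that the "counterfactual teacher" is the policy itself evaluated on the corrupted input with gradients stopped (written pi with a barred theta).

```latex
% One plausible form of the ADPO objective (assumed notation):
%   \pi_\theta : policy,   v : video,   \tilde{v} : visually corrupted video,
%   q : textual query,     \lambda : assumed trade-off weight,
%   \pi_{\bar\theta} : stop-gradient "teacher" copy of the policy.
\[
\mathcal{J}_{\mathrm{ADPO}}(\theta)
  \;=\; \mathcal{J}_{\mathrm{GRPO}}(\theta)
  \;+\; \lambda \,
  \mathbb{E}_{(v,q)}\!\left[
    D_{\mathrm{KL}}\!\big(
      \pi_\theta(\cdot \mid v, q)
      \,\big\|\,
      \pi_{\bar\theta}(\cdot \mid \tilde{v}, q)
    \big)
  \right]
\]
```

Maximizing the second term pushes the policy's response distribution on intact visual input away from its prior-only behaviour on the corrupted counterpart, which is what forces grounding in visual evidence.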
What carries the argument
ProCauEval, a perturbation protocol using five configurations that independently alter visual and textual inputs to isolate each modality's contribution to causal discovery answers.
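The five configurations themselves are not enumerated in this summary, but the protocol's logic is simple to sketch: score the same items under input variants that differ only in which modality is intact, with the ground-truth causal answer held fixed, and read modality reliance off the accuracy deltas. Everything below (configuration names, the `corrupt_video` and `model.answer` interfaces) is an illustrative assumption rather than the paper's API, and only three of the five configurations are shown.

```python
from statistics import mean

def modality_profile(model, dataset, corrupt_video):
    """Accuracy under three assumed configurations; the ground-truth causal
    answer for each item stays fixed across all configurations."""
    configs = {
        "original":         lambda ex: (ex["video"], ex["text"]),
        "visual_corrupted": lambda ex: (corrupt_video(ex["video"]), ex["text"]),
        "text_only":        lambda ex: (None, ex["text"]),
    }
    acc = {}
    for name, make_inputs in configs.items():
        hits = []
        for ex in dataset:
            video, text = make_inputs(ex)
            hits.append(model.answer(video, text, ex["question"]) == ex["answer"])
        acc[name] = mean(hits)
    # Signature of textual-shortcut reliance: a small drop under visual corruption
    # combined with high accuracy when the video is withheld entirely.
    return acc, acc["original"] - acc["visual_corrupted"], acc["text_only"]
```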
If this is right
- Models with stronger baseline performance on causal tasks show larger drops when visual information is perturbed, indicating greater dependence on textual patterns.
- Post-training that improves general video description tends to widen the gap between perception and causal engagement.
- ADPO raises visual engagement on causal tasks while preserving performance on standard comprehension benchmarks.
- The gap between perception accuracy and causal use persists across open-source and proprietary LMMs of varying sizes.
Where Pith is reading between the lines
- Current alignment and scaling practices may reward fluent text-based answers more than actual cross-modal grounding, creating an incentive for shortcut behavior.
- Benchmarks that only score final accuracy will continue to mask this deficit unless they incorporate modality-isolation tests like those in ProCauEval.
- ADPO-style divergence training could be applied to other tasks where models can answer correctly from language priors alone, such as visual question answering with strong textual biases.
Load-bearing premise
The five controlled configurations in ProCauEval cleanly separate the effects of visual and textual information on causal judgments without creating new biases or altering the true causal relations.
What would settle it
Run the same causal question on a video and on a visually corrupted copy of that video while the text stays fixed; if the model's causal answer changes substantially, the model is in fact drawing on visual evidence, which would contradict the claim of systematic visual under-exploitation.
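A minimal version of that test, assuming a paired evaluation over the benchmark items (the `model.answer` and `corrupt_video` interfaces are hypothetical, as in the sketch above):

```python
def answer_flip_rate(model, dataset, corrupt_video):
    """Fraction of items whose causal answer changes when only the visual stream
    is corrupted (text and question held fixed). A low flip rate is consistent
    with visual under-exploitation; a high one would cut against it."""
    flips = 0
    for ex in dataset:
        original = model.answer(ex["video"], ex["text"], ex["question"])
        corrupted = model.answer(corrupt_video(ex["video"]), ex["text"], ex["question"])
        flips += int(original != corrupted)
    return flips / len(dataset)
```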
Original abstract
Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProCauEval, a perturbation-based evaluation protocol using five controlled configurations that systematically manipulate visual and textual modalities in video-based causal discovery tasks. Evaluating 17 LMMs, it claims that models accurately perceive video content but systematically under-exploit visual information during causal reasoning, with stronger post-training amplifying textual prior reliance and higher baseline performance correlating with greater fragility. To mitigate this, the authors propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning method that augments GRPO by maximizing divergence between policy distributions on original and visually corrupted inputs to force visual grounding.
Significance. If the perturbations cleanly isolate modality contributions without altering causal structure or introducing new biases, the diagnostic shift from outcome accuracy to mechanism dissection would be a useful advance for understanding LMM limitations in causal tasks. The counter-intuitive finding that post-training exacerbates the deficit, if substantiated with controls, challenges prevailing views on alignment benefits. The ADPO framework offers a concrete, divergence-based training signal that could generalize beyond this benchmark.
major comments (2)
- [ProCauEval description (Methods)] The claim that the five configurations 'systematically manipulate visual and textual modalities to decompose their respective contributions' and that models 'faithfully perceive video content yet systematically underexploit it' requires that perturbations (visual corruption, text-only, etc.) preserve identical causal graphs, event timings, object relations, and ground-truth answers. No verification, such as human/expert labels confirming causal invariance or absence of new artifacts, is described (a minimal audit of this kind is sketched after these comments). Without this, drops in causal accuracy under visual corruption may reflect missing information rather than under-exploitation, undermining the central 'perception without engagement' interpretation.
- [Experimental results and ADPO] The abstract reports results across 17 models and claims ADPO improvements, yet provides no quantitative details, error bars, baseline comparisons (e.g., to GRPO), or ablation studies on the divergence term. The claim that 'ADPO improves visual engagement without sacrificing fundamental comprehension' cannot be assessed for robustness or effect size without these, making it impossible to evaluate whether the method addresses the diagnosed deficit or merely trades one bias for another.
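On the first comment, a minimal audit of the required invariance could look like the sketch below, assuming expert labels can be collected per configuration (the `expert_label` interface and the configuration structure are hypothetical, mirroring the evaluation sketch earlier on this page):

```python
def invariance_violations(dataset, configs, expert_label):
    """Items whose expert-assigned causal answer is not identical across all
    configurations; any hit means a perturbation changed the ground truth."""
    violations = []
    for ex in dataset:
        labels = {name: expert_label(*make_inputs(ex), ex["question"])
                  for name, make_inputs in configs.items()}
        if len(set(labels.values())) > 1:
            violations.append((ex["id"], labels))
    return violations
```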
minor comments (1)
- [ADPO formulation] The notation for the 'prior-only counterfactual teacher induced by visual corruption' would benefit from an explicit equation or pseudocode definition to clarify how the teacher distribution is constructed and how the divergence objective is implemented in the RL update.
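As a non-authoritative illustration of what such a definition could look like, one reading of "negative teacher alignment" is a stop-gradient copy of the policy conditioned on the corrupted input, with a token-level divergence added to the GRPO loss. Function and tensor names, the KL direction, and the weight `lam` below are assumptions, consistent with the divergence form sketched under "Core claim" above.

```python
import torch
import torch.nn.functional as F

def adpo_loss(policy, batch, grpo_loss_fn, lam=0.1):
    """Hedged sketch of an ADPO-style update: the usual GRPO loss, minus a
    divergence bonus that pushes the policy away from its own prior-only
    behaviour on a visually corrupted copy of the input."""
    # Standard GRPO term computed on the original (video, question) inputs.
    loss = grpo_loss_fn(policy, batch["video"], batch["question"], batch["responses"])

    # "Prior-only counterfactual teacher": the same policy, detached, conditioned
    # on the corrupted video (how the corruption is built is assumed, not given here).
    logits_orig = policy(batch["video"], batch["question"])            # [B, T, vocab]
    with torch.no_grad():
        logits_teacher = policy(batch["corrupted_video"], batch["question"])

    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_teacher, dim=-1)
    # Token-averaged KL( pi(.|original) || teacher(.|corrupted) ).
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()

    # Maximizing the divergence = subtracting it from the loss being minimized.
    return loss - lam * kl
```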
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address the two major comments point by point below, providing clarifications on the ProCauEval design and the experimental reporting for ADPO. We are committed to incorporating revisions where appropriate to strengthen the paper.
Point-by-point responses
Referee: The claim that the five configurations 'systematically manipulate visual and textual modalities to decompose their respective contributions' and that models 'faithfully perceive video content yet systematically underexploit it' requires that perturbations (visual corruption, text-only, etc.) preserve identical causal graphs, event timings, object relations, and ground-truth answers. No verification—such as human/expert labels confirming causal invariance or absence of new artifacts—is described. Without this, drops in causal accuracy under visual corruption may reflect missing information rather than under-exploitation, undermining the central 'perception without engagement' interpretation.
Authors: We agree that explicit empirical verification of causal invariance would strengthen the claims. The perturbations in ProCauEval are designed to preserve the causal structure: visual corruptions such as Gaussian blurring or random masking affect visual fidelity but do not alter the sequence of events, object interactions, or temporal relations in the video, as the underlying video content remains the same. Text-only and other configurations use the original textual descriptions. However, we did not include human validation studies in the submitted manuscript. In the revision, we will add a new subsection detailing the perturbation construction process and report results from expert annotations confirming that ground-truth causal answers remain identical across all five configurations. This will directly support the interpretation that performance drops indicate under-exploitation rather than information loss. revision: yes
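For concreteness, corruptions of the kind mentioned in this response could look roughly like the following; kernel size, sigma, masking ratio, and the [T, C, H, W] frame layout are all assumptions, since the paper's actual parameters are not given here. The property being argued for is that neither operation reorders events or changes the annotated causal answer.

```python
import torch
import torch.nn.functional as F

def gaussian_blur_frames(frames, kernel_size=11, sigma=3.0):
    """Blur every frame; `frames` is assumed to be a [T, C, H, W] float tensor."""
    coords = torch.arange(kernel_size, dtype=frames.dtype) - kernel_size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    # Depthwise 2D Gaussian kernel, one copy per channel.
    kernel = (g[:, None] * g[None, :]).expand(
        frames.shape[1], 1, kernel_size, kernel_size).contiguous()
    return F.conv2d(frames, kernel, padding=kernel_size // 2, groups=frames.shape[1])

def random_mask_frames(frames, mask_ratio=0.5, patch=16):
    """Zero out a random subset of patch-sized squares in every frame.
    Assumes H and W are divisible by `patch`."""
    t, c, h, w = frames.shape
    keep = (torch.rand(t, 1, h // patch, w // patch) > mask_ratio).to(frames.dtype)
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return frames * keep
```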
Referee: The abstract reports results across 17 models and claims ADPO improvements, yet provides no quantitative details, error bars, baseline comparisons (e.g., to GRPO), or ablation studies on the divergence term. The claim that 'ADPO improves visual engagement without sacrificing fundamental comprehension' cannot be assessed for robustness or effect size without these, making it impossible to evaluate whether the method addresses the diagnosed deficit or merely trades one bias for another.
Authors: The abstract is intentionally concise and does not include specific numbers, which is standard practice. The full paper presents comprehensive results in the experimental section, including performance metrics for all 17 LMMs, direct comparisons against GRPO, standard error bars from repeated evaluations, and ablation studies that isolate the effect of the divergence term in ADPO. These demonstrate that ADPO enhances reliance on visual inputs as measured by our diagnostic metrics while preserving or improving accuracy on unperturbed tasks. If the editor prefers, we can include a few key quantitative results in the abstract during revision. revision: partial
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper defines ProCauEval as an independent perturbation protocol with five modality configurations and introduces ADPO as a new RL objective that explicitly maximizes distributional divergence between original and corrupted inputs. Both the benchmark and the training objective are constructed from first principles without reducing to fitted parameters or prior results by definition. Empirical findings on 17 LMMs consist of direct accuracy measurements on separate perception and causal tasks rather than tautological outputs; no self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises. The derivation remains self-contained against external benchmarks.