pith. machine review for the scientific record.

arxiv: 2605.09422 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.CV

Recognition: no theorem link

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:52 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords large multimodal models · causal discovery · textual priors · visual engagement · perturbation evaluation · reinforcement learning · counterfactual training

The pith

Large multimodal models accurately perceive video content but systematically fail to use it for causal reasoning, with stronger post-training increasing reliance on textual shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LMMs can describe what happens in videos but do not engage with those visual details when reasoning about cause and effect, instead defaulting to patterns learned from text alone. It introduces ProCauEval, which runs the same causal questions under five input setups that turn visual or textual information on and off in controlled ways. Across 17 models, this reveals faithful perception paired with under-use of vision, and shows that models with higher baseline scores become more fragile when visuals are altered. The authors then present ADPO, a training method that discourages text-only answering by pushing the model's predictions away from those it produces when the visuals have been corrupted.

Core claim

Models faithfully perceive video content yet systematically underexploit it during causal reasoning. Stronger post-training amplifies rather than mitigates textual prior reliance, and higher baseline performance correlates with greater fragility under perturbation. ProCauEval decomposes these contributions through five controlled configurations that manipulate visual and textual modalities independently while keeping the underlying causal structure fixed. ADPO augments reinforcement learning by maximizing divergence between policy distributions on original inputs and on visually corrupted counterfactuals, forcing greater grounding in visual evidence.

What carries the argument

ProCauEval, a perturbation protocol using five configurations that independently alter visual and textual inputs to isolate each modality's contribution to causal discovery answers.
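
To make the protocol concrete, here is a minimal sketch of what a modality-isolation harness of this kind could look like. Only the visual-corruption and text-only setups are named on this page (the simulated rebuttal mentions Gaussian blurring and random masking), so the remaining configuration labels, the CausalProbe fields, and the model interface below are illustrative assumptions rather than the paper's actual design.

```python
# Hypothetical ProCauEval-style harness: same causal question, five input setups.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class CausalProbe:
    video: Optional[str]   # video reference, or None when vision is withheld
    question: str          # causal question, held fixed across configurations
    answer: str            # ground-truth answer, assumed invariant to perturbation

def corrupt_video(video: str) -> str:
    # Stand-in for a fidelity-degrading corruption (e.g. Gaussian blur, random
    # masking); a real harness would re-encode frames rather than tag the path.
    return f"{video}#corrupted"

def shuffle_frames(video: str) -> str:
    # Stand-in for a temporal-order perturbation (hypothetical configuration).
    return f"{video}#shuffled"

def build_configurations(probe: CausalProbe) -> dict[str, CausalProbe]:
    # Five controlled setups that toggle or degrade each modality while keeping
    # the question and ground truth fixed. Names other than "vision_corrupt"
    # and "text_only" are placeholders, not the paper's configurations.
    return {
        "original":        probe,
        "vision_corrupt":  CausalProbe(corrupt_video(probe.video), probe.question, probe.answer),
        "text_only":       CausalProbe(None, probe.question, probe.answer),
        "frames_shuffled": CausalProbe(shuffle_frames(probe.video), probe.question, probe.answer),
        "video_swapped":   CausalProbe("unrelated.mp4", probe.question, probe.answer),
    }

def evaluate(model: Callable[[Optional[str], str], str],
             probes: list[CausalProbe]) -> dict[str, float]:
    # Per-configuration accuracy: a model that scores the same with and without
    # usable visuals is answering from textual priors rather than the video.
    scores: dict[str, float] = {}
    for name in build_configurations(probes[0]):
        correct = 0
        for probe in probes:
            cfg = build_configurations(probe)[name]
            if model(cfg.video, cfg.question).strip().lower() == cfg.answer.strip().lower():
                correct += 1
        scores[name] = correct / len(probes)
    return scores
```

The decomposition the paper describes would then come from comparing these per-configuration scores, e.g. the gap between "original" and "vision_corrupt" as a proxy for how much the model actually leans on the visuals.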

If this is right

  • Models with stronger baseline performance on causal tasks show larger drops when visual information is perturbed, indicating greater dependence on textual patterns.
  • Post-training that improves general video description tends to widen the gap between perception and causal engagement.
  • ADPO raises visual engagement on causal tasks while preserving performance on standard comprehension benchmarks.
  • The gap between perception accuracy and causal use persists across open-source and proprietary LMMs of varying sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current alignment and scaling practices may reward fluent text-based answers more than actual cross-modal grounding, creating an incentive for shortcut behavior.
  • Benchmarks that only score final accuracy will continue to mask this deficit unless they incorporate modality-isolation tests like those in ProCauEval.
  • ADPO-style divergence training could be applied to other tasks where models can answer correctly from language priors alone, such as visual question answering with strong textual biases.

Load-bearing premise

The five controlled configurations in ProCauEval cleanly separate the effects of visual and textual information on causal judgments without creating new biases or altering the true causal relations.

What would settle it

Run the same causal question on a video and on the identical video with its visual content corrupted while text stays fixed; if the model's causal answer changes substantially, that would contradict the claim of systematic visual under-exploitation.
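
A small sketch of that settling test, reusing the hypothetical CausalProbe, corrupt_video, and model interface from the harness sketch earlier on this page; it simply measures how often the causal answer flips when only the visuals change.

```python
# Hold text fixed, corrupt only the visuals, and count answer changes.
from typing import Callable, Optional

def answer_flip_rate(model: Callable[[Optional[str], str], str],
                     probes: list["CausalProbe"]) -> float:
    flips = 0
    for p in probes:
        original = model(p.video, p.question).strip().lower()
        corrupted = model(corrupt_video(p.video), p.question).strip().lower()
        flips += original != corrupted
    return flips / len(probes)

# A flip rate near zero is consistent with the paper's claim (answers driven by
# textual priors); a high flip rate would contradict systematic under-exploitation.
```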

Figures

Figures reproduced from arXiv: 2605.09422 by Baoqi Ren, Bing Qin, Jiafeng Liang, Ming Liu, Runxuan Liu, See-kiong Ng, Shixin Jiang, Tao Ren, Zhihao Zhu, Zihan Zhang.

Figure 1
Figure 1: Correct and shortcut output likelihood distribution gap between GRPO and our proposed ADPO. Building on this dissection, we further propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework that directly targets the prior-dependent failure mode exposed by PROCAUEVAL. In contrast to conventional distillation [12], which pulls a student toward a teacher, ADPO pushes the policy… view at source ↗
Figure 2
Figure 2: Overview of PROCAUEVAL framework covering data elements and configurations. 3.2.2 Evaluation Configurations: To comprehensively assess the model's causal discovery capability, underlying mechanisms, and potential failure modes, we establish five evaluation configurations… view at source ↗
Figure 3
Figure 3: (a) Selection trends of the model toward correct and shortcut answers during GRPO training. view at source ↗
Figure 4
Figure 4: Discriminative capability between correct and shortcut answers measured by log-likelihood gap (left) and attention weights on visual tokens under different methods (right). Complementary Evidence of Visual Engagement. view at source ↗
Figure 5
Figure 5: Prompt for GPT Score. Gradient Decomposition: taking the gradient of $J_{\mathrm{ADPO}}(\theta)$ with respect to $\theta$ yields
$$\nabla_\theta J_{\mathrm{ADPO}} = \underbrace{\mathbb{E}\Big[\sum_{i,t} A_i\,\nabla_\theta \log \pi_\theta(o_{i,t} \mid V, q, o_{i,<t})\Big]}_{\nabla_\theta J_{\mathrm{GRPO}}} + \lambda\,\underbrace{\nabla_\theta\,\mathbb{E}\Big[D_{\mathrm{KL}}\big(\pi_\theta^{(i,t)} \,\big\|\, \mathrm{sg}\big(\pi_\theta^{\mathrm{pert},(i,t)}\big)\big)\Big]}_{\nabla_\theta \mathcal{L}_{\mathrm{prcp}}}. \quad (10)$$
The stop-gradient operator sg(·) ensures that the regularizer's gradient flows only through the visually-grounded distribution $\pi_\theta(\cdot \mid V, q, o_{i,<t})$, explicitly pu… view at source ↗
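
Reading Equation (10) together with the abstract, the ADPO objective is the GRPO advantage-weighted log-likelihood term plus λ times a KL divergence between the policy on the original input and a stop-gradient copy of the policy on the visually corrupted input, with that divergence maximized. The PyTorch-style sketch below is only an illustration of that reading; the tensor shapes, per-token broadcasting of the advantages, and the default λ are assumptions, not the authors' implementation.

```python
# Minimal sketch of an ADPO-style loss, as read from Eq. (10) and the abstract.
import torch
import torch.nn.functional as F

def adpo_loss(logits_orig: torch.Tensor,   # (tokens, vocab) policy logits on (V, q, o_<t)
              logits_pert: torch.Tensor,   # (tokens, vocab) policy logits on corrupted visuals
              token_ids: torch.Tensor,     # (tokens,) sampled output tokens o_t (long dtype)
              advantages: torch.Tensor,    # (tokens,) group-relative advantages A_i, broadcast per token
              lam: float = 0.1) -> torch.Tensor:   # lambda is a placeholder value
    log_probs = F.log_softmax(logits_orig, dim=-1)

    # GRPO term: advantage-weighted log-likelihood of the sampled tokens.
    token_logp = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    grpo_objective = (advantages * token_logp).mean()

    # Anti-distillation term: KL(pi_theta || sg(pi_theta^pert)). The detach plays
    # the role of sg(.), so gradients flow only through the visually grounded
    # policy; the term is *added* to the objective, pushing the policy away from
    # the prior-only counterfactual teacher.
    log_probs_pert = F.log_softmax(logits_pert, dim=-1).detach()
    kl = (log_probs.exp() * (log_probs - log_probs_pert)).sum(-1).mean()

    # Maximize J_GRPO + lam * KL  <=>  minimize the negation.
    return -(grpo_objective + lam * kl)
```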
read the original abstract

Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ProCauEval, a perturbation-based evaluation protocol using five controlled configurations that systematically manipulate visual and textual modalities in video-based causal discovery tasks. Evaluating 17 LMMs, it claims that models accurately perceive video content but systematically under-exploit visual information during causal reasoning, with stronger post-training amplifying textual prior reliance and higher baseline performance correlating with greater fragility. To mitigate this, the authors propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning method that augments GRPO by maximizing divergence between policy distributions on original and visually corrupted inputs to force visual grounding.

Significance. If the perturbations cleanly isolate modality contributions without altering causal structure or introducing new biases, the diagnostic shift from outcome accuracy to mechanism dissection would be a useful advance for understanding LMM limitations in causal tasks. The counter-intuitive finding that post-training exacerbates the deficit, if substantiated with controls, challenges prevailing views on alignment benefits. The ADPO framework offers a concrete, divergence-based training signal that could generalize beyond this benchmark.

major comments (2)
  1. [ProCauEval description (Methods)] ProCauEval description (Methods): The claim that the five configurations 'systematically manipulate visual and textual modalities to decompose their respective contributions' and that models 'faithfully perceive video content yet systematically underexploit it' requires that perturbations (visual corruption, text-only, etc.) preserve identical causal graphs, event timings, object relations, and ground-truth answers. No verification—such as human/expert labels confirming causal invariance or absence of new artifacts—is described. Without this, drops in causal accuracy under visual corruption may reflect missing information rather than under-exploitation, undermining the central 'perception without engagement' interpretation.
  2. [Experimental results and ADPO] Experimental results and ADPO section: The abstract reports results across 17 models and claims ADPO improvements, yet provides no quantitative details, error bars, baseline comparisons (e.g., to GRPO), or ablation studies on the divergence term. The claim that 'ADPO improves visual engagement without sacrificing fundamental comprehension' cannot be assessed for robustness or effect size without these, making it impossible to evaluate whether the method addresses the diagnosed deficit or merely trades one bias for another.
minor comments (1)
  1. [ADPO formulation] The notation for the 'prior-only counterfactual teacher induced by visual corruption' would benefit from an explicit equation or pseudocode definition to clarify how the teacher distribution is constructed and how the divergence objective is implemented in the RL update.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address the two major comments point by point below, providing clarifications on the ProCauEval design and the experimental reporting for ADPO. We are committed to incorporating revisions where appropriate to strengthen the paper.

read point-by-point responses
  1. Referee: The claim that the five configurations 'systematically manipulate visual and textual modalities to decompose their respective contributions' and that models 'faithfully perceive video content yet systematically underexploit it' requires that perturbations (visual corruption, text-only, etc.) preserve identical causal graphs, event timings, object relations, and ground-truth answers. No verification—such as human/expert labels confirming causal invariance or absence of new artifacts—is described. Without this, drops in causal accuracy under visual corruption may reflect missing information rather than under-exploitation, undermining the central 'perception without engagement' interpretation.

    Authors: We agree that explicit empirical verification of causal invariance would strengthen the claims. The perturbations in ProCauEval are designed to preserve the causal structure: visual corruptions such as Gaussian blurring or random masking affect visual fidelity but do not alter the sequence of events, object interactions, or temporal relations in the video, as the underlying video content remains the same. Text-only and other configurations use the original textual descriptions. However, we did not include human validation studies in the submitted manuscript. In the revision, we will add a new subsection detailing the perturbation construction process and report results from expert annotations confirming that ground-truth causal answers remain identical across all five configurations. This will directly support the interpretation that performance drops indicate under-exploitation rather than information loss. revision: yes

  2. Referee: The abstract reports results across 17 models and claims ADPO improvements, yet provides no quantitative details, error bars, baseline comparisons (e.g., to GRPO), or ablation studies on the divergence term. The claim that 'ADPO improves visual engagement without sacrificing fundamental comprehension' cannot be assessed for robustness or effect size without these, making it impossible to evaluate whether the method addresses the diagnosed deficit or merely trades one bias for another.

    Authors: The abstract is intentionally concise and does not include specific numbers, which is standard practice. The full paper presents comprehensive results in the experimental section, including performance metrics for all 17 LMMs, direct comparisons against GRPO, standard error bars from repeated evaluations, and ablation studies that isolate the effect of the divergence term in ADPO. These demonstrate that ADPO enhances reliance on visual inputs as measured by our diagnostic metrics while preserving or improving accuracy on unperturbed tasks. If the editor prefers, we can include a few key quantitative results in the abstract during revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper defines ProCauEval as an independent perturbation protocol with five modality configurations and introduces ADPO as a new RL objective that explicitly maximizes distributional divergence between original and corrupted inputs. Both the benchmark and the training objective are constructed from first principles without reducing to fitted parameters or prior results by definition. Empirical findings on 17 LMMs consist of direct accuracy measurements on separate perception and causal tasks rather than tautological outputs; no self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing premises. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The counterfactual teacher in ADPO is a constructed entity whose implementation details and independence from the target result cannot be verified from the given text.

pith-pipeline@v0.9.0 · 5579 in / 1196 out tokens · 60971 ms · 2026-05-12T03:52:21.868351+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 12 internal anchors

  1. [1]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. Trans. Mach. Learn. Res., 2025. URL https://openreview.net/forum?id=zKv8qULV6n

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  4. [4]

    The Kinetics Human Action Video Dataset

    Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017. URL http://arxiv.org/abs/1705.06950

  5. [5]

    Next-QA: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 9777–9786. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.00965. URL https://openaccess.thecvf.com...

  6. [6]

    MSR-VTT: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5288–5296. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.571. URL https://doi.org/10.1109/CVPR.2016.571

  7. [7]

    Mecd+: Unlocking event-level causal graph discovery for video reasoning

    Tieyuan Chen, Huabin Liu, Yi Wang, Yihang Chen, Tianyao He, Chaofan Gan, Huanyu He, and Weiyao Lin. Mecd+: Unlocking event-level causal graph discovery for video reasoning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  8. [8]

    Causalstep: A benchmark for explicit stepwise causal reasoning in videos

    Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, and Wentao Zhang. Causalstep: A benchmark for explicit stepwise causal reasoning in videos. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors,Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Edu...

  9. [9]

    doi: 10.1609/AAAI.V40I8.37582

    AAAI Press, 2026. doi: 10.1609/AAAI.V40I8.37582. URL https://doi.org/10. 1609/aaai.v40i8.37582

  10. [13]

    Distilling the Knowledge in a Neural Network

    Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.CoRR, abs/1503.02531, 2015. URLhttp://arxiv.org/abs/1503.02531

  11. [14]

    From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

    Jiangtong Li, Li Niu, and Liqing Zhang. From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 21241–21250. IEEE, 2022. doi: 10.1109/CVPR52688.2022.02059. URL https://doi.org/1...

  12. [15]

    Can I trust your answer? Visually grounded video question answering

    Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13204–13214. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01254. URL https://doi.org/10.1109/CVPR52733.2024.01254

  13. [18]

    R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. CoRR, abs/2503.10615, 2025

  14. [21]

    R1-reward: Training multimodal reward model through stable reinforcement learning

    Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-reward: Training multimodal reward model through stable reinforcement learning.arXiv preprint arXiv:2505.02835, 2025

  15. [22]

    Scaling RL to long videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

  16. [23]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  17. [24]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

  18. [25]

    Kwai Keye-VL 1.5 technical report

    Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai keye-vl 1.5 technical report.arXiv preprint arXiv:2509.01563, 2025

  19. [26]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  20. [27]

    Videochat-R1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-R1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception.arXiv preprint arXiv:2509.21100, 2025

  21. [30]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

  22. [31]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machin...

  23. [32]

    GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

  24. [34]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. I...

  25. [35]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Pro...

  26. [36]

    MMVU: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, Chengye Wang, Ziyao Shangguan, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. MMVU: measuring expert-level multi-discipline video understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashvi...

  27. [37]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 22195–22206. IEEE, 2024. doi: 10.1109/CVP...

  28. [39]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 11941–11952. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01100. URL https://doi.org/10.1109/ICCV51070.2023.01100

  29. [43]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

    Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen...

  30. [44]

    Longvila: Scaling long-context visual language models for long videos

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Yihui He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. Longvila: Scaling long-context visual language models for long videos. In The Thirteenth International Conference on Learning Representation...

  31. [45]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yan...

  32. [46]

    Multimodal autoregressive pre-training of large vision encoders

    Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor G. Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal autoregressive pre-training of large vision encoders. In IEEE/CVF Conference on Computer ...

  33. [47]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

  34. [48]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025. A Further Discussion, Limitations, and Future Work While our wor...

  35. [49]

    Accuracy (1-10): Does the caption correctly describe the visual content without hallucination?-1-2: Major factual errors or hallucinations throughout-3-4: Several inaccuracies or noticeable hallucinations-5-6: Mostly correct with minor inaccuracies-7-8: Accurate with only trivial errors-9-10: Fully accurate, no errors or hallucinations Evaluate the captio...

  36. [50]

    accuracy

    Completeness (1-10): Does the caption cover the key events, actions, and objects in the video?-1-2: Misses most important content-3-4: Covers only a small portion of key content-5-6: Covers main events but misses notable details-7-8: Covers most content with only minor omissions-9-10: Comprehensively covers all important contentThink step by step before s...