Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Pith reviewed 2026-05-10 19:55 UTC · model grok-4.3
The pith
Rewarding vision-language models for saliency-map overlap with human-annotated boxes makes their reasoning more visually grounded.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Saliency-R1 introduces an efficient saliency map method that identifies image regions influencing each token without extra computation and extends it to trace visual evidence through the full reasoning chain. The overlap between these maps and human-annotated bounding boxes serves as the reward signal inside Group Relative Policy Optimization, training the model to align its thinking process with relevant visual content. Experiments report gains in faithfulness, interpretability, and task performance.
What carries the argument
Saliency-map alignment reward computed from overlap with human-annotated bounding boxes and optimized through Group Relative Policy Optimization (GRPO).
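To make the load-bearing mechanism concrete, here is a minimal sketch of an IoU-style overlap reward and the group-relative advantage GRPO computes from it. The paper does not publish its exact reward formula, so the binarization threshold, the IoU choice, and all function names below are illustrative assumptions; only the advantage normalization follows the public GRPO recipe.

```python
# Minimal sketch, not the paper's implementation: an IoU-style overlap
# reward between a saliency map and a human-annotated box, plus the
# group-relative advantage GRPO derives from a group of sampled responses.
import numpy as np

def overlap_reward(saliency: np.ndarray, box: tuple, thresh: float = 0.5) -> float:
    """IoU between a thresholded saliency map (H x W, values in [0, 1])
    and an annotated box (x0, y0, x1, y1) in pixel coordinates.
    The threshold and the IoU choice are assumptions, not the paper's."""
    mask = saliency >= thresh                 # binarize the saliency map
    box_mask = np.zeros_like(mask)
    x0, y0, x1, y1 = box
    box_mask[y0:y1, x0:x1] = True
    inter = np.logical_and(mask, box_mask).sum()
    union = np.logical_or(mask, box_mask).sum()
    return float(inter / union) if union > 0 else 0.0

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO's group-relative advantage: each sampled response is scored
    against its own group, A_i = (r_i - mean(r)) / std(r)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, a group of four sampled reasoning chains (random maps here).
rewards = np.array([overlap_reward(np.random.rand(224, 224), (50, 60, 150, 160))
                    for _ in range(4)])
print(grpo_advantages(rewards))
```

The policy gradient then weights each chain's token log-probabilities by its advantage, pushing the model toward reasoning chains whose saliency lands inside the annotated regions.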
If this is right
- Reasoning chains show traceable flow from specific image regions to the final answer.
- Models produce fewer responses that rely primarily on textual shortcuts.
- Users can inspect saliency maps to verify whether the reasoning used the intended visual evidence.
- Task performance improves on problems that require tight image-text alignment.
Where Pith is reading between the lines
- The same overlap reward could be tested on video or audio models to enforce grounding across time.
- If bounding-box labels prove costly, the method might be extended with automatically generated pseudo-labels derived from the model's own outputs.
- Combining this visual reward with other signals could further reduce hallucination rates in open-ended generation.
Load-bearing premise
That greater overlap between the model's saliency maps and human bounding boxes means the reasoning process is actually grounded in the visual evidence rather than still driven by text cues.
What would settle it
A test case where high saliency overlap occurs but the model still produces incorrect or ungrounded answers that ignore the highlighted regions.
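One way to run that settling experiment, sketched under assumptions: mask the model's own high-saliency region and check whether the answer survives. If high-overlap responses keep their answers after the highlighted evidence is removed, the overlap was not load-bearing. The `model.answer` interface and the masking schedule below are hypothetical placeholders, not the paper's protocol.

```python
# Hedged sketch of the settling test: does removing the highlighted
# evidence actually change the answer? `model.answer` is a hypothetical
# interface, not the paper's API.
import numpy as np

def mask_salient(image: np.ndarray, saliency: np.ndarray,
                 frac: float = 0.2, fill: float = 0.5) -> np.ndarray:
    """Gray out the top `frac` most salient pixels of an H x W x C image."""
    cutoff = np.quantile(saliency, 1.0 - frac)
    out = image.copy()
    out[saliency >= cutoff] = fill
    return out

def evidence_is_load_bearing(model, image, question, saliency) -> bool:
    """True if masking the salient region flips the answer; a high-overlap
    case where this returns False is exactly the counterexample sought."""
    original = model.answer(image, question)
    perturbed = model.answer(mask_salient(image, saliency), question)
    return original != perturbed
```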
Original abstract
Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts with the critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Saliency-R1, a framework for VLMs that introduces a novel saliency-map technique to trace token contributions and visual information flow during chain-of-thought reasoning without extra overhead. It defines a reward as the overlap between these saliency maps and human-annotated bounding boxes, then optimizes the model via Group Relative Policy Optimization (GRPO) to encourage focus on relevant visual regions, with the goal of improving reasoning faithfulness, interpretability, and downstream task performance.
Significance. If the central claims hold after proper validation, the work could provide a lightweight RL-based mechanism for enforcing visual grounding in VLM reasoning, addressing a key trustworthiness gap. The absence of any reported datasets, baselines, metrics, ablations, or statistical tests in the manuscript, however, prevents assessment of whether the approach delivers meaningful gains over existing attention- or gradient-based saliency methods.
major comments (2)
- [Abstract] The assertion that 'Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance' is unsupported by any quantitative results, datasets, baselines, or ablation studies, rendering the central empirical claim unevaluable.
- [Abstract / Method] The reward definition (overlap between the novel saliency maps and human-annotated boxes) is load-bearing for the faithfulness claim, yet the manuscript provides no perturbation tests (e.g., masking salient regions and measuring answer change), no comparison of saliency fidelity against attention/gradient baselines, and no evidence that higher overlap correlates with reduced textual shortcut reliance. Without these, the GRPO objective may optimize the proxy while leaving actual grounding unchanged.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current manuscript requires stronger empirical support for its claims and will revise accordingly to include quantitative results, datasets, baselines, ablations, and validation experiments. Below we respond point by point to the major comments.
point-by-point responses
- Referee: [Abstract] The assertion that 'Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance' is unsupported by any quantitative results, datasets, baselines, or ablation studies, rendering the central empirical claim unevaluable.
  Authors: We acknowledge that the abstract's claim is not accompanied by specific quantitative results, dataset details, baselines, or ablations in the current manuscript. The experiments section will be expanded in the revision to report concrete metrics on faithfulness (e.g., grounding accuracy), interpretability (e.g., saliency alignment scores), and task performance across standard VLM benchmarks, with explicit comparisons to baselines and ablation studies. The abstract will be updated to reference these results more precisely. revision: yes
- Referee: [Abstract / Method] The reward definition (overlap between the novel saliency maps and human-annotated boxes) is load-bearing for the faithfulness claim, yet the manuscript provides no perturbation tests (e.g., masking salient regions and measuring answer change), no comparison of saliency fidelity against attention/gradient baselines, and no evidence that higher overlap correlates with reduced textual shortcut reliance. Without these, the GRPO objective may optimize the proxy while leaving actual grounding unchanged.
  Authors: We agree that validating the saliency overlap reward as a faithful proxy for visual grounding requires additional tests. In the revised manuscript we will add: (1) perturbation experiments masking high-saliency regions and measuring resulting changes in answer accuracy and reasoning paths; (2) quantitative comparisons of our saliency maps against attention- and gradient-based baselines using standard fidelity metrics; and (3) controlled analyses on shortcut-prone examples demonstrating that higher overlap scores correlate with reduced textual cue reliance. These will directly address whether GRPO optimizes for genuine grounding. revision: yes
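As a sketch of what the promised fidelity comparison could look like, under stated assumptions: a deletion-style metric that removes pixels in decreasing saliency order and integrates the drop in the model's confidence in its original answer, applied identically to the proposed maps and to attention or gradient baselines. `model.answer_prob` and the deletion schedule are hypothetical, not the authors' stated protocol.

```python
# Hedged sketch of a deletion-curve fidelity metric for comparing saliency
# methods. `model.answer_prob` (probability the model assigns to its original
# answer) is a hypothetical interface; the schedule is illustrative.
import numpy as np

def deletion_auc(model, image, question, saliency, steps: int = 10) -> float:
    """Delete pixels most-salient-first and integrate the confidence drop.
    Lower AUC means the map better locates the evidence the answer uses."""
    order = np.argsort(saliency.ravel())[::-1]        # most salient first
    flat = image.reshape(-1, image.shape[-1]).copy()
    probs = [model.answer_prob(image, question)]
    chunk = len(order) // steps
    for i in range(steps):
        flat[order[i * chunk:(i + 1) * chunk]] = 0.0  # erase one chunk
        probs.append(model.answer_prob(flat.reshape(image.shape), question))
    return float(np.trapz(probs, dx=1.0 / steps))

# Averaging deletion_auc over shortcut-prone examples, separately for the
# proposed maps and for attention/gradient baselines, gives the promised
# head-to-head fidelity comparison.
```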
Circularity Check
No significant circularity detected
full rationale
The derivation chain defines a novel saliency-map computation from token contributions, then sets the RL reward explicitly as overlap with independent human-annotated bounding boxes before applying GRPO. Because the human boxes constitute external data and the saliency extraction is presented as a new technique without self-referential equations or fitted parameters renamed as predictions, no load-bearing step reduces by construction to its own inputs. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the abstract or method outline. The central claim of improved faithfulness therefore rests on experimental outcomes measured against separate task metrics rather than tautological re-use of the reward signal itself.
Axiom & Free-Parameter Ledger
invented entities (1)
- Novel saliency map technique for VLMs (no independent evidence)