Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

Aidong Zhang; Bohan Liu; Guangzhi Xiong; Sanchit Sinha; Zhenghao He

arxiv: 2606.01558 · v1 · pith:36COOF52new · submitted 2026-06-01 · 💻 cs.CV

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

Sanchit Sinha , Guangzhi Xiong , Bohan Liu , Zhenghao He , Aidong Zhang This is my paper

Pith reviewed 2026-06-28 15:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords Chain-of-ThoughtMultimodal Large Language ModelsFine-tuningAttention guidanceVisual reasoningFailure modes

0 comments

The pith

Attention-guided fine-tuning improves chain-of-thought reasoning in multimodal models by fixing two failure modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often perform worse with chain-of-thought prompting than with direct answers on visual reasoning tasks. Analysis reveals two main problems: committing to an answer too early and not accessing visual tokens enough during reasoning. Standard fine-tuning helps only partially and can make models rely more on text instead of images. The proposed Att-CoT method adds attention guidance to training so that models delay answers and keep looking at visual information, resulting in better performance on benchmarks.

Core claim

The paper establishes that Attentive-CoT (Att-CoT) is an attention-guided fine-tuning objective which encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. This approach can be added to any CoT-SFT run without architectural changes, and experiments on three visual reasoning benchmarks across six MLLMs demonstrate enhanced CoT performance over standard fine-tuning.

What carries the argument

Attentive-CoT (Att-CoT): an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access.

If this is right

Att-CoT plugs directly into existing CoT-SFT training without any model architecture modifications.
It improves CoT performance on three visual reasoning benchmarks.
The gains hold across six MLLMs from three families at different scales.
Unlike standard CoT-SFT, it avoids increasing reliance on textual priors and maintains counterfactual visual dependence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention guidance during fine-tuning could be adapted to address similar reasoning issues in text-only language models.
The approach might extend to other multimodal tasks that require step-by-step visual analysis beyond the three benchmarks tested.

Load-bearing premise

The identified failure modes of premature answer commitment and limited direct visual-token access are the primary causes of CoT degradation in MLLMs, and guiding attention during fine-tuning will reliably mitigate them without introducing new issues or reducing performance on other tasks.

What would settle it

If experiments on the three visual reasoning benchmarks across the six MLLMs show no improvement or even degradation in CoT performance when using Att-CoT compared to standard fine-tuning, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.01558 by Aidong Zhang, Bohan Liu, Guangzhi Xiong, Sanchit Sinha, Zhenghao He.

**Figure 1.** Figure 1: Early answer commitment during CoT. We plot ℓt (denoted by ‘P(answer | prefix)’) and the corresponding relative rank (i.e., rank in the ordered probability list) of the answer token over CoT timesteps for a ChartQA test sample. Top: QVL-3B is already highly confident before CoT generation starts. Bottom: QVL-7B remains less committed early and gradually increases confidence as the answer is computed. The… view at source ↗

**Figure 2.** Figure 2: VTA values plotted across generation steps for [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Gold EC-AUC and ATR across baselines. Left: ChartQA, Right: CLEVR. 5.3 Visual Attention Analysis Next, we test the second desideratum: sustained visual reliance during reasoning [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of SFT and Att-CoT on an example from ChartQA. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Example from the obfuscation dataset. The [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: System prompt template for generating the Ground Truth CoT Data A.4 Inference Procedure A.4.1 Structured System Prompt To enforce a consistent and standard output format, we utilize the same system prompt shown in Fig12 [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: System prompt template for Chain-ofThought inference. A.4.2 Inference Settings Attention Implementation: For all our experiments, we utilize ‘eager’ attention implementation, which, albeit slower, gives a deterministic value to all computed attention weights as opposed to ‘FlashAttention’. Decoding: For QVL family, we utilize a maximum number of generated tokens of 256 for CoT and 32 for direct across al… view at source ↗

**Figure 8.** Figure 8: Early answer commitment differs by scale. Qwen2.5-VL-3B exhibits high confidence early in the think span (premature commitment), whereas Qwen2.5-VL-7B delays commitment until later. The dashed vertical line marks the transition from the think span to the answer span. In [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Representative example where SFT exhibits intermittent spikes during the thinking span, indicating [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Visual Token Attention across model scales. We plot layer-step heatmaps of the average attention mass assigned to visual tokens, At,vis, for Qwen2.5-VL-{3B, 7B, 72B} on the same CoT-prompted example. The horizontal axis is the generation step t, the vertical axis is the decoder layer, and color indicates At,vis (higher = more cumulative visual attention). The think and answer spans are delineated to show … view at source ↗

**Figure 11.** Figure 11: Attentive-CoT increases sustained visual attention during reasoning. Layer-step heatmaps of At,vis for the base model (top) versus Att-CoT (bottom) on the same CoT-prompted sample. Att-CoT shifts attention mass toward the think span and maintains more consistent visual grounding across steps, compared to the base model’s more transient/spiky visual attention near the answer span. 17 [PITH_FULL_IMAGE:figu… view at source ↗

**Figure 12.** Figure 12: Qualitative comparison of SFT vs. Att-CoT on a CLEVR example. SFT model commits to an incorrect [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Qualitative comparison of SFT vs. Att-CoT on a CV-Bench example. SFT model commits to an incorrect [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Comparison of token-to-visual attention overlays for the base model, SFT, and Att-CoT on the same [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

read the original abstract

The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Att-CoT is a practical attention tweak during CoT fine-tuning that lifts visual reasoning scores across six MLLMs on three benchmarks by pushing delayed answers and sustained image token use.

read the letter

The core contribution is an attention-guided fine-tuning objective that directly targets two observed CoT failure modes in MLLMs: early answer commitment and weak access to visual tokens during reasoning steps. They first document these patterns across model families and scales on step-wise visual datasets, then show that ordinary CoT-SFT only partially fixes them and can even increase text-prior reliance. Att-CoT adds an explicit penalty or reward on attention maps to keep the model looking at the image longer and deferring the final answer; it slots into any existing SFT run with no architecture change.

The analysis of failure modes is the strongest part. The patterns they describe are concrete and recur, and testing six models gives some breadth. The empirical claim is straightforward: Att-CoT beats standard fine-tuning on the reported benchmarks. That is useful for anyone already doing CoT-SFT on multimodal models.

The main gaps are the usual ones for an empirical methods paper. The abstract does not give effect sizes, confidence intervals, or ablation results that separate the attention term from simple changes in training data or length. It is also unclear how stable the gains are when the same models are evaluated on non-visual or non-reasoning tasks, or whether the claimed increase in visual dependence holds under counterfactual checks. Those details matter for judging whether the mechanism is as load-bearing as described.

This is aimed at researchers who fine-tune MLLMs for visual reasoning and want a drop-in objective rather than new architectures. It is solid enough on the experimental side to warrant peer review; the central result is testable and the method is cheap to try. I would send it out.

Referee Report

1 major / 2 minor

Summary. The paper analyzes CoT prompting behavior in MLLMs, identifying two failure modes (premature answer commitment and limited direct visual-token access) that cause performance degradation on visual reasoning tasks. It shows that standard CoT-SFT only partially mitigates these issues and can increase reliance on textual priors. Motivated by this, the authors propose Attentive-CoT (Att-CoT), a plug-in attention-guided fine-tuning objective that encourages delayed answer commitment and sustained visual-token access during rationale generation. Experiments across three visual reasoning benchmarks and six MLLMs demonstrate that Att-CoT improves CoT performance relative to standard fine-tuning without requiring architectural modifications.

Significance. If the empirical results hold after addressing experimental details, the work offers a practical, architecture-agnostic technique for improving multimodal chain-of-thought reasoning. The systematic failure-mode analysis and multi-model/multi-benchmark evaluation provide a useful empirical foundation; the no-architecture-change property is a clear strength for adoption.

major comments (1)

[Experiments / abstract] Experimental results (as described in the abstract and implied evaluation sections): performance gains are reported without statistical significance tests, error bars, run-to-run variance, or ablation studies isolating the attention-guidance components from other training differences. This is load-bearing for the central claim that Att-CoT specifically mitigates the identified failure modes rather than arising from uncontrolled factors such as data selection or optimization details.

minor comments (2)

[Method] The abstract and method description would benefit from an explicit equation or pseudocode formalizing the attention-guided loss term in Att-CoT.
[Experiments] Clarify the exact visual reasoning benchmarks and model scales used in the six-MLLM evaluation to allow direct replication.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. The concern regarding experimental rigor is valid and we will strengthen the manuscript accordingly.

read point-by-point responses

Referee: [Experiments / abstract] Experimental results (as described in the abstract and implied evaluation sections): performance gains are reported without statistical significance tests, error bars, run-to-run variance, or ablation studies isolating the attention-guidance components from other training differences. This is load-bearing for the central claim that Att-CoT specifically mitigates the identified failure modes rather than arising from uncontrolled factors such as data selection or optimization details.

Authors: We agree that the absence of statistical tests, error bars, variance reporting, and targeted ablations weakens the ability to isolate the contribution of the attention-guidance objective. The manuscript demonstrates consistent gains across six MLLMs and three benchmarks, but this does not substitute for the requested controls. In the revised version we will add: multiple random seeds with error bars and run-to-run variance; statistical significance testing (e.g., paired t-tests against CoT-SFT baselines); and ablations that hold data, optimizer, and schedule fixed while varying only the attention-guidance term. These results will be incorporated into the experiments section and abstract where appropriate. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical analysis of CoT failure modes in MLLMs followed by a proposed attention-guided fine-tuning objective (Att-CoT) that is evaluated on external visual reasoning benchmarks across six models. No equations, fitted parameters, or self-citations are used to derive the central performance claim; the improvement is demonstrated directly via controlled experiments rather than by construction from the paper's own inputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical results from fine-tuning experiments. No explicit free parameters beyond standard training hyperparameters are described. The approach relies on domain assumptions about attention mechanisms in transformers and the validity of the identified failure modes.

axioms (1)

domain assumption Standard supervised fine-tuning loss functions and attention mechanisms in transformer-based MLLMs behave as expected under the proposed objective.
The method assumes these components can be guided without side effects, as invoked in the description of Att-CoT.

pith-pipeline@v0.9.1-grok · 5733 in / 1316 out tokens · 31832 ms · 2026-06-28T15:43:20.433675+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 42 canonical work pages · 15 internal anchors

[1]

Chen, Z.,et al.,

Bring reason to vision: Understanding perception and reasoning through model merging , author=. arXiv preprint arXiv:2505.05464 , year=

work page arXiv
[2]

URL https: //aclanthology.org/2025.acl-long.257/

Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia. Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long C o T Reasoning. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.257

work page doi:10.18653/v1/2025.acl-long.257 2025
[3]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[4]

Lora vs full fine-tuning: An illusion of equivalence

Lora vs full fine-tuning: An illusion of equivalence , author=. arXiv preprint arXiv:2410.21228 , year=

work page arXiv
[5]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Llava-o1: Let vision language models reason step-by-step , author=. arXiv preprint arXiv:2411.10440 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning

Islam, Mohammed Saidul and Rahman, Raian and Masry, Ahmed and Laskar, Md Tahmid Rahman and Nayeem, Mir Tafseer and Hoque, Enamul. Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.191

work page doi:10.18653/v1/2024.findings-emnlp.191 2024
[7]

C hart G emma: Visual Instruction-tuning for Chart Reasoning in the Wild

Masry, Ahmed and Thakkar, Megh and Bajaj, Aayush and Kartha, Aaryaman and Hoque, Enamul and Joty, Shafiq. C hart G emma: Visual Instruction-tuning for Chart Reasoning in the Wild. Proceedings of the 31st International Conference on Computational Linguistics: Industry Track. 2025

2025
[8]

Forty-second International Conference on Machine Learning , year=

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding , author=. Forty-second International Conference on Machine Learning , year=
[9]

Chart-based Reasoning: Transferring Capabilities from LLM s to VLM s

Carbune, Victor and Mansoor, Hassan and Liu, Fangyu and Aralikatte, Rahul and Baechler, Gilles and Chen, Jindong and Sharma, Abhanshu. Chart-based Reasoning: Transferring Capabilities from LLM s to VLM s. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.62

work page doi:10.18653/v1/2024.findings-naacl.62 2024
[10]

V - DPO : Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Xie, Yuxi and Li, Guanzhen and Xu, Xiao and Kan, Min-Yen. V - DPO : Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.775

work page doi:10.18653/v1/2024.findings-emnlp.775 2024
[11]

arXiv preprint arXiv:2503.16188 , year=

Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning , author=. arXiv preprint arXiv:2503.16188 , year=

work page arXiv
[12]

arXiv preprint arXiv:2503.08525 , year=

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training , author=. arXiv preprint arXiv:2503.08525 , year=

work page arXiv
[13]

arXiv preprint arXiv:2203.08788 , year=

Are shortest rationales the best explanations for human understanding? , author=. arXiv preprint arXiv:2203.08788 , year=

work page arXiv
[14]

Visual-RFT: Visual Reinforcement Fine-Tuning

Visual-rft: Visual reinforcement fine-tuning , author=. arXiv preprint arXiv:2503.01785 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Evochart: A benchmark and a self-training approach towards real-world chart understanding , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[16]

International Conference on Learning Representations (ICLR) , year=

React: Synergizing reasoning and acting in language models , author=. International Conference on Learning Representations (ICLR) , year=
[17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Question Answering

Masry, Ahmed and Islam, Mohammed Saidul and Ahmed, Mahir and Bajaj, Aayush and Kabir, Firoz and Kartha, Aaryaman and Laskar, Md Tahmid Rahman and Rahman, Mizanur and Rahman, Shadikur and Shahmohammadi, Mehrad and Thakkar, Megh and Parvez, Md Rizwan and Hoque, Enamul and Joty, Shafiq. C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Questi...

work page doi:10.18653/v1/2025.findings-acl.978 2025
[19]

arXiv preprint arXiv:2312.15915 , year=

Chartbench: A benchmark for complex visual reasoning in charts , author=. arXiv preprint arXiv:2312.15915 , year=

work page arXiv
[20]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. arXiv preprint arXiv:2308.12966 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Findings of the Association for Computational Linguistics: ACL 2022 , pages=

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2022 , pages=

2022
[24]

GitHub repository , publisher =

Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou. GitHub repository , publisher =
[25]

and Kumar, Pratyush , title =

Methani, Nitesh and Ganguly, Pritha and Khapra, Mitesh M. and Kumar, Pratyush , title =. The IEEE Winter Conference on Applications of Computer Vision (WACV) , month =
[26]

Findings of the Association for Computational Linguistics: EACL 2023 , pages=

Reading and Reasoning over Chart Images for Evidence-based Automated Fact-Checking , author=. Findings of the Association for Computational Linguistics: EACL 2023 , pages=

2023
[27]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
[28]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[29]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[30]

International Conference on Machine Learning , pages=

Pix2struct: Screenshot parsing as pretraining for visual language understanding , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[31]

arXiv preprint arXiv:2504.13275 , year=

ChartQA-X: Generating Explanations for Charts , author=. arXiv preprint arXiv:2504.13275 , year=

work page arXiv
[32]

arXiv preprint arXiv:2504.06637 , year=

SCI-Reason: A Dataset with Chain-of-Thought Rationales for Complex Multimodal Reasoning in Academic Areas , author=. arXiv preprint arXiv:2504.06637 , year=

work page arXiv
[33]

The Exploration in AI Today Workshop at ICML 2025 , year=

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models , author=. The Exploration in AI Today Workshop at ICML 2025 , year=

2025
[34]

Findings of the Association for Computational Linguistics ACL 2024 , pages=

ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning , author=. Findings of the Association for Computational Linguistics ACL 2024 , pages=

2024
[35]

T iny C hart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging

Zhang, Liang and Hu, Anwen and Xu, Haiyang and Yan, Ming and Xu, Yichen and Jin, Qin and Zhang, Ji and Huang, Fei. T iny C hart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.112

work page doi:10.18653/v1/2024.emnlp-main.112 2024
[36]

Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

Graph-Based Multimodal Contrastive Learning for Chart Question Answering , author=. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=
[37]

C hart I nsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering

Wu, Yifan and Yan, Lutao and Shen, Leixian and Wang, Yunhai and Tang, Nan and Luo, Yuyu. C hart I nsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.710

work page doi:10.18653/v1/2024.findings-emnlp.710 2024
[38]

arXiv preprint arXiv:2502.12289 , year=

Evaluating step-by-step reasoning traces: A survey , author=. arXiv preprint arXiv:2502.12289 , year=

work page arXiv
[39]

arXiv preprint arXiv:2505.20777 , year=

TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs , author=. arXiv preprint arXiv:2505.20777 , year=

work page arXiv
[40]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Advances in Neural Information Processing Systems , volume=

Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting , author=. Advances in Neural Information Processing Systems , volume=
[42]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Sft memorizes, rl generalizes: A comparative study of foundation model post-training , author=. arXiv preprint arXiv:2501.17161 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

arXiv preprint arXiv:2504.02906 , year=

Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement , author=. arXiv preprint arXiv:2504.02906 , year=

work page arXiv
[44]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Vision-r1: Incentivizing reasoning capability in multimodal large language models , author=. arXiv preprint arXiv:2503.06749 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019

2019
[46]

Gemma 3 Technical Report

Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency , author=. arXiv preprint arXiv:2508.18265 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[48]

arXiv preprint arXiv:2505.21523 , year=

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models , author=. arXiv preprint arXiv:2505.21523 , year=

work page arXiv
[49]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Emergent Abilities of Large Language Models

Emergent abilities of large language models , author=. arXiv preprint arXiv:2206.07682 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning , author=
[53]

Advances in Neural Information Processing Systems , volume=

Charxiv: Charting gaps in realistic chart understanding in multimodal llms , author=. Advances in Neural Information Processing Systems , volume=
[54]

The Eleventh International Conference on Learning Representations , year=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. The Eleventh International Conference on Learning Representations , year=
[55]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Improve vision language model chain-of-thought reasoning , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[56]

Multimodal Chain-of-Thought Reasoning in Language Models

Multimodal chain-of-thought reasoning in language models , author=. arXiv preprint arXiv:2302.00923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Measuring and improving chain-of-thought reasoning in vision-language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024
[58]

Advances in Neural Information Processing Systems , volume=

Cambrian-1: A fully open, vision-centric exploration of multimodal llms , author=. Advances in Neural Information Processing Systems , volume=
[59]

On the impact of fine-tuning on chain-of-thought reasoning , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[60]

arXiv preprint arXiv:2407.10490 , year=

Learning dynamics of llm finetuning , author=. arXiv preprint arXiv:2407.10490 , year=

work page arXiv
[61]

arXiv preprint arXiv:2512.12218 , year=

Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking , author=. arXiv preprint arXiv:2512.12218 , year=

work page arXiv
[62]

train on the test set

Benchmark Designers Should" Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts , author=. arXiv preprint arXiv:2511.04655 , year=

work page arXiv
[63]

arXiv preprint arXiv:2507.03019 , year=

Look-back: Implicit visual re-focusing in mllm reasoning , author=. arXiv preprint arXiv:2507.03019 , year=

work page arXiv
[64]

arXiv preprint arXiv:2503.01773 , year=

Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas , author=. arXiv preprint arXiv:2503.01773 , year=

work page arXiv
[65]

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Li, Yong Jae , journal=
[66]

8th Annual Conference on Robot Learning , year=

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models , author =. 8th Annual Conference on Robot Learning , year=
[67]

European Conference on Computer Vision (ECCV) , year =

Chonghao Sima and Katrin Renz and Kashyap Chitta and Li Chen and Hanxue Zhang and Chengen Xie and Ping Luo and Andreas Geiger and Hongyang Li , title =. European Conference on Computer Vision (ECCV) , year =
[68]

2025 , eprint =

Distilling Multi-modal Large Language Models for Autonomous Driving , author =. 2025 , eprint =

2025
[69]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Hua, Hang and Liu, Qing and Zhang, Lingzhi and Shi, Jing and Kim, Soo Ye and Zhang, Zhifei and Wang, Yilin and Zhang, Jianming and Lin, Zhe and Luo, Jiebo , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =
[70]

2025 , eprint =

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception , author =. 2025 , eprint =

2025
[71]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Wang, Sheng and Li, Yilei and Yuan, Jiayi and Zhou, Pan and Chen, Xinyuan and Fu, Jingjing and Xu, Jie and Jiang, Lijuan , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

2025
[72]

2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) , pages=

On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study , author=. 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) , pages=. 2024 , organization=

2024
[73]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
[74]

Advances in Neural Information Processing Systems , editor =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , editor =
[75]

Advances in Neural Information Processing Systems , editor =

Large Language Models are Zero-Shot Reasoners , author =. Advances in Neural Information Processing Systems , editor =
[76]

and Girshick, Ross , title =

Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and Fei-Fei, Li and Lawrence Zitnick, C. and Girshick, Ross , title =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =
[77]

and Manning, Christopher D

Hudson, Drew A. and Manning, Christopher D. , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
[78]

The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , title =. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =
[79]

arXiv preprint arXiv:2404.18624 , year=

Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? , author=. arXiv preprint arXiv:2404.18624 , year=

work page arXiv
[80]

arXiv preprint arXiv:2010.13984 , year=

Interpretation of NLP models through input marginalization , author=. arXiv preprint arXiv:2010.13984 , year=

work page arXiv 2010

Showing first 80 references.

[1] [1]

Chen, Z.,et al.,

Bring reason to vision: Understanding perception and reasoning through model merging , author=. arXiv preprint arXiv:2505.05464 , year=

work page arXiv

[2] [2]

URL https: //aclanthology.org/2025.acl-long.257/

Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia. Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long C o T Reasoning. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.257

work page doi:10.18653/v1/2025.acl-long.257 2025

[3] [3]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025

[4] [4]

Lora vs full fine-tuning: An illusion of equivalence

Lora vs full fine-tuning: An illusion of equivalence , author=. arXiv preprint arXiv:2410.21228 , year=

work page arXiv

[5] [5]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Llava-o1: Let vision language models reason step-by-step , author=. arXiv preprint arXiv:2411.10440 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning

Islam, Mohammed Saidul and Rahman, Raian and Masry, Ahmed and Laskar, Md Tahmid Rahman and Nayeem, Mir Tafseer and Hoque, Enamul. Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.191

work page doi:10.18653/v1/2024.findings-emnlp.191 2024

[7] [7]

C hart G emma: Visual Instruction-tuning for Chart Reasoning in the Wild

Masry, Ahmed and Thakkar, Megh and Bajaj, Aayush and Kartha, Aaryaman and Hoque, Enamul and Joty, Shafiq. C hart G emma: Visual Instruction-tuning for Chart Reasoning in the Wild. Proceedings of the 31st International Conference on Computational Linguistics: Industry Track. 2025

2025

[8] [8]

Forty-second International Conference on Machine Learning , year=

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding , author=. Forty-second International Conference on Machine Learning , year=

[9] [9]

Chart-based Reasoning: Transferring Capabilities from LLM s to VLM s

Carbune, Victor and Mansoor, Hassan and Liu, Fangyu and Aralikatte, Rahul and Baechler, Gilles and Chen, Jindong and Sharma, Abhanshu. Chart-based Reasoning: Transferring Capabilities from LLM s to VLM s. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.62

work page doi:10.18653/v1/2024.findings-naacl.62 2024

[10] [10]

V - DPO : Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Xie, Yuxi and Li, Guanzhen and Xu, Xiao and Kan, Min-Yen. V - DPO : Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.775

work page doi:10.18653/v1/2024.findings-emnlp.775 2024

[11] [11]

arXiv preprint arXiv:2503.16188 , year=

Think or not think: A study of explicit thinking in rule-based visual reinforcement fine-tuning , author=. arXiv preprint arXiv:2503.16188 , year=

work page arXiv

[12] [12]

arXiv preprint arXiv:2503.08525 , year=

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training , author=. arXiv preprint arXiv:2503.08525 , year=

work page arXiv

[13] [13]

arXiv preprint arXiv:2203.08788 , year=

Are shortest rationales the best explanations for human understanding? , author=. arXiv preprint arXiv:2203.08788 , year=

work page arXiv

[14] [14]

Visual-RFT: Visual Reinforcement Fine-Tuning

Visual-rft: Visual reinforcement fine-tuning , author=. arXiv preprint arXiv:2503.01785 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Evochart: A benchmark and a self-training approach towards real-world chart understanding , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[16] [16]

International Conference on Learning Representations (ICLR) , year=

React: Synergizing reasoning and acting in language models , author=. International Conference on Learning Representations (ICLR) , year=

[17] [17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Question Answering

Masry, Ahmed and Islam, Mohammed Saidul and Ahmed, Mahir and Bajaj, Aayush and Kabir, Firoz and Kartha, Aaryaman and Laskar, Md Tahmid Rahman and Rahman, Mizanur and Rahman, Shadikur and Shahmohammadi, Mehrad and Thakkar, Megh and Parvez, Md Rizwan and Hoque, Enamul and Joty, Shafiq. C hart QAP ro: A More Diverse and Challenging Benchmark for Chart Questi...

work page doi:10.18653/v1/2025.findings-acl.978 2025

[19] [19]

arXiv preprint arXiv:2312.15915 , year=

Chartbench: A benchmark for complex visual reasoning in charts , author=. arXiv preprint arXiv:2312.15915 , year=

work page arXiv

[20] [20]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. arXiv preprint arXiv:2409.12191 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. arXiv preprint arXiv:2308.12966 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Findings of the Association for Computational Linguistics: ACL 2022 , pages=

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2022 , pages=

2022

[24] [24]

GitHub repository , publisher =

Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou. GitHub repository , publisher =

[25] [25]

and Kumar, Pratyush , title =

Methani, Nitesh and Ganguly, Pritha and Khapra, Mitesh M. and Kumar, Pratyush , title =. The IEEE Winter Conference on Applications of Computer Vision (WACV) , month =

[26] [26]

Findings of the Association for Computational Linguistics: EACL 2023 , pages=

Reading and Reasoning over Chart Images for Evidence-based Automated Fact-Checking , author=. Findings of the Association for Computational Linguistics: EACL 2023 , pages=

2023

[27] [27]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

[28] [28]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023

[29] [29]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[30] [30]

International Conference on Machine Learning , pages=

Pix2struct: Screenshot parsing as pretraining for visual language understanding , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[31] [31]

arXiv preprint arXiv:2504.13275 , year=

ChartQA-X: Generating Explanations for Charts , author=. arXiv preprint arXiv:2504.13275 , year=

work page arXiv

[32] [32]

arXiv preprint arXiv:2504.06637 , year=

SCI-Reason: A Dataset with Chain-of-Thought Rationales for Complex Multimodal Reasoning in Academic Areas , author=. arXiv preprint arXiv:2504.06637 , year=

work page arXiv

[33] [33]

The Exploration in AI Today Workshop at ICML 2025 , year=

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models , author=. The Exploration in AI Today Workshop at ICML 2025 , year=

2025

[34] [34]

Findings of the Association for Computational Linguistics ACL 2024 , pages=

ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning , author=. Findings of the Association for Computational Linguistics ACL 2024 , pages=

2024

[35] [35]

T iny C hart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging

Zhang, Liang and Hu, Anwen and Xu, Haiyang and Yan, Ming and Xu, Yichen and Jin, Qin and Zhang, Ji and Huang, Fei. T iny C hart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.112

work page doi:10.18653/v1/2024.emnlp-main.112 2024

[36] [36]

Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

Graph-Based Multimodal Contrastive Learning for Chart Question Answering , author=. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

[37] [37]

C hart I nsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering

Wu, Yifan and Yan, Lutao and Shen, Leixian and Wang, Yunhai and Tang, Nan and Luo, Yuyu. C hart I nsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.710

work page doi:10.18653/v1/2024.findings-emnlp.710 2024

[38] [38]

arXiv preprint arXiv:2502.12289 , year=

Evaluating step-by-step reasoning traces: A survey , author=. arXiv preprint arXiv:2502.12289 , year=

work page arXiv

[39] [39]

arXiv preprint arXiv:2505.20777 , year=

TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs , author=. arXiv preprint arXiv:2505.20777 , year=

work page arXiv

[40] [40]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Advances in Neural Information Processing Systems , volume=

Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting , author=. Advances in Neural Information Processing Systems , volume=

[42] [42]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Sft memorizes, rl generalizes: A comparative study of foundation model post-training , author=. arXiv preprint arXiv:2501.17161 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

arXiv preprint arXiv:2504.02906 , year=

Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement , author=. arXiv preprint arXiv:2504.02906 , year=

work page arXiv

[44] [44]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Vision-r1: Incentivizing reasoning capability in multimodal large language models , author=. arXiv preprint arXiv:2503.06749 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019

2019

[46] [46]

Gemma 3 Technical Report

Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency , author=. arXiv preprint arXiv:2508.18265 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

arXiv preprint arXiv:2505.21523 , year=

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models , author=. arXiv preprint arXiv:2505.21523 , year=

work page arXiv

[49] [49]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

Emergent Abilities of Large Language Models

Emergent abilities of large language models , author=. arXiv preprint arXiv:2206.07682 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning , author=

[53] [53]

Advances in Neural Information Processing Systems , volume=

Charxiv: Charting gaps in realistic chart understanding in multimodal llms , author=. Advances in Neural Information Processing Systems , volume=

[54] [54]

The Eleventh International Conference on Learning Representations , year=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. The Eleventh International Conference on Learning Representations , year=

[55] [55]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Improve vision language model chain-of-thought reasoning , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[56] [56]

Multimodal Chain-of-Thought Reasoning in Language Models

Multimodal chain-of-thought reasoning in language models , author=. arXiv preprint arXiv:2302.00923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Measuring and improving chain-of-thought reasoning in vision-language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024

[58] [58]

Advances in Neural Information Processing Systems , volume=

Cambrian-1: A fully open, vision-centric exploration of multimodal llms , author=. Advances in Neural Information Processing Systems , volume=

[59] [59]

On the impact of fine-tuning on chain-of-thought reasoning , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025

[60] [60]

arXiv preprint arXiv:2407.10490 , year=

Learning dynamics of llm finetuning , author=. arXiv preprint arXiv:2407.10490 , year=

work page arXiv

[61] [61]

arXiv preprint arXiv:2512.12218 , year=

Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking , author=. arXiv preprint arXiv:2512.12218 , year=

work page arXiv

[62] [62]

train on the test set

Benchmark Designers Should" Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts , author=. arXiv preprint arXiv:2511.04655 , year=

work page arXiv

[63] [63]

arXiv preprint arXiv:2507.03019 , year=

Look-back: Implicit visual re-focusing in mllm reasoning , author=. arXiv preprint arXiv:2507.03019 , year=

work page arXiv

[64] [64]

arXiv preprint arXiv:2503.01773 , year=

Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas , author=. arXiv preprint arXiv:2503.01773 , year=

work page arXiv

[65] [65]

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Li, Yong Jae , journal=

[66] [66]

8th Annual Conference on Robot Learning , year=

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models , author =. 8th Annual Conference on Robot Learning , year=

[67] [67]

European Conference on Computer Vision (ECCV) , year =

Chonghao Sima and Katrin Renz and Kashyap Chitta and Li Chen and Hanxue Zhang and Chengen Xie and Ping Luo and Andreas Geiger and Hongyang Li , title =. European Conference on Computer Vision (ECCV) , year =

[68] [68]

2025 , eprint =

Distilling Multi-modal Large Language Models for Autonomous Driving , author =. 2025 , eprint =

2025

[69] [69]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

Hua, Hang and Liu, Qing and Zhang, Lingzhi and Shi, Jing and Kim, Soo Ye and Zhang, Zhifei and Wang, Yilin and Zhang, Jianming and Lin, Zhe and Luo, Jiebo , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

[70] [70]

2025 , eprint =

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception , author =. 2025 , eprint =

2025

[71] [71]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Wang, Sheng and Li, Yilei and Yuan, Jiayi and Zhou, Pan and Chen, Xinyuan and Fu, Jingjing and Xu, Jie and Jiang, Lijuan , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

2025

[72] [72]

2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) , pages=

On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study , author=. 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) , pages=. 2024 , organization=

2024

[73] [73]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[74] [74]

Advances in Neural Information Processing Systems , editor =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , editor =

[75] [75]

Advances in Neural Information Processing Systems , editor =

Large Language Models are Zero-Shot Reasoners , author =. Advances in Neural Information Processing Systems , editor =

[76] [76]

and Girshick, Ross , title =

Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and Fei-Fei, Li and Lawrence Zitnick, C. and Girshick, Ross , title =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =

[77] [77]

and Manning, Christopher D

Hudson, Drew A. and Manning, Christopher D. , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

[78] [78]

The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Zellers, Rowan and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin , title =. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =

[79] [79]

arXiv preprint arXiv:2404.18624 , year=

Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? , author=. arXiv preprint arXiv:2404.18624 , year=

work page arXiv

[80] [80]

arXiv preprint arXiv:2010.13984 , year=

Interpretation of NLP models through input marginalization , author=. arXiv preprint arXiv:2010.13984 , year=

work page arXiv 2010