MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

Haidong Yuan; Haokun Zhao; Long Ma; Songjun Cao; Wanshi Xu

arxiv: 2606.17888 · v1 · pith:ITPCLALNnew · submitted 2026-06-16 · 💻 cs.AI

Wanshi Xu, Haokun Zhao, Haidong Yuan, Songjun Cao, Long Ma

Pith reviewed 2026-06-27 01:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal mathematical reasoning · visual dependency ratings · progressive training · reward balancing · visual supervision · MathVis-Fine dataset · chain-of-thought reasoning

The pith

A two-stage training paradigm uses sample-specific visual dependency ratings to balance answer correctness and visual grounding rewards in multimodal math reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that current multimodal reasoning models apply visual supervision uniformly, ignoring how much each math problem actually needs the image. It constructs the MathVis-Fine dataset with fine-grained visual annotations and dependency ratings for each sample. A two-stage progressive training process then weights visual rewards higher only for high-dependency samples while keeping answer rewards primary for low-dependency ones. This targets two problems: coarse visual signals that do not match necessity and inaccurate feedback from uniform reward application. A sympathetic reader would care because mismatched supervision can cause models either to ignore useful images or to over-rely on irrelevant ones during chain-of-thought reasoning.

Core claim

The MathVis-Fine framework first augments mathematical problems with fine-grained visual annotations and visual dependency ratings. It then applies a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards and visual grounding rewards according to each sample's intrinsic visual dependency level, thereby mitigating reward bias and improving supervision accuracy for multimodal mathematical reasoning.

What carries the argument

The two-stage progressive visual enhancement training paradigm that adjusts the balance between answer correctness rewards and visual grounding rewards according to per-sample visual dependency ratings.
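A minimal sketch of what dependency-guided balancing could look like, assuming a simple linear gate. The abstract does not state the actual reward formula, so the function name `combined_reward`, the additive form, and the reward values below are illustrative assumptions only; the three rating levels (0.0, 0.5, 1.0) are the ones reported for the dataset.

```python
def combined_reward(answer_reward: float, visual_reward: float, lambda_v: float) -> float:
    """Hypothetical linear gate: answer reward plus a lambda_v-weighted visual reward.

    This is an assumed form for illustration, not the paper's stated formula.
    """
    if not 0.0 <= lambda_v <= 1.0:
        raise ValueError("lambda_v is assumed to lie in [0, 1]")
    return answer_reward + lambda_v * visual_reward


# With lambda_v = 0.0 the visual term vanishes, so answer correctness alone
# drives the reward; with lambda_v = 1.0 the visual term counts in full.
for lambda_v in (0.0, 0.5, 1.0):
    total = combined_reward(answer_reward=1.0, visual_reward=0.8, lambda_v=lambda_v)
    print(f"lambda_v={lambda_v}: total reward={total:.2f}")
```

Under this assumed form, a low-dependency sample is never penalized for weak visual grounding, which is the behavior the summary attributes to the method.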

Load-bearing premise

Visual dependency ratings can be reliably and accurately assigned to samples, and balancing the two reward types according to those ratings will reduce bias without creating new training inaccuracies.

What would settle it

A controlled comparison in which models trained with uniform visual rewards across all samples reach equal or higher accuracy on multimodal math benchmarks than models trained with dependency-guided reward balancing.
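One way to set up such a comparison, sketched under assumptions: keep the training pipeline fixed and vary only the rating vector fed to the reward. The condition names and the helper `make_rating_conditions` are hypothetical, and the sample ratings are made up for illustration.

```python
import random


def make_rating_conditions(ratings, seed=0):
    """Build rating vectors for three training conditions (hypothetical naming)."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    # Shuffling keeps the same share of each level but breaks the link
    # between a rating and the sample it was assigned to.
    rng.shuffle(shuffled)
    return {
        "dependency_guided": list(ratings),
        "uniform": [1.0] * len(ratings),  # every sample gets the full visual reward
        "shuffled": shuffled,
    }


# Made-up ratings for six samples, using the three reported levels.
ratings = [0.0, 0.0, 0.5, 1.0, 1.0, 0.0]
conditions = make_rating_conditions(ratings)
for name, values in conditions.items():
    print(name, values)
```

If the dependency-guided condition does not beat the uniform and shuffled conditions on held-out benchmarks, the gains would more plausibly come from the training schedule than from the ratings.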

Figures

Figures reproduced from arXiv: 2606.17888 by Haidong Yuan, Haokun Zhao, Long Ma, Songjun Cao, Wanshi Xu.

**Figure 1.** Overview of the framework, which begins by constructing a dataset with fine-grained visual dependency annotations. Stage 1 employs a Retrieval-Perception Synergy strategy during supervised fine-tuning (SFT) to enhance visual perception. Stage 2 utilizes Multi-Dimensional Visual-Dependent Reinforcement Learning (MDVDRL). By integrating the two visual rewards and leveraging the dependency score (λv) as a ga…

**Figure 2.** Pearson correlation coefficient between Visual Retrieval Recall and Answer Correctness across different visual dependency levels (λv), when p < 0.05. The correlation significantly increases as the visual dependency of the problem rises, validating our strategy to weight visual rewards based on λv.

**Figure 3.** Distribution of Visual Dependency Levels in Ma… The figure illustrates the proportional distribution of the three categories: Low Dependency (λv = 0.0) 48.4% (2628), Medium Dependency (λv = 0.5) 14.8% (802), and High Dependency (λv = 1.0) 36.8% (1995). This distribution highlights our strategy to cover a diverse range of multimodal scenarios, from text-dominant problems to those requiring intensive visual interpretation.
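The per-level analysis described in the Figure 2 caption can be sketched as follows: group samples by their dependency rating and compute the Pearson correlation between retrieval recall and answer correctness within each group. The record layout and function names are assumptions, the data is made up, and the significance filter (p < 0.05) mentioned in the caption is not reproduced here.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlation_by_level(records):
    """records: iterable of (lambda_v, retrieval_recall, answer_correct)."""
    groups = defaultdict(lambda: ([], []))
    for lambda_v, recall, correct in records:
        groups[lambda_v][0].append(recall)
        groups[lambda_v][1].append(float(correct))
    return {level: pearson(xs, ys) for level, (xs, ys) in sorted(groups.items())}


# Made-up records, not taken from the paper.
records = [
    (0.0, 0.2, True), (0.0, 0.9, True), (0.0, 0.5, False), (0.0, 0.7, True),
    (1.0, 0.1, False), (1.0, 0.4, False), (1.0, 0.8, True), (1.0, 0.9, True),
]
for level, r in correlation_by_level(records).items():
    print(f"lambda_v={level}: r={r}")
```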

Original abstract

Chain-of-Thought (CoT) reasoning has extended from purely linguistic domains to multimodal scenarios; however, existing approaches often treat visual inputs as homogeneous or auxiliary signals, failing to capture the intricate and sample-specific dependencies between text and images in mathematical problem-solving. This gives rise to two core issues: first, the supervisory signals for visual content are generalized and coarse-grained, lacking adaptation to the actual necessity of visual information in each sample; second, training feedback becomes inaccurate when visual rewards are uniformly applied without distinguishing the complementary relationships among inputs. These limitations hinder models from achieving precise multimodal reasoning. In this work, we propose a framework for modeling fine-grained visual dependencies in mathematical reasoning. We first construct the MathVis-Fine dataset, augmenting fine-grained visual annotations with visual dependency ratings. Building upon this dataset, we introduce a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards and visual grounding rewards according to the intrinsic visual dependency level of each sample, thereby mitigating reward bias and improving supervision accuracy. Extensive experiments demonstrate that the MathVis-Fine framework effectively enhances visual perception progressively based on visual dependency, offering a more precise training framework for multimodal mathematical reasoning. We will release the dataset upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds a dataset with visual dependency ratings plus a two-stage reward-balancing trainer, but offers no checks on whether those ratings are reliable or drive the gains.

The core new pieces are the MathVis-Fine dataset, which adds explicit visual dependency ratings to math problems, and the two-stage progressive training that scales visual-grounding rewards according to those ratings. This directly targets the real problem that some multimodal math items need the image heavily while others do not, and the progressive structure is a clean way to avoid uniform supervision.

The approach builds on existing ideas about visual grounding and reward balancing, so the novelty sits mainly in making the dependency explicit and sample-specific. That framing is useful for anyone working on CoT in diagrams or charts.

The main gap is the missing validation for the ratings themselves. Nothing in the abstract shows inter-annotator agreement, correlation with actual visual-ablation performance, or an ablation that replaces the ratings with random or uniform values. Without those checks, any reported gains could come from the two-stage schedule or extra data rather than the dependency mechanism. The abstract also gives no numbers, error bars, or dataset stats, so the claim that reward bias is mitigated stays untested on the page.
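For the inter-annotator agreement check, one common statistic is Cohen's kappa between two annotators' ratings. The sketch below is a plain implementation under the assumption that two annotators each assign one of the three levels to every sample; the paper does not say how many annotators were used or which statistic applies, and the labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of samples where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up ratings from two hypothetical annotators.
annotator_a = [0.0, 0.0, 0.5, 1.0, 1.0, 0.0, 0.5, 1.0]
annotator_b = [0.0, 0.5, 0.5, 1.0, 1.0, 0.0, 0.0, 1.0]
print(f"kappa = {cohens_kappa(annotator_a, annotator_b):.3f}")
```

Because the levels are ordered, a weighted kappa that penalizes a 0.0 versus 1.0 disagreement more than a 0.0 versus 0.5 one may be a better fit; the unweighted version is shown for brevity.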

This is aimed at researchers building multimodal math reasoners or educational tools. It is worth a serious referee if the full paper supplies the missing ablations and results; the underlying concern is legitimate even if the current evidence is thin. I would send it to review rather than desk-reject.

Referee Report

1 major / 1 minor

Summary. The paper claims that existing multimodal CoT methods apply coarse, uniform visual supervision that fails to account for sample-specific text-image dependencies in mathematical reasoning, leading to inaccurate feedback. To address this, the authors construct the MathVis-Fine dataset by augmenting fine-grained visual annotations with visual dependency ratings. They then introduce a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards against visual grounding rewards according to each sample's intrinsic visual dependency level, thereby mitigating reward bias. Extensive experiments are stated to demonstrate that the framework enhances visual perception progressively based on these ratings, providing a more precise training approach for multimodal mathematical reasoning. The dataset will be released upon acceptance.

Significance. If the central mechanism holds after validation, the work could advance multimodal reasoning by making visual supervision adaptive to necessity rather than uniform, potentially reducing bias in reward signals and improving precision on problems where visual information varies in importance. The dataset release would be a positive contribution for the community. However, the significance is currently limited by the absence of evidence that the dependency ratings are reliable or that the balancing specifically drives gains beyond the progressive training structure itself.

major comments (1)

[Dataset Construction] The manuscript provides no description of how visual dependency ratings are assigned to samples, nor any validation such as inter-annotator agreement, correlation with visual ablation performance, or an ablation replacing ratings with uniform/random values. This is load-bearing for the central claim that reward balancing per rating mitigates bias specifically via dependency guidance; without these checks, observed improvements could arise from the two-stage paradigm or increased dataset size rather than the proposed mechanism.

minor comments (1)

[Abstract] The claim of 'extensive experiments' is made without any quantitative results, error bars, dataset statistics, or baseline comparisons being referenced, making it difficult to assess the strength of the reported improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will incorporate the requested details and validations into the revised manuscript.

Referee: [Dataset Construction] The manuscript provides no description of how visual dependency ratings are assigned to samples, nor any validation such as inter-annotator agreement, correlation with visual ablation performance, or an ablation replacing ratings with uniform/random values. This is load-bearing for the central claim that reward balancing per rating mitigates bias specifically via dependency guidance; without these checks, observed improvements could arise from the two-stage paradigm or increased dataset size rather than the proposed mechanism.

Authors: We agree that the current manuscript lacks sufficient detail on the visual dependency rating assignment process and supporting validations, which weakens the central claim. In the revised version we will expand the Dataset Construction section to describe: (1) the annotation guidelines and criteria used to assign ratings (e.g., explicit rubrics distinguishing samples where visual information is necessary versus supplementary), (2) inter-annotator agreement statistics, (3) correlation between ratings and performance degradation under visual ablation, and (4) an ablation that replaces the learned dependency ratings with uniform or random values while keeping the two-stage training structure fixed. These additions will directly test whether the observed gains are attributable to the dependency-guided reward balancing.

revision: yes
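Item (3) of the response, relating ratings to performance degradation under visual ablation, could be checked along these lines. This is a sketch under assumptions: each record pairs a sample's rating with whether a model answered correctly with and without the image; the record layout, the function name, and the data are all hypothetical.

```python
from collections import defaultdict


def ablation_drop_by_level(records):
    """records: iterable of (lambda_v, correct_with_image, correct_without_image).

    Returns, per rating level, accuracy with the image minus accuracy without it.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [count, correct_with, correct_without]
    for lambda_v, with_image, without_image in records:
        bucket = totals[lambda_v]
        bucket[0] += 1
        bucket[1] += int(with_image)
        bucket[2] += int(without_image)
    return {
        level: (correct_with - correct_without) / count
        for level, (count, correct_with, correct_without) in sorted(totals.items())
    }


# Made-up outcomes, not taken from the paper.
records = [
    (0.0, True, True), (0.0, True, True), (0.0, False, False),
    (0.5, True, False), (0.5, True, True),
    (1.0, True, False), (1.0, True, False), (1.0, False, False),
]
for level, drop in ablation_drop_by_level(records).items():
    print(f"lambda_v={level}: accuracy drop without image = {drop:.2f}")
```

Reliable ratings would be expected to show a larger drop at higher levels; a flat profile would suggest the ratings do not track how much the image is actually needed.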

Circularity Check

0 steps flagged

No significant circularity; framework is self-contained

The paper constructs the MathVis-Fine dataset externally by augmenting annotations with visual dependency ratings and then applies a two-stage training procedure that balances rewards according to those ratings. No equations, fitted parameters, or derivations are presented that reduce the claimed performance gains to the inputs by construction. No self-citations are invoked as load-bearing premises, and the central claims rest on the independent dataset construction and training paradigm rather than renaming or self-referential definitions. This is the normal case of an externally grounded proposal with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify any free parameters, axioms, or invented entities; the visual dependency ratings appear to be part of the new dataset construction.

[1] [1]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[2] [2]

2023 , eprint=

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

2023

[3] [3]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[4] [4]

European Conference on Information Retrieval , pages=

Cross-modal retrieval for knowledge-based visual question answering , author=. European Conference on Information Retrieval , pages=. 2024 , organization=

2024

[5] [5]

arXiv preprint arXiv:2410.08876 , year=

RoRA-VLM: Robust Retrieval-Augmented Vision Language Models , author=. arXiv preprint arXiv:2410.08876 , year=

arXiv

[6] [6]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[7] [7]

arXiv preprint arXiv:2407.12735 , year=

EchoSight: Advancing Visual-Language Models with Wiki Knowledge , author=. arXiv preprint arXiv:2407.12735 , year=

arXiv

[8] [8]

arXiv preprint arXiv:2411.16863 , year=

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering , author=. arXiv preprint arXiv:2411.16863 , year=

arXiv

[9] [9]

Zhang, Tao and Zhang, Ziqi and Ma, Zongyang and Chen, Yuxin and Qi, Zhongang and Yuan, Chunfeng and Li, Bing and Pu, Junfu and Zhao, Yuxuan and Xie, Zehua and others , journal=. mR \^

[10] [10]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[11] [11]

arXiv preprint arXiv:2302.11713 , year=

Can pre-trained vision and language models answer visual information-seeking questions? , author=. arXiv preprint arXiv:2302.11713 , year=

arXiv

[12] [12]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[13] [13]

arXiv preprint arXiv:2302.00923 , year=

Multimodal chain-of-thought reasoning in language models , author=. arXiv preprint arXiv:2302.00923 , year=

Pith/arXiv arXiv

[14] [14]

2023 , eprint=

Evaluating Object Hallucination in Large Vision-Language Models , author=. 2023 , eprint=

2023

[15] [15]

2025 , eprint=

Hallucination of Multimodal Large Language Models: A Survey , author=. 2025 , eprint=

2025

[16] [16]

Advances in Neural Information Processing Systems , volume=

Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models , author=. Advances in Neural Information Processing Systems , volume=

[17] [17]

Advances in Neural Information Processing Systems , volume=

Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning , author=. Advances in Neural Information Processing Systems , volume=

[18] [18]

arXiv preprint arXiv:2505.17020 , year=

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms , author=. arXiv preprint arXiv:2505.17020 , year=

arXiv

[19] [19]

arXiv preprint arXiv:2502.04326 , year=

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs , author=. arXiv preprint arXiv:2502.04326 , year=

Pith/arXiv arXiv

[20] [20]

5-vl technical report , author=

Seed1. 5-vl technical report , author=. arXiv preprint arXiv:2505.07062 , year=

Pith/arXiv arXiv

[21] [21]

arXiv preprint arXiv:2505.20199 , year=

Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking , author=. arXiv preprint arXiv:2505.20199 , year=

arXiv

[22]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403.

[23]

Introducing Visual Perception Token into Multimodal Large Language Model. arXiv preprint arXiv:2502.17425.

[24]

Interleaved-Modal Chain-of-Thought. 2025.

[25]

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine. 2024.

[26]

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought. 2025.

[27]

Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.

[28]

Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319.

[29]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. International Conference on Learning Representations (ICLR).

[30]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021.

[31]

GPT-4o System Card. 2024.

[32]

PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System. 2022.

[33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.

[34]

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency. 2025.

[35]

MultiMath: Bridging visual and mathematical reasoning for large language models. arXiv preprint arXiv:2409.00147.

[36]

Math-LLaVA: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294.

[37]

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics. arXiv preprint arXiv:2501.04686.

[38]

R-CoT: Reverse chain-of-thought problem generation for geometric reasoning in large multimodal models. arXiv preprint arXiv:2410.17885.

[39]

AutoGeo: Automating geometric image dataset creation for enhanced geometry understanding. arXiv preprint arXiv:2409.09039.

[40]

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? European Conference on Computer Vision, 2024.

[41]

Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. Annual Meeting of the Association for Computational Linguistics.

[42]

GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning. ArXiv.

[43]

G-LLaVA: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370.

[44]

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae. NeurIPS.

[45]

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. ArXiv.

[46]

Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

[47]

Chain-of-table: Evolving tables in the reasoning chain for table understanding. arXiv preprint arXiv:2401.04398.

[48]

TableGPT: Towards unifying tables, nature language and commands into one GPT. arXiv preprint arXiv:2307.08674.

[49]

LayoutLLM: Layout instruction tuning with large language models for document understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]

MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning. arXiv preprint arXiv:2503.07365.

[51]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization. arXiv preprint arXiv:2503.12937.

[52]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization. arXiv preprint arXiv:2503.10615.

[53]

Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749.

[54]

OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement. 2025.

[55]

Chen, Liang and Li, Lei and Zhao, Haozhe and Song, Yifan and Vinci

[56]

Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models. arXiv preprint arXiv:2403.12966.

[57]

Chain of Images for Intuitively Reasoning. arXiv preprint arXiv:2311.09241.

[58]

Insight-V: Exploring long-chain visual reasoning with multimodal large language models. arXiv preprint arXiv:2411.14432.

[59]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[60]

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[61]

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought. arXiv preprint arXiv:2501.07542.

[62]

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[63]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. 2024.

[64]

Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.

[65]

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. 2023.

[66]

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO. 2025.

[67]

Model Card Addendum: Claude 3.5 Haiku and Upgraded Claude 3.5 Sonnet.

[68] [69]

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics. 2025.

[69] [70]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning. 2025.

[70] [71]

EvolvingLMMs-Lab. 2025.

[71] [72]

Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442.

[72] [73]

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[73] [74]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding. 2024.

[74] [75]

Qwen2.5-VL Technical Report. 2025.

[75] [76]

Qwen Team. 2025.

[76] [77]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization. 2025.

[77] [78]

Sigmoid Loss for Language Image Pre-Training. 2023.

[78] [79]

PointLLM: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911.

[79] [80]

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615.

[80] [81]

Exploring the Potential of Encoder-free Architectures in 3D LMMs. arXiv preprint arXiv:2502.09620.