pith. machine review for the scientific record. sign in

arxiv: 2503.10615 · v2 · submitted 2025-03-13 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Authors on Pith no claims yet

Pith reviewed 2026-05-16 00:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal reasoningcross-modal formalizationvision-language modelsreinforcement learningreasoning benchmarkssupervised fine-tuningR1-Onevision
0
0 comments X

The pith

Converting images to formal textual representations lets a new model reason more precisely about visual content and outperform GPT-4o on multimodal benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces R1-Onevision, which uses a pipeline to turn visual inputs into structured text forms that support exact step-by-step language reasoning. This addresses the common failure of vision-language models to integrate images and text reliably for hard problems. The authors build a dataset of detailed multimodal reasoning traces, train the model first with supervised fine-tuning and then reinforcement learning, and test it on a new benchmark spanning junior high school through university exams. If the approach holds, it shows that formal text conversion can close the gap between perception and deep reasoning without discarding essential visual details.

Core claim

R1-Onevision achieves state-of-the-art results by applying a cross-modal reasoning pipeline that converts images into formal textual representations, enabling precise language-based reasoning; the same pipeline supports construction of a large annotated dataset and training via supervised fine-tuning followed by reinforcement learning, yielding superior performance over GPT-4o and Qwen2.5-VL across multiple challenging multimodal benchmarks including the new R1-Onevision-Bench aligned with educational stages.

What carries the argument

The cross-modal reasoning pipeline that transforms images into formal textual representations for subsequent language-based reasoning.

If this is right

  • The model generalizes across domains from junior high school to university-level exam questions.
  • Step-by-step textual reasoning traces produced by the pipeline improve both accuracy and interpretability compared with direct vision-language baselines.
  • Reinforcement learning applied after supervised fine-tuning further strengthens robustness on out-of-distribution multimodal problems.
  • The new R1-Onevision-Bench provides a graded test suite that measures reasoning capability by educational stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same image-to-formal-text conversion could be applied to video sequences or 3D scenes to support temporal or spatial reasoning.
  • If the formal representations prove lossless for most tasks, they could serve as a common intermediate language linking multiple input modalities.
  • Educational tools might use the generated reasoning traces to produce transparent explanations for students at different grade levels.

Load-bearing premise

Converting an image into a formal textual representation preserves all critical visual information needed for accurate reasoning.

What would settle it

A direct comparison in which the same base model is run once with the formal text pipeline and once with raw image input on tasks that require fine-grained visual details, such as exact spatial counting or subtle pattern recognition, showing no accuracy gain or a loss for the pipeline version.

read the original abstract

Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textural representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces R1-Onevision, a multimodal reasoning model that uses a cross-modal pipeline to transform images into formal textual representations, enabling precise language-based reasoning. It constructs the R1-Onevision dataset with detailed step-by-step multimodal annotations across domains, trains the model via supervised fine-tuning followed by reinforcement learning, and introduces R1-Onevision-Bench, a new benchmark aligned with human educational stages from junior high school through university level. The central claim is that this yields state-of-the-art performance, outperforming GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

Significance. If the performance claims and pipeline validity are substantiated with rigorous experiments, the cross-modal formalization approach could meaningfully advance multimodal reasoning by converting visual input into structured text that supports reliable step-by-step inference and better generalization. The education-stage benchmark is a constructive addition for evaluating reasoning progression. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described, so these strengths are not present to credit.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL' is unsupported because the manuscript contains no quantitative tables, ablation studies, error analysis, or matched-condition comparisons; this directly undermines the central performance claim.
  2. [Method] Cross-modal reasoning pipeline description: no implementation details, pseudocode, or validation experiments are provided for how images are transformed into formal textual representations or for confirming that critical visual information is preserved without loss; this is load-bearing for the claim that the pipeline enables precise reasoning.
minor comments (2)
  1. [Dataset] The description of the R1-Onevision dataset would benefit from explicit statistics on domain coverage, annotation length, and example instances to allow reproducibility assessment.
  2. [Method] Notation for the formal textual representation step is introduced without a clear diagram or formal definition, which could be clarified for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and methodological details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that 'Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL' is unsupported because the manuscript contains no quantitative tables, ablation studies, error analysis, or matched-condition comparisons; this directly undermines the central performance claim.

    Authors: We acknowledge the referee's concern. While the manuscript includes an experiments section with performance comparisons, we agree that the current version lacks sufficient quantitative tables, ablation studies, error analysis, and explicit matched-condition comparisons to fully substantiate the SOTA claim in the abstract. In the revised manuscript, we will add detailed tables reporting exact metrics on R1-Onevision-Bench and additional multimodal reasoning benchmarks, include ablation studies isolating the cross-modal formalization and RL components, and provide error analysis with direct side-by-side comparisons to GPT-4o and Qwen2.5-VL. revision: yes

  2. Referee: [Method] Cross-modal reasoning pipeline description: no implementation details, pseudocode, or validation experiments are provided for how images are transformed into formal textual representations or for confirming that critical visual information is preserved without loss; this is load-bearing for the claim that the pipeline enables precise reasoning.

    Authors: We agree that additional details are required for reproducibility and to validate the pipeline's effectiveness. In the revised version, we will expand the method section with concrete implementation details on the image-to-formal-text transformation (including the structured representation format and extraction rules), provide pseudocode for the full cross-modal pipeline, and add validation experiments such as quantitative information-preservation metrics and human evaluations confirming that critical visual elements are retained without loss. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core argument consists of proposing a cross-modal pipeline to convert images to formal text, using that pipeline to annotate a new dataset, training via SFT+RL, and evaluating on a newly introduced educational-stage benchmark. These steps are constructive and empirical; the SOTA performance claims rest on experimental comparisons rather than any equation or claim that reduces by construction to fitted inputs, self-citations, or renamed prior results. No load-bearing derivation equates a prediction to its own training signal or invokes an unverified uniqueness theorem from the same authors. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full technical details are unavailable.

pith-pipeline@v0.9.0 · 5561 in / 1044 out tokens · 57663 ms · 2026-05-16T00:14:57.086487+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

  2. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Multimodal AI models for physics reasoning lose performance when information shifts from text to images, and RLVR training gains often come from non-visual textual or distributional cues rather than actual visual evidence.

  3. SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

  4. ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

    cs.CR 2026-04 unverdicted novelty 7.0

    ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...

  5. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  6. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  7. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  8. CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.

  9. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  10. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  13. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  14. SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

    cs.CV 2026-04 unverdicted novelty 5.0

    SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

  15. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  16. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  17. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    cs.CV 2025-04 unverdicted novelty 5.0

    Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

  18. Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

    cs.CV 2026-05 unverdicted novelty 4.0

    A new CoVQD-guided retrieval-augmented generation framework improves multimodal LLMs on visual question answering by using structured reasoning to retrieve better external knowledge.

  19. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 18 Pith papers · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

  2. [2]

    Large language models for mathemat- ical reasoning: Progresses and challenges

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathemat- ical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024. 2

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 2, 7

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen- rique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu- ating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 1

  5. [5]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling. arXiv preprint arXiv:2412.05271, 2024. 2, 7

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Train- ing verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 1

  7. [7]

    Opencompass: A universal evaluation platform for foundation models, 2023

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models, 2023. 2, 6

  8. [8]

    Cruxeval: A benchmark for code reasoning, understanding and execution

    Alex Gu, Baptiste Rozi `ere, Hugh Leather, Armando Solar- Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. In International Conference on Machine Learning, 2024. 1

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1, 3

  10. [10]

    Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

    Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237,

  11. [11]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Informa- tion Processing Systems Track on Datasets and Benchmarks,

  12. [12]

    Towards reason- ing in large language models: A survey

    Jie Huang and Kevin Chen-Chuan Chang. Towards reason- ing in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022. 2

  13. [13]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019. 3

  14. [14]

    Lawrence Zitnick, and Ross Girshick

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elemen- tary visual reasoning, 2016. 6

  15. [15]

    Fig- ureqa: An annotated figure dataset for visual reasoning

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkin- son, ´Akos K´ad´ar, Adam Trischler, and Yoshua Bengio. Fig- ureqa: An annotated figure dataset for visual reasoning. In International Conference on Learning Representations Workshop Track, 2018. 3

  16. [16]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information pro- cessing systems, 35:22199–22213, 2022. 2

  17. [17]

    LLaV A-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-onevision: Easy visual task transfer. Transactions on Machine Learning Research,

  18. [18]

    Mathvista: Evaluating mathe- matical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathe- matical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Repre- sentations, 2024. 2, 6, 12

  19. [19]

    Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

    AI Meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog. Retrieved December, 20:2024, 2024. 2

  20. [20]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024. 2, 7

  21. [21]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025. 2

  22. [22]

    Reasoning with large lan- guage models, a survey

    Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. Reasoning with large lan- guage models, a survey. arXiv preprint arXiv:2407.11511 ,

  23. [23]

    We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large mul- timodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. 2, 6

  24. [24]

    Zer- obench: An impossible visual benchmark for contemporary large multimodal models

    Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, et al. Zer- obench: An impossible visual benchmark for contemporary large multimodal models. arXiv preprint arXiv:2502.09696,

  25. [25]

    Llamav-o1: Rethink- ing step-by-step visual reasoning in llms.arXiv:2501.06186,

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed 9 Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav- o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025. 2

  26. [26]

    Mea- suring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Mea- suring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Sys- tems, 37:95095–95169, 2025. 2, 3, 6, 12

  27. [27]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 7

  28. [28]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems, 2024. 1

  29. [29]

    Chain-of-thought prompting elicits reasoning in large lan- guage models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models. Advances in neural information processing systems, 35:24824–24837, 2022. 1

  30. [30]

    Large language models are better reasoners with self-verification

    Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. arXiv preprint arXiv:2212.09561, 2022. 2

  31. [31]

    Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...

  32. [32]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022a

    Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step- by-step. arXiv preprint arXiv:2411.10440, 2024. 2

  33. [33]

    Llava-cot: Let vision language models reason step- by-step, 2025

    Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step- by-step, 2025. 2, 7

  34. [34]

    Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search. arXiv preprint arXiv:2412.18319 , 2024. 2

  35. [35]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023. 2

  36. [36]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 7

  37. [37]

    Raven: A dataset for relational and analogical vi- sual reasoning

    Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song- Chun Zhu. Raven: A dataset for relational and analogical vi- sual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 5317– 5327, 2019. 3

  38. [38]

    Humaneval-v: Evaluating visual understanding and rea- soning abilities of large multimodal models through coding tasks

    Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, and Jacky Ke- ung. Humaneval-v: Evaluating visual understanding and rea- soning abilities of large multimodal models through coding tasks. arXiv preprint arXiv:2410.12381, 2024. 2

  39. [39]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, page 169–186, 2024

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, page 169–186, 2024. 2, 6

  40. [40]

    Cumulative Reasoning with Large Language Models

    Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi- Chih Yao. Cumulative reasoning with large language mod- els. arXiv preprint arXiv:2308.04371, 2023. 1

  41. [41]

    Dynamath: A dynamic visual bench- mark for evaluating mathematical reasoning robustness of vision language models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual bench- mark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836,

  42. [42]

    Subsequently, Section B encompasses a presentation of supplementary visualization results

    3 10 Roadmap of Appendix The structure of the appendix is delineated as follows: De- scriptions of the relevant experimental details are provided in the Section A. Subsequently, Section B encompasses a presentation of supplementary visualization results. A. More Implementation Details A.1. Data Details Our cross-modal reasoning pipeline consists of: Data ...

  43. [43]

    Simulate reasoning by imagining you are looking at the image, and act as if you can see it

    Simulate image reasoning: Treat the image cap- tion as an image. Simulate reasoning by imagining you are looking at the image, and act as if you can see it. However, avoid visualization as a step in the reasoning process

  44. [44]

    The image shows

    Direct visual language: Frame observations as if you are directly viewing the image (e.g., “The image shows...”). Avoid reasoning through image caption or description

  45. [45]

    based on the caption

    Forbidden phrases: Avoid phrases like “based on the caption”, “based on the description”, “visualiz- ing the image”. Question: {question} Image Content: {caption}. Then, we introduce “role play” to bridge the gap in in real image understanding and then filter the data. The prompts are as follows: Revise the provided Chain of Thought (CoT) to fol- low thes...

  46. [46]

    based on the description

    Style Shift: Convert all references to im- age description-based reasoning into direct image- based reasoning. For example: Replace phrases like “based on the description” “based on the caption” with “the image shows” or “as seen in the image”

  47. [47]

    Apply these changes rigorously to ensure that the final CoT reflects direct image interpretation, unin- fluenced by description, caption, image visualiza- tion

    Remove image visualization step: If the CoT contains an inference step for image visualization, remove it and rewrite the CoT to reflect reasoning directly on the image itself, rather than reasoning after visualization from the image description. Apply these changes rigorously to ensure that the final CoT reflects direct image interpretation, unin- fluenc...

  48. [48]

    The assistant’s response has correct reasoning steps

  49. [49]

    The assistant’s response has the final reasoning answer, and the final reasoning answer is consistent with the meaning of the standard answer

  50. [50]

    The assistant’s response is based on the reasoning process of the image, not the image description or caption

  51. [51]

    First output the thinking process in <think> < /think> tags and then output the final answer in <answer> </answer> tags

    There are no steps in the assistant’s response that are irrelevant to the reasoning, and each reasoning step is closely related. Standard answer: {gt} Assistant’s response: {augmented answer} Output: A.2. Model Details For model training, we utilized the llama-factory and adopted a full fine-tuning startegy to optimize the model’s performance. Following t...

  52. [52]

    What is the total area of the unshaded region? Choices: A. 12 B. 18 C. 22 D. 24 E. 30 The image shows a large square with a side length of 6 units, which has a total area of 36 square units. Inside this large square, there are three smaller shaded squares with side lengths of 3, 2, and 1 units. The areas of these shaded squares are 9, 4, and 1 square unit...