
arxiv: 2605.27960 · v1 · pith:67JGF7TBnew · submitted 2026-05-27 · 💻 cs.CV

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Pith reviewed 2026-06-29 13:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · agentic reinforcement learning · super-resolution · visual reasoning · complex scene understanding · curriculum learning · fine-grained inspection · two-round reasoning

The pith

Mags-RL equips MLLMs with an external super-resolution magnifying glass agent that performs autonomous two-round reasoning to improve accuracy on cluttered scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mags-RL as an agentic reinforcement learning method that adds a super-resolution magnifying glass agent to multimodal large language models. The model first generates an initial rationale and selects regions of interest on its own, then invokes the agent to crop and upscale those regions for a verification round that produces the final answer. A curriculum learning schedule allows this training to succeed with as few as 40 samples. Readers would care because MLLMs commonly fail on high-density or cluttered images, and the approach avoids the extra bounding-box annotations required by prior methods while still delivering stronger results on VSR, TallyQA, and GQA subsets.

Core claim

The central claim is that an external super-resolution magnifying glass agent, invoked through agentic RL, lets MLLMs conduct two-round reasoning: an initial pass produces a rationale and autonomously identifies regions of interest without extra annotations, after which the agent crops and upscales those regions so the model can revisit and verify its earlier reasoning to reach a final answer, yielding superior performance on complex scene reasoning benchmarks.

What carries the argument

Two-round reasoning loop in which the first round generates a rationale and selects regions autonomously, and the second round invokes a super-resolution agent to upscale those regions before verification.
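The control flow of this loop can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `crop`, `upscale_nearest`, `FirstRound`, and `two_round_answer` are hypothetical, the image is a toy nested list of pixel values, and nearest-neighbour upscaling stands in for the super-resolution agent, whose architecture is not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); x2 and y2 are exclusive
Image = List[List[int]]          # toy grayscale image: rows of pixel values


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def upscale_nearest(image: Image, factor: int) -> Image:
    """Stand-in for the SR agent: nearest-neighbour upscaling (an assumption)."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


@dataclass
class FirstRound:
    rationale: str    # initial reasoning text
    boxes: List[Box]  # regions the model chose to inspect


def two_round_answer(
    image: Image,
    question: str,
    first_round: Callable[[Image, str], FirstRound],
    second_round: Callable[[Image, str, str, List[Image]], str],
    factor: int = 2,
) -> str:
    """Round 1 picks regions, the stand-in agent magnifies them, round 2 verifies."""
    draft = first_round(image, question)
    zoomed = [upscale_nearest(crop(image, box), factor) for box in draft.boxes]
    return second_round(image, question, draft.rationale, zoomed)
```

Passing the two rounds in as callables keeps the sketch independent of any particular model API; in the paper both rounds are produced by the same policy.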

If this is right

  • Superior performance against recent competing methods on VSR, TallyQA, and GQA subsets with precise visual grounding.
  • Data-efficient RL training that reaches reasonable performance using only 40 samples via curriculum learning.
  • Region identification occurs without relying on additional human annotations such as bounding boxes.
  • The two-round structure separates initial rationale generation from detail verification, allowing the model to correct early mistakes after upscaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-round autonomous selection plus external agent pattern could transfer to other detail-sensitive multimodal tasks such as medical imaging or document understanding.
  • Removing the need for pre-provided bounding boxes may encourage future MLLM designs to rely more on internal rationale steps rather than external detectors.
  • Curriculum-based training on tiny datasets hints that similar agentic loops might work in low-resource or few-shot multimodal settings beyond the three benchmarks tested.

Load-bearing premise

The initial rationale generation step can reliably identify regions of interest without any additional annotations, and the invoked super-resolution agent will supply fine-grained details that meaningfully improve verification in the second round.

What would settle it

If the super-resolution upscaling step yields no measurable gain in final-answer accuracy over plain low-resolution crops on the same test sets, the claimed benefit of the magnifying glass agent would be refuted.
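One way such a comparison might be scored is sketched below. This is an assumption about how a reader could run the check, not the paper's evaluation protocol: the function name `paired_ablation` and its inputs (per-item correctness flags for the two variants on identical test items) are hypothetical, and the exact McNemar (sign) test on discordant pairs is a choice made here for illustration.

```python
from math import comb
from typing import Dict, Sequence


def paired_ablation(
    correct_sr: Sequence[bool], correct_plain: Sequence[bool]
) -> Dict[str, float]:
    """Compare two variants evaluated on the same items, in the same order."""
    if len(correct_sr) != len(correct_plain) or not correct_sr:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(correct_sr)
    # Discordant pairs: items where exactly one variant is correct.
    sr_only = sum(a and not b for a, b in zip(correct_sr, correct_plain))
    plain_only = sum(b and not a for a, b in zip(correct_sr, correct_plain))
    delta = (sum(correct_sr) - sum(correct_plain)) / n
    # Exact two-sided McNemar test: binomial(m, 0.5) on the discordant pairs.
    m = sr_only + plain_only
    if m == 0:
        p_value = 1.0
    else:
        k = min(sr_only, plain_only)
        p_value = min(1.0, 2 * sum(comb(m, i) for i in range(k + 1)) / 2**m)
    return {
        "accuracy_delta": delta,
        "sr_only": sr_only,
        "plain_only": plain_only,
        "p_value": p_value,
    }
```

A delta near zero with a large p-value would be the outcome that undercuts the magnifying glass claim; a clear positive delta driven by many `sr_only` items would support it.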

Figures

Figures reproduced from arXiv: 2605.27960 by Gen Li, Hao Wang, Peijie Qiu, Prayag Tiwari, Shao Tang, Wenhui Zhu, Xiaobing Yu, Xin Li, Xiwen Chen, Xuanzhao Dong, Yalin Wang, Yanxi Chen, Yujian Xiong, Zhipeng Wang.

Figure 1: Illustration of model responses to counting questions from TallyQA. A.-D. show the responses produced by CoT, Zoom-Refine, GRIT, and Ours (i.e., Mags-RL), respectively. The Red boxes denote the image crops generated from the model-predicted coordinates. Key syntax is omitted for clarity. tency and data efficiency, we train the model using the Group-Relative Policy Optimization (GRPO) [34] algorithm with a…
Figure 2: Overview of the Mags-RL pipeline, which is comprised of a two-round reasoning chain: i) the LLM policy generates an initial rationale and determines a zoom-in region based on the multimodal input; and ii) the LLM policy performs an additional reasoning step to verify the initial logic and generate the final answer. In between these two rounds, a magnifying super-resolution (SR) agent is employed. Mags-RL e…
Figure 3: Illustration of the Reward Design for Mags-RL.
Figure 4: Comparison of generation results across curriculum learning stages. In stage 1, the model successfully acquires the ability to perform reasoning, trigger the external SR agent, and execute self-verification. However, its bounding box predictions remain relatively conservative, often under-covering relevant targets. In stage 2, the model not only maintains structural correctness but also demonstrates improv…
Figure 5: Illustrations of model responses from the Zoom-Medium benchmark. A. compares Mags-RL against the direct query baseline on the VSR dataset. B. compares Mags-RL against the GRIT baseline on the GQA dataset, and C. compares our method against the Zoom-Refine baseline on the TallyQA dataset. performance in spatial reasoning and visual grounding through self-verification, declines from 45.01% to 18.15%. Beyond …
Figure 6: Illustration of model responses during CL stage-1. As training with Mags-RL progresses, the model gradually reduces overly descriptive outputs (sub-panel figures A and B) and instead tends to directly target the area of interest, thereby triggering the external visual agent. Key syntax is omitted for clarity.
Figure 7: Comparison of model responses with and without the SR module. The Blue boxes highlight outputs from the Crop-and-Resize module, while the Green boxes indicate the response from our Stage-1 CL process.
Figure 8: Ablation study of the Curriculum Learning (CL) strategy.
Figure 9: The system prompt for curriculum learning Stage 1.
Figure 10: The prompt suffix in stage 1: "First, think between <think> and </think>, using <zoom>[[x1, y1, x2, y2]]</zoom> if details are unclear. Then, after receiving system feedback, provide your final reasoning in <rethink>... </rethink> and the final answer in <answer>...</answer>"
Figure 11: Modified system prompt for curriculum learning Stage 2.
Figure 12: The prompt suffix in stage 2: "First, think between <think> and </think>, using <zoom>[[x1, y1, x2, y2], ...]</zoom> to verify all relevant details. Then, after receiving system feedback, provide your final reasoning in <rethink>...</rethink> and the final answer in <answer>...</answer>." A.2 Prompt Design for Evaluation For our primary evaluation, we employ the system prompt and suffix from CL Stage 2 durin…
Figure 13: The prompt design for data extractor Edata. [System Prompt]: You are a strict data-extraction assistant. Your only job is to extract the final, core factual answer from the model’s response based on the original question. - You must output ONLY the extracted concise answer. - DO NOT include conversational filler (e.g., ’The image shows...’, ’The answer is...’). - DO NOT use punctuation unless it is part of…
Figure 14: Illustration of the influence of the data extractor module Edata. A and B represent responses from direct query, C represents the response from the ICL baseline, and D denotes the answer from GRIT. We omit the related images for clarity. assessment includes GPT accuracy, which evaluates semantic performance, and inclusion accuracy, which serves as a proxy for hard recall. The specific prompt design for the LLM judges…
Figure 15: The prompt design for LLM judge. [System Prompt]: You are an impartial, strict expert judge evaluating the factual correctness of a model’s answer to a question, based solely on the provided question and ground-truth answer. You must score the model’s answer on a scale from 0.0 to 1.0 using the following strict rubric: - 1.0: The answer is factually correct, complete, and perfectly aligns with the ground tr…
Figure 16: The prompt design for VLM judge. [System Prompt]: You are an expert Visual Information Analyst. Your task is to evaluate the "Information Density" and "Zoom Necessity" of the provided image. Analyze the image based on these criteria: 1. Object Scale: How small are the key elements relative to the image size? 2. Visual Clutter: Is the scene crowded, chaotic, or clean? 3. Text/Detail Level: Is there fine pri…
Figure 17: Illustration of Mags-RL responses on the Zoom-Easy evaluation set. The red boxes indicate the spatial regions explicitly referenced in the model’s text response. D Case Studies We present additional case studies in
Figure 18: Illustration of Mags-RL responses on the Zoom-Medium evaluation set. The red boxes indicate the spatial regions explicitly referenced in the model’s text response. 𝐀. Prompt: What are the people on? Answer: bridge Round 1 [Think and Zoom-in]: <think> To determine what the people are on, I need to analyze the image and identify any visible structures or objects. The image appears to show a river with a bri…
Figure 19: Illustration of Mags-RL responses on the Zoom-Hard evaluation set. The red boxes indicate the spatial regions explicitly referenced in the model’s text response.
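The prompt suffixes quoted in the captions for Figures 10 and 12 define a tagged output format (`<think>`, `<zoom>`, `<rethink>`, `<answer>`). A minimal sketch of how such output might be parsed follows. The helper names `extract_tag` and `parse_zoom_boxes` are hypothetical, and two points are assumptions because the captions do not state them: that the `<zoom>` payload is valid JSON, and that a box is usable only when x1 < x2 and y1 < y2.

```python
import json
import re
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_zoom_boxes(text: str) -> List[Box]:
    """Parse <zoom>[[x1, y1, x2, y2], ...]</zoom> and keep well-formed boxes."""
    raw = extract_tag(text, "zoom")
    if raw is None:
        return []
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(candidates, list):
        return []
    boxes: List[Box] = []
    for item in candidates:
        is_four_numbers = (
            isinstance(item, list)
            and len(item) == 4
            and all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in item
            )
        )
        if not is_four_numbers:
            continue
        x1, y1, x2, y2 = item
        if x1 < x2 and y1 < y2:  # drop empty or inverted boxes
            boxes.append((x1, y1, x2, y2))
    return boxes
```

For example, `extract_tag(response, "answer")` would return the final answer text from a second-round response, and `parse_zoom_boxes` returns an empty list when the model chose not to zoom.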
Original abstract

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as only 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Mags-RL, an Agentic Reinforcement Learning framework that equips Multimodal Large Language Models (MLLMs) with an external super-resolution 'magnifying glass' agent. The approach performs two-round reasoning: the first round generates an initial rationale and autonomously identifies regions of interest without additional annotations; the second round invokes the super-resolution agent to crop and upscale those regions before verifying the rationale to produce the final answer. A novel curriculum learning strategy is said to enable data-efficient RL training with as few as 40 samples. Experiments on subsets of VSR, TallyQA, and GQA are asserted to show superior performance against recent strong competing methods, with high-quality reasoning and precise visual grounding.

Significance. If the empirical claims hold with proper controls and ablations, the work could meaningfully advance MLLM reasoning in high-density, cluttered scenes by avoiding reliance on explicit annotations and using modular external agents for fine-grained inspection. The emphasis on curriculum-based RL for training with only 40 samples represents a potential strength in data efficiency that, if reproducible, would be valuable for practical deployment. The two-round verification structure offers a clear, falsifiable mechanism that could be tested on additional benchmarks.

major comments (2)
  1. [Abstract] Abstract: The central claim of 'superior performance' on VSR, TallyQA, and GQA subsets is stated without any quantitative results, baseline comparisons, error bars, training hyperparameters, or ablation studies. This absence is load-bearing because the manuscript's contribution rests entirely on the empirical demonstration that the two-round RL process outperforms prior methods; without these data the claim cannot be evaluated.
  2. [Abstract] Abstract (method description): No formulation is given for the RL components (reward function, state/action space, policy network, or curriculum schedule). This directly affects assessment of the weakest assumption that the first-round rationale can autonomously select regions whose upscaled versions improve second-round verification; without the training mechanics it is impossible to determine whether the policy learns task-relevant selection rather than spurious crops.
minor comments (2)
  1. [Title] Title: The phrasing 'Wearing Multimodal LLMs a Magnifying Glass' is grammatically awkward and does not clearly convey the intended meaning of equipping the model with an external agent.
  2. [Abstract] Abstract: The statement 'Code and weights will be released soon' is conventional but could usefully include a specific repository or expected release window.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and agree that the abstract can be strengthened with additional details to better support evaluation of the claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of 'superior performance' on VSR, TallyQA, and GQA subsets is stated without any quantitative results, baseline comparisons, error bars, training hyperparameters, or ablation studies. This absence is load-bearing because the manuscript's contribution rests entirely on the empirical demonstration that the two-round RL process outperforms prior methods; without these data the claim cannot be evaluated.

    Authors: We agree that the abstract would be more informative with key quantitative results. In the revised version we will add specific accuracy gains on each dataset (e.g., +X% on VSR, +Y% on TallyQA) together with the main baselines used. Full tables with error bars, hyperparameters, and ablations already appear in Sections 4 and 5; the abstract revision will make the central empirical claim evaluable without requiring the reader to consult the body. revision: yes

  2. Referee: [Abstract] Abstract (method description): No formulation is given for the RL components (reward function, state/action space, policy network, or curriculum schedule). This directly affects assessment of the weakest assumption that the first-round rationale can autonomously select regions whose upscaled versions improve second-round verification; without the training mechanics it is impossible to determine whether the policy learns task-relevant selection rather than spurious crops.

    Authors: The abstract supplies a high-level overview. We will revise it to include a concise formulation of the RL components (reward combining answer correctness and region utility, action space over crop coordinates and scale factors, policy integrated with the MLLM backbone, and the 40-sample curriculum that progressively raises scene complexity). Complete equations and training mechanics are already given in Section 3; the abstract addition will directly address the concern about whether the learned policy selects task-relevant regions. revision: yes
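For context on the training mechanics at issue, the Figure 1 caption text notes that the model is trained with GRPO [34]. GRPO scores each sampled response relative to the other responses drawn for the same prompt, rather than using a learned value function. The sketch below shows only that normalization step under stated assumptions: the function name `group_relative_advantages` is hypothetical, the use of the population standard deviation and the `eps` stabilizer are choices made here, and the rewards are placeholders since the paper's reward function is not given on this page.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the reward design matters for whether the policy learns useful region selection.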

Circularity Check

0 steps flagged

No circularity: empirical RL framework with no derivations or self-referential predictions

Full rationale

The paper describes an empirical agentic RL method for MLLM reasoning with a super-resolution agent and curriculum training on 40 samples. No equations, first-principles derivations, or predictions appear in the provided abstract or described content. The central claims rest on experimental results from RL training rather than any internal definitions, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its inputs by construction. The method is self-contained against external benchmarks via reported performance on VSR, TallyQA, and GQA.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the autonomous ROI identification and the utility of the external super-resolution agent, both introduced without independent evidence or implementation details in the abstract.

invented entities (1)
  • super-resolution magnifying glass agent · no independent evidence
    purpose: Crop and upscale autonomously identified regions of interest for fine-grained verification in the second reasoning round
    Presented as an external tool invoked by the model; no details on its training, architecture, or integration provided in the abstract.

pith-pipeline@v0.9.1-grok · 5820 in / 1265 out tokens · 59204 ms · 2026-06-29T13:34:24.806334+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 32 canonical work pages · 19 internal anchors

  1. [1]

    In: Proceedings of the AAAI conference on artificial intelligence

    Acharya, M., Kafle, K., Kanan, C.: Tallyqa: Answering complex counting questions. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 8076–8084 (2019)

  2. [2]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  3. [3]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [5]

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025), https://arxiv.org/abs/2502.13923

  6. [6]

    In: Proceedings of the 26th annual international conference on machine learning

    Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning. pp. 41–48 (2009)

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M.G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.W.E., Levine, S., Lu, Y., Michalewski...

  8. [8]

    Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford...

  9. [9]

    Findings of the association for computational linguistics: ACL 2024

    Caffagni, D., Cocchi, F., Barsellotti, L., Moratelli, N., Sarto, S., Baraldi, L., Cornia, M., Cucchiara, R.: The revolution of multimodal large language models: A survey. Findings of the association for computational linguistics: ACL 2024 pp. 13590–13618 (2024)

  10. [10]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)

  11. [11]

    arXiv preprint arXiv:2506.06366 (2025)

    Chen, L., Zhang, Y., Feng, J., Chai, H., Zhang, H., Fan, B., Ma, Y., Zhang, S., Li, N., Liu, T., et al.: Ai agent behavioral science. arXiv preprint arXiv:2506.06366 (2025)

  12. [12]

    arXiv preprint arXiv:2505.09655 (2025)

    Chen, X., Zhu, W., Qiu, P., Dong, X., Wang, H., Wu, H., Li, H., Sotiras, A., Wang, Y., Razi, A.: Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero-like training of large language models. arXiv preprint arXiv:2505.09655 (2025)

  13. [13]

    Chen, X., Zhu, W., Qiu, P., Wang, H., Li, H., Wu, H., Dong, X., Sotiras, A., Wang, Y., Razi, A.: Prompt-ot: An optimal transport regularization paradigm for knowledge preservation in vision-language model adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 667–676 (2026)

  14. [14]

    arXiv preprint arXiv:2512.24052 (2025)

    Chen, Y., Zhu, W., Chen, X., Wang, Z., Li, X., Qiu, P., Wang, H., Dong, X., Xiong, Y., Schneider, A., et al.: Aha: Aligning large audio-language models for reasoning hallucinations via counterfactual hard negatives. arXiv preprint arXiv:2512.24052 (2025)

  15. [15]

    Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning

    Cheng, M., Ouyang, J., Yu, S., Yan, R., Luo, Y., Liu, Z., Wang, D., Liu, Q., Chen, E.: Agent-r1: Training powerful llm agents with end-to-end reinforcement learning. arXiv preprint arXiv:2511.14460 (2025)

  16. [16]

    Advances in neural information processing systems 30 (2017)

    Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. Advances in neural information processing systems 30 (2017)

  17. [17]

    arXiv preprint arXiv:2508.01617 (2025)

    Dong, X., Zhu, W., Chen, X., Wang, Z., Qiu, P., Tang, S., Li, X., Wang, Y.: Llada-medv: Exploring large language diffusion models for biomedical image understanding. arXiv preprint arXiv:2508.01617 (2025)

  18. [18]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

  19. [19]

    GRIT: Teaching MLLMs to Think with Images

    Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Narayanaraju, S.J., Guan, X., Wang, X.E.: Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879 (2025)

  20. [20]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6700–6709 (2019)

  21. [21]

    arXiv preprint arXiv:2512.16848 (2025)

    Jiang, Y., Jiang, L., Teney, D., Moor, M., Brbic, M.: Meta-rl induces exploration in language agents. arXiv preprint arXiv:2512.16848 (2025)

  22. [22]

    Advances in Neural Information Processing Systems 36, 28541–28564 (2023)

    Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36, 28541–28564 (2023)

  23. [23]

    Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921, 2025

    Li, Y., Liu, Z., Li, Z., Zhang, X., Xu, Z., Chen, X., Shi, H., Jiang, S., Wang, X., Wang, J., et al.: Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921 (2025)

  24. [24]

    In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multimodal models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26763–26773 (2024)

  25. [25]

    Transactions of the Association for Computational Linguistics 11, 635–651 (2023)

    Liu, F., Emerson, G., Collier, N.: Visual spatial reasoning. Transactions of the Association for Computational Linguistics 11, 635–651 (2023)

  26. [26]

    Advances in neural information processing systems 35, 27730–27744 (2022)

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, 27730–27744 (2022)

  27. [27]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

  28. [28]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Qiu, P., Zhu, W., Kumar, S., Chen, X., Yang, J., Sun, X., Razi, A., Wang, Y., Sotiras, A.: Multimodal variational autoencoder: A barycentric view. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 20060–20068 (2025)

  29. [29]

    In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining

    Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 3505–3506 (2020)

  30. [30]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Schmalfuss, J., Chang, N., VS, V., Shen, M., Bruhn, A., Alvarez, J.M.: Parc: A quantitative framework uncovering the symmetries within vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25081–25091 (2025)

  31. [31]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  32. [32]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Shao, H., Hu, Y., Wang, L., Song, G., Waslander, S.L., Liu, Y., Li, H.: Lmdrive: Closed-loop end-to-end driving with large language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15120–15130 (2024)

  33. [33]

    Advances in Neural Information Processing Systems 37, 8612–8642 (2024)

    Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems 37, 8612–8642 (2024)

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  35. [35]

    In: European conference on computer vision

    Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European conference on computer vision. pp. 256–274. Springer (2024)

  36. [36]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Su, Z., Xia, P., Guo, H., Liu, Z., Ma, Y., Qu, X., Liu, J., Li, Y., Zeng, K., Yang, Z., et al.: Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918 (2025)

  37. [37]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  38. [38]

    Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    Wang, W., Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Zhu, J., Zhu, X., Lu, L., Qiao, Y., et al.: Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442 (2024)

  39. [39]

    arXiv preprint arXiv:2401.06805 (2024)

    Wang, Y., Chen, W., Han, X., Lin, X., Zhao, H., Liu, Y., Zhai, B., Yuan, J., You, Q., Yang, H.: Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805 (2024)

  40. [40]

    Advances in neural information processing systems 35, 24824–24837 (2022)

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

  41. [41]

    Wu, J., Gan, W., Chen, Z., Wan, S., Yu, P.S.: Multimodal large language models: A survey. In: 2023 IEEE International Conference on Big Data (BigData). pp. 2247–2256. IEEE (2023)

  42. [42]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Wu, J., Guan, J., Feng, K., Liu, Q., Wu, S., Wang, L., Wu, W., Tan, T.: Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965 (2025)

  43. [43]

    Grounded Chain-of-Thought for Multimodal Large Language Models

    Wu, Q., Yang, X., Zhou, Y., Fang, C., Song, B., Sun, X., Ji, R.: Grounded chain-of-thought for multimodal large language models. arXiv preprint arXiv:2503.12799 (2025)

  44. [44]

    Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.Y.K., Li, Z., Zhao, H.: Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters 9(10), 8186–8193 (2024)

  45. [45]

    Yang, L., Xu, S., Sellergren, A., Kohlberger, T., Zhou, Y., Ktena, I., Kiraly, A., Ahmed, F., Hormozdiari, F., Jaroensri, T., et al.: Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162 (2024)

  46. [46]

    Yang, W., Xia, Y., Huang, J., Lu, S., Chen, Q.G., Xu, Z., Luo, W., Zhang, K., Wan, Y., Zhang, L.: Deep but reliable: Advancing multi-turn reasoning for thinking with images. arXiv preprint arXiv:2512.17306 (2025)

  47. [47]

    Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review 11(12), nwae403 (2024)

  48. [48]

    Yu, X., Guan, D., Gu, Y.: Zoom-refine: Boosting high-resolution multimodal understanding via localized zoom and self-refinement (2025), https://arxiv.org/abs/2506.01663

  49. [49]

    The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    Zhang, G., Geng, H., Yu, X., Yin, Z., Zhang, Z., Tan, Z., Zhou, H., Li, Z., Xue, X., Li, Y., et al.: The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547 (2025)

  50. [50]

    Zhang, M., Yang, Y., Xie, R., Dhingra, B., Zhou, S., Pei, J.: Generalizability of large language model-based agents: A comprehensive survey. ACM Computing Surveys (2025)

  51. [51]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2023)

  52. [52]

    Zheng, M., Sun, L., Dong, J., Pan, J.: Smfanet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In: ECCV (2024)

  53. [53]

    Zhou, C., Wang, M., Ma, Y., Wu, C., Chen, W., Qian, Z., Liu, X., Zhang, Y., Wang, J., Xu, H., et al.: From perception to cognition: A survey of vision-language interactive reasoning in multimodal large language models. arXiv preprint arXiv:2509.25373 (2025)

  54. [54]

    Zhu, W., Li, X., Chen, X., Qiu, P., Vasa, V.K., Dong, X., Chen, Y., Lepore, N., Dumitrascu, O., Su, Y., et al.: Retinalgpt: A retinal clinical preference conversational assistant powered by large vision-language models. arXiv preprint arXiv:2503.03987 (2025)

A Prompt Design Details

This section details the prompt design for training Mags-RL. Specificall...

1. All initial analysis must be inside <think>...</think>
2. The Zoom tool <zoom>...</zoom> must be nested INSIDE the <think>...</think> block
3. All updated reasoning must be inside <rethink>...</rethink>
4. The final answer must be inside <answer>...</answer>.

Fig. 10: The prompt suffix in stage 1

First, think between <think> and </think>, using <zoom>[[x1, y1, x2, y2]]</zoom> if details are unclear. Then, after receiving system feedback, provide your final reasoning in <rethink>...</rethink> and the final answer in <answer>...</answer>.

Fig. 11: Modified sys...

1. Information Density
2. Object Scale: How small are the key elements relative to the image size?
3. Visual Clutter: Is the scene crowded, chaotic, or clean?
4. Text/Detail Level: Is there fine print, tiny textures, or distant background details that are hard to see?

Based on your analysis, provide a "zoom_score" from 1 to 10:
- Score 1-3 (Simple):
  - Subject is large, centered, and clearly visible.
  - No zoom needed.
- Score 4-7 (Medium):
  - A standard scene with multiple objects or moderate distance.
  - Main elemen...
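
The tag rules stated in this appendix (initial analysis in <think>, the <zoom> call nested inside <think>, updated reasoning in <rethink>, final answer in <answer>) lend themselves to a mechanical check. The following is a minimal Python sketch of such a check, not the authors' implementation: the function names `check_format` and `parse_zoom`, the error strings, and the choice to read box coordinates as plain numbers (pixel or normalized units are not specified in the text shown here) are all assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# One numeric coordinate with optional surrounding whitespace (assumed syntax).
NUM = r"\s*(-?\d+(?:\.\d+)?)\s*"
# Matches <zoom>[[x1, y1, x2, y2]]</zoom> as written in the prompt suffix.
ZOOM_BOX = re.compile(
    r"<zoom>\s*\[\[" + NUM + "," + NUM + "," + NUM + "," + NUM + r"\]\]\s*</zoom>"
)


def _span(tag: str, text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first <tag>...</tag> block, or None."""
    m = re.search(rf"<{tag}>.*?</{tag}>", text, re.DOTALL)
    return m.span() if m else None


def check_format(text: str) -> List[str]:
    """Hypothetical checker: list the tag rules that `text` violates."""
    problems: List[str] = []
    think = _span("think", text)
    rethink = _span("rethink", text)
    answer = _span("answer", text)

    if think is None:
        problems.append("missing <think> block")
    if rethink is None:
        problems.append("missing <rethink> block")
    if answer is None:
        problems.append("missing <answer> block")

    # Every <zoom> call must sit inside the <think> block.
    for m in re.finditer(r"<zoom>.*?</zoom>", text, re.DOTALL):
        inside = think is not None and think[0] <= m.start() and m.end() <= think[1]
        if not inside:
            problems.append("<zoom> outside <think>")

    # Assumed ordering: <think>, then <rethink>, then <answer>.
    if think and rethink and rethink[0] < think[1]:
        problems.append("<rethink> must follow <think>")
    if rethink and answer and answer[0] < rethink[1]:
        problems.append("<answer> must follow <rethink>")
    return problems


def parse_zoom(text: str) -> Optional[Tuple[float, ...]]:
    """Extract (x1, y1, x2, y2) from the first well-formed zoom call, if any."""
    m = ZOOM_BOX.search(text)
    return tuple(float(g) for g in m.groups()) if m else None
```

An empty list from `check_format` means the transcript satisfies all four rules. A checker of this kind could plausibly feed a format reward during RL training, but whether Mags-RL does so is not stated in the text shown here.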