Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Gen Li; Hao Wang; Peijie Qiu; Prayag Tiwari; Shao Tang; Wenhui Zhu; Xiaobing Yu; Xin Li; Xiwen Chen; Xuanzhao Dong

arxiv: 2605.27960 · v1 · pith:67JGF7TBnew · submitted 2026-05-27 · 💻 cs.CV

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Xuanzhao Dong , Wenhui Zhu , Peijie Qiu , Xiwen Chen , Xiaobing Yu , Xin Li , Zhipeng Wang , Shao Tang

show 6 more authors

Gen Li Yujian Xiong Hao Wang Yanxi Chen Prayag Tiwari Yalin Wang

This is my paper

Pith reviewed 2026-06-29 13:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language modelsagentic reinforcement learningsuper-resolutionvisual reasoningcomplex scene understandingcurriculum learningfine-grained inspectiontwo-round reasoning

0 comments

The pith

Mags-RL equips MLLMs with an external super-resolution magnifying glass agent that performs autonomous two-round reasoning to improve accuracy on cluttered scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mags-RL as an agentic reinforcement learning method that adds a super-resolution magnifying glass agent to multimodal large language models. The model first generates an initial rationale and selects regions of interest on its own, then invokes the agent to crop and upscale those regions for a verification round that produces the final answer. A curriculum learning schedule allows this training to succeed with as few as 40 samples. Readers would care because MLLMs commonly fail on high-density or cluttered images, and the approach avoids the extra bounding-box annotations required by prior methods while still delivering stronger results on VSR, TallyQA, and GQA subsets.

Core claim

The central claim is that an external super-resolution magnifying glass agent, invoked through agentic RL, lets MLLMs conduct two-round reasoning: an initial pass produces a rationale and autonomously identifies regions of interest without extra annotations, after which the agent crops and upscales those regions so the model can revisit and verify its earlier reasoning to reach a final answer, yielding superior performance on complex scene reasoning benchmarks.

What carries the argument

Two-round reasoning loop in which the first round generates a rationale and selects regions autonomously, and the second round invokes a super-resolution agent to upscale those regions before verification.

If this is right

Superior performance against recent competing methods on VSR, TallyQA, and GQA subsets with precise visual grounding.
Data-efficient RL training that reaches reasonable performance using only 40 samples via curriculum learning.
Region identification occurs without relying on additional human annotations such as bounding boxes.
The two-round structure separates initial rationale generation from detail verification, allowing the model to correct early mistakes after upscaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-round autonomous selection plus external agent pattern could transfer to other detail-sensitive multimodal tasks such as medical imaging or document understanding.
Removing the need for pre-provided bounding boxes may encourage future MLLM designs to rely more on internal rationale steps rather than external detectors.
Curriculum-based training on tiny datasets hints that similar agentic loops might work in low-resource or few-shot multimodal settings beyond the three benchmarks tested.

Load-bearing premise

The initial rationale generation step can reliably identify regions of interest without any additional annotations, and the invoked super-resolution agent will supply fine-grained details that meaningfully improve verification in the second round.

What would settle it

If replacing the super-resolution upscaling step with the original low-resolution crops produces no measurable gain in final answer accuracy on the same test sets, the benefit of the magnifying glass agent would be refuted.

Figures

Figures reproduced from arXiv: 2605.27960 by Gen Li, Hao Wang, Peijie Qiu, Prayag Tiwari, Shao Tang, Wenhui Zhu, Xiaobing Yu, Xin Li, Xiwen Chen, Xuanzhao Dong, Yalin Wang, Yanxi Chen, Yujian Xiong, Zhipeng Wang.

**Figure 1.** Figure 1: Illustration of model responses to counting questions from TallyQA. A.-D. show the responses produced by CoT, Zoom-Refine, GRIT, and Ours (i.e., Mags-RL), respectively. The Red boxes denote the image crops generated from the model-predicted coordinates. Key syntax is omitted for clarity. tency and data efficiency, we train the model using the Group-Relative Policy Optimization (GRPO) [34] algorithm with a… view at source ↗

**Figure 2.** Figure 2: Overview of the Mags-RL pipeline, which is comprised of a two-round reasoning chain: i) the LLM policy generates an initial rationale and determines a zoom-in region based on the multimodal input; and ii) the LLM policy performs an additional reasoning step to verify the initial logic and generate the final answer. In between these two rounds, a magnifying super-resolution (SR) agent is employed. Mags-RL e… view at source ↗

**Figure 3.** Figure 3: Illustration of the Reward Design for Mags-RL. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of generation results across curriculum learning stages. In stage 1, the model successfully acquires the ability to perform reasoning, trigger the external SR agent, and execute self-verification. However, its bounding box predictions remain relatively conservative, often under-covering relevant targets. In stage 2, the model not only maintains structural correctness but also demonstrates improv… view at source ↗

**Figure 5.** Figure 5: Illustrations of model responses from the Zoom-Medium benchmark. A. compares Mags-RL against the direct query baseline on the VSR dataset. B. compares Mags-RL against the GRIT baseline on the GQA dataset, and C. compares our method against the Zoom-Refine baseline on the TallyQA dataset. performance in spatial reasoning and visual grounding through self-verification, declines from 45.01% to 18.15%. Beyond … view at source ↗

**Figure 6.** Figure 6: Illustration of model responses during CL stage-1. As training with Mags-RL progresses, the model gradually reduces overly descriptive outputs (sub-panel figures A and B) and instead tends to directly target the area of interest, thereby triggering the external visual agent. Key syntax is omitted for clarity. (see [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of model responses with and without the SR module. The Blue boxes highlight outputs from the Crop-and-Resize module, while the Green boxes indicate the response from our Stage-1 CL process [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation study of the Curriculum Learning (CL) strategy [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: The system prompt for curriculum learning Stage 1. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: The prompt suffix in stage 1 First, think between <think> and </think>, using <zoom>[[x1, y1, x2, y2]]</zoom> if details are unclear. Then, after receiving system feedback, provide your final reasoning in <rethink>... </rethink> and the final answer in <answer>...</answer> [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Modified system prompt for curriculum learning Stage 2. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: The prompt suffix in stage 2 First, think between <think> and </think>, using <zoom>[[x1, y1, x2, y2], ...]</zoom> to verify all relevant details. Then, after receiving system feedback, provide your final reasoning in <rethink>...</rethink> and the final answer in <answer>...</answer>. A.2 Prompt Design for Evaluation For our primary evaluation, we employ the system prompt and suffix from CL Stage 2 durin… view at source ↗

**Figure 13.** Figure 13: The prompt design for data extractor Edata [System Prompt]: You are a strict data-extraction assistant. Your only job is to extract the final, core factual answer from the model’s response based on the original question. - You must output ONLY the extracted concise answer. - DO NOT include conversational filler (e.g., ’The image shows...’, ’The answer is...’). - DO NOT use punctuation unless it is part of… view at source ↗

**Figure 14.** Figure 14: Illustration of the influence of data extractor module Edata, A and B represent response from direct query, C represent the response from ICL baseline and D denote the answer from GRIT. We omit the related images for clarity. assessment includes GPT accuracy, which evaluates semantic performance, and inclusion accuracy, which serves as a proxy for hard recall. The specific prompt design for the LLM judges… view at source ↗

**Figure 15.** Figure 15: The prompt design for LLM judge [System Prompt]:You are an impartial, strict expert judge evaluating the factual correctness of a model’s answer to a question, based solely on the provided question and ground-truth answer. You must score the model’s answer on a scale from 0.0 to 1.0 using the following strict rubric: - 1.0: The answer is factually correct, complete, and perfectly aligns with the ground tr… view at source ↗

**Figure 16.** Figure 16: The prompt design for VLM judge [System Prompt]: You are an expert Visual Information Analyst. Your task is to evaluate the "Information Density" and "Zoom Necessity" of the provided image. Analyze the image based on these criteria: 1. Object Scale: How small are the key elements relative to the image size? 2. Visual Clutter: Is the scene crowded, chaotic, or clean? 3. Text/Detail Level: Is there fine pri… view at source ↗

**Figure 17.** Figure 17: Illustration of Mags-RL responses on the Zoom-Easy evaluation set. The red boxes indicate the spatial regions explicitly referenced in the model’s text response. D Case Studies We present additional case studies in [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

**Figure 18.** Figure 18: Illustration of Mags-RL responses on the Zoom-Medium evaluation set. The red boxes indicate the spatial regions explicitly referenced in the model’s text response. 𝐀. Prompt: What are the people on? Answer: bridge Round 1 [Think and Zoom-in]: <think> To determine what the people are on, I need to analyze the image and identify any visible structures or objects. The image appears to show a river with a bri… view at source ↗

**Figure 19.** Figure 19: Illustration of Mags-RL responses on the Zoom-Hard evaluation set. The red boxes indicate the spatial regions explicitly referenced in the model’s text response [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗

read the original abstract

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as only 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Mags-RL trains an RL agent to pick ROIs for super-resolution in a two-round MLLM loop and claims it works with 40 samples, but the first-round selection reliability is the unproven core.

read the letter

The paper's central move is to add an external RL-trained agent that lets the base MLLM generate an initial rationale, pick its own regions without extra labels, upscale those via super-resolution, and then verify in a second round. It also uses a curriculum strategy to reach decent results on VSR, TallyQA, and GQA subsets with only 40 training samples.

The data-efficiency part is the clearest practical angle. Prior approaches often rely on bounding-box annotations or larger supervised sets for grounding, so avoiding that while still claiming gains on dense-scene benchmarks is worth checking. The two-round structure directly targets the low-resolution crop issue that the abstract flags in earlier work.

The soft spot is the assumption that the first-round rationale will reliably surface regions whose upscaled versions actually improve the second round. The abstract itself notes that MLLMs already fail on high-density clutter, so if the initial picks are off, the magnifying-glass step has nothing useful to work with. The stress-test concern lands here: without ablations showing that the RL policy learns task-relevant crops rather than spurious ones, the gains could be fragile. The abstract states superior performance but supplies no numbers, baselines, or error bars, which leaves the size of the improvement unclear even after reading the full text.

This is for researchers focused on agentic extensions to existing MLLMs and on low-data fine-tuning for visual QA. A reader already working on grounding or curriculum RL would find the setup familiar enough to evaluate quickly. It deserves a serious referee because the targeted combination and the 40-sample claim are concrete enough to test, even if the first-round reliability needs more evidence.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Mags-RL, an Agentic Reinforcement Learning framework that equips Multimodal Large Language Models (MLLMs) with an external super-resolution 'magnifying glass' agent. The approach performs two-round reasoning: the first round generates an initial rationale and autonomously identifies regions of interest without additional annotations; the second round invokes the super-resolution agent to crop and upscale those regions before verifying the rationale to produce the final answer. A novel curriculum learning strategy is said to enable data-efficient RL training with as few as 40 samples. Experiments on subsets of VSR, TallyQA, and GQA are asserted to show superior performance against recent strong competing methods, with high-quality reasoning and precise visual grounding.

Significance. If the empirical claims hold with proper controls and ablations, the work could meaningfully advance MLLM reasoning in high-density, cluttered scenes by avoiding reliance on explicit annotations and using modular external agents for fine-grained inspection. The emphasis on curriculum-based RL for training with only 40 samples represents a potential strength in data efficiency that, if reproducible, would be valuable for practical deployment. The two-round verification structure offers a clear, falsifiable mechanism that could be tested on additional benchmarks.

major comments (2)

[Abstract] Abstract: The central claim of 'superior performance' on VSR, TallyQA, and GQA subsets is stated without any quantitative results, baseline comparisons, error bars, training hyperparameters, or ablation studies. This absence is load-bearing because the manuscript's contribution rests entirely on the empirical demonstration that the two-round RL process outperforms prior methods; without these data the claim cannot be evaluated.
[Abstract] Abstract (method description): No formulation is given for the RL components (reward function, state/action space, policy network, or curriculum schedule). This directly affects assessment of the weakest assumption that the first-round rationale can autonomously select regions whose upscaled versions improve second-round verification; without the training mechanics it is impossible to determine whether the policy learns task-relevant selection rather than spurious crops.

minor comments (2)

[Title] Title: The phrasing 'Wearing Multimodal LLMs a Magnifying Glass' is grammatically awkward and does not clearly convey the intended meaning of equipping the model with an external agent.
[Abstract] Abstract: The statement 'Code and weights will be released soon' is conventional but could usefully include a specific repository or expected release window.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and agree that the abstract can be strengthened with additional details to better support evaluation of the claims.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of 'superior performance' on VSR, TallyQA, and GQA subsets is stated without any quantitative results, baseline comparisons, error bars, training hyperparameters, or ablation studies. This absence is load-bearing because the manuscript's contribution rests entirely on the empirical demonstration that the two-round RL process outperforms prior methods; without these data the claim cannot be evaluated.

Authors: We agree that the abstract would be more informative with key quantitative results. In the revised version we will add specific accuracy gains on each dataset (e.g., +X% on VSR, +Y% on TallyQA) together with the main baselines used. Full tables with error bars, hyperparameters, and ablations already appear in Sections 4 and 5; the abstract revision will make the central empirical claim evaluable without requiring the reader to consult the body. revision: yes
Referee: [Abstract] Abstract (method description): No formulation is given for the RL components (reward function, state/action space, policy network, or curriculum schedule). This directly affects assessment of the weakest assumption that the first-round rationale can autonomously select regions whose upscaled versions improve second-round verification; without the training mechanics it is impossible to determine whether the policy learns task-relevant selection rather than spurious crops.

Authors: The abstract supplies a high-level overview. We will revise it to include a concise formulation of the RL components (reward combining answer correctness and region utility, action space over crop coordinates and scale factors, policy integrated with the MLLM backbone, and the 40-sample curriculum that progressively raises scene complexity). Complete equations and training mechanics are already given in Section 3; the abstract addition will directly address the concern about whether the learned policy selects task-relevant regions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL framework with no derivations or self-referential predictions

full rationale

The paper describes an empirical agentic RL method for MLLM reasoning with a super-resolution agent and curriculum training on 40 samples. No equations, first-principles derivations, or predictions appear in the provided abstract or described content. The central claims rest on experimental results from RL training rather than any internal definitions, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its inputs by construction. The method is self-contained against external benchmarks via reported performance on VSR, TallyQA, and GQA.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the effectiveness of the autonomous ROI identification and the utility of the external super-resolution agent, both introduced without independent evidence or implementation details in the abstract.

invented entities (1)

super-resolution magnifying glass agent no independent evidence
purpose: Crop and upscale autonomously identified regions of interest for fine-grained verification in the second reasoning round
Presented as an external tool invoked by the model; no details on its training, architecture, or integration provided in the abstract.

pith-pipeline@v0.9.1-grok · 5820 in / 1265 out tokens · 59204 ms · 2026-06-29T13:34:24.806334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 32 canonical work pages · 19 internal anchors

[1]

In: Proceedings of the AAAI conference on artificial intelligence

Acharya,M.,Kafle,K.,Kanan,C.:Tallyqa:Answeringcomplexcountingquestions. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 8076– 8084 (2019)

2019
[2]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

In: Pro- ceedings ofthe 26thannualinternational conference on machine learning.pp

Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Pro- ceedings ofthe 26thannualinternational conference on machine learning.pp. 41–48 (2009)

2009
[7]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M.G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.W.E., Levine, S., Lu, Y., Michalewski...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Win- ter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[9]

Findings of the association for computational linguistics: ACL 2024 pp

Caffagni, D., Cocchi, F., Barsellotti, L., Moratelli, N., Sarto, S., Baraldi, L., Cornia, M., Cucchiara, R.: The revolution of multimodal large language models: A survey. Findings of the association for computational linguistics: ACL 2024 pp. 13590– 13618 (2024)

2024
[10]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleash- ing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

arXiv preprint arXiv:2506.06366 (2025)

Chen, L., Zhang, Y., Feng, J., Chai, H., Zhang, H., Fan, B., Ma, Y., Zhang, S., Li, N., Liu, T., et al.: Ai agent behavioral science. arXiv preprint arXiv:2506.06366 (2025)

work page arXiv 2025
[12]

arXiv preprint arXiv:2505.09655 (2025)

Chen, X., Zhu, W., Qiu, P., Dong, X., Wang, H., Wu, H., Li, H., Sotiras, A., Wang, Y., Razi, A.: Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero- like training of large language models. arXiv preprint arXiv:2505.09655 (2025)

work page arXiv 2025
[13]

Chen, X., Zhu, W., Qiu, P., Wang, H., Li, H., Wu, H., Dong, X., Sotiras, A., Wang, Y., Razi, A.: Prompt-ot: An optimal transport regularization paradigm for knowledgepreservationinvision-languagemodeladaptation.In:Proceedingsofthe IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 667–676 (2026)

2026
[14]

arXiv preprint arXiv:2512.24052 (2025)

Chen, Y., Zhu, W., Chen, X., Wang, Z., Li, X., Qiu, P., Wang, H., Dong, X., Xiong, Y., Schneider, A., et al.: Aha: Aligning large audio-language models for reasoning hallucinations via counterfactual hard negatives. arXiv preprint arXiv:2512.24052 (2025)

work page arXiv 2025
[15]

Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning

Cheng, M., Ouyang, J., Yu, S., Yan, R., Luo, Y., Liu, Z., Wang, D., Liu, Q., Chen, E.: Agent-r1: Training powerful llm agents with end-to-end reinforcement learning. arXiv preprint arXiv:2511.14460 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Advances in neural information processing systems30(2017)

Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. Advances in neural information processing systems30(2017)

2017
[17]

arXiv preprint arXiv:2508.01617 (2025)

Dong, X., Zhu, W., Chen, X., Wang, Z., Qiu, P., Tang, S., Li, X., Wang, Y.: Llada-medv: Exploring large language diffusion models for biomedical image un- derstanding. arXiv preprint arXiv:2508.01617 (2025)

work page arXiv 2025
[18]

PaLM-E: An Embodied Multimodal Language Model

Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

GRIT: Teaching MLLMs to Think with Images

Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Narayanaraju, S.J., Guan, X., Wang, X.E.: Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) Mags-RL: Wearing Multimodal LLMs a Magnifying Glass 17

2019
[21]

arXiv preprint arXiv:2512.16848 (2025)

Jiang, Y., Jiang, L., Teney, D., Moor, M., Brbic, M.: Meta-rl induces exploration in language agents. arXiv preprint arXiv:2512.16848 (2025)

work page arXiv 2025
[22]

Advances in Neural Information Processing Systems36, 28541–28564 (2023)

Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023)

2023
[23]

Perception, reason, think, and plan: A survey on large multimodal reasoning models.arXiv preprint arXiv:2505.04921, 2025

Li, Y., Liu, Z., Li, Z., Zhang, X., Xu, Z., Chen, X., Shi, H., Jiang, S., Wang, X., Wang, J., et al.: Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921 (2025)

work page arXiv 2025
[24]

In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multi- modal models. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26763–26773 (2024)

2024
[25]

Transactions of the Association for Computational Linguistics11, 635–651 (2023)

Liu, F., Emerson, G., Collier, N.: Visual spatial reasoning. Transactions of the Association for Computational Linguistics11, 635–651 (2023)

2023
[26]

Advances in neural information processing sys- tems35, 27730–27744 (2022)

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing sys- tems35, 27730–27744 (2022)

2022
[27]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos- 2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Qiu, P., Zhu, W., Kumar, S., Chen, X., Yang, J., Sun, X., Razi, A., Wang, Y., Soti- ras, A.: Multimodal variational autoencoder: A barycentric view. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 20060–20068 (2025)

2025
[29]

In: Proceed- ings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining

Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In: Proceed- ings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 3505–3506 (2020)

2020
[30]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Schmalfuss, J., Chang, N., VS, V., Shen, M., Bruhn, A., Alvarez, J.M.: Parc: A quantitative framework uncovering the symmetries within vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25081–25091 (2025)

2025
[31]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Shao, H., Hu, Y., Wang, L., Song, G., Waslander, S.L., Liu, Y., Li, H.: Lmdrive: Closed-loop end-to-end driving with large language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15120– 15130 (2024)

2024
[33]

Advances in Neural Information Processing Systems37, 8612–8642 (2024)

Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Vi- sual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems37, 8612–8642 (2024)

2024
[34]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

In: European conference on computer vision

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European conference on computer vision. pp. 256–274. Springer (2024) 18 X. Dong et al

2024
[36]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Su, Z., Xia, P., Guo, H., Liu, Z., Ma, Y., Qu, X., Liu, J., Li, Y., Zeng, K., Yang, Z., et al.: Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Wang, W., Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Zhu, J., Zhu, X., Lu, L., Qiao, Y., et al.: Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

arXiv preprint arXiv:2401.06805 (2024)

Wang,Y.,Chen,W.,Han,X.,Lin,X.,Zhao,H.,Liu,Y.,Zhai,B.,Yuan,J.,You,Q., Yang, H.: Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805 (2024)

work page arXiv 2024
[40]

Advances in neural information processing systems35, 24824–24837 (2022)

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022)

2022
[41]

In: 2023 IEEE International Conference on Big Data (BigData)

Wu, J., Gan, W., Chen, Z., Wan, S., Yu, P.S.: Multimodal large language models: A survey. In: 2023 IEEE International Conference on Big Data (BigData). pp. 2247–2256. IEEE (2023)

2023
[42]

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Wu, J., Guan, J., Feng, K., Liu, Q., Wu, S., Wang, L., Wu, W., Tan, T.: Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Grounded chain-of-thought for multimodal large language models.arXiv preprint arXiv:2503.12799, 2025

Wu, Q., Yang, X., Zhou, Y., Fang, C., Song, B., Sun, X., Ji, R.: Grounded chain- of-thought for multimodal large language models. arXiv preprint arXiv:2503.12799 (2025)

work page arXiv 2025
[44]

IEEE Robotics and Automation Letters9(10), 8186–8193 (2024)

Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.Y.K., Li, Z., Zhao, H.: Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters9(10), 8186–8193 (2024)

2024
[45]

arXiv preprint arXiv:2405.03162 (2024)

Yang, L., Xu, S., Sellergren, A., Kohlberger, T., Zhou, Y., Ktena, I., Kiraly, A., Ahmed, F., Hormozdiari, F., Jaroensri, T., et al.: Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162 (2024)

work page arXiv 2024
[46]

arXiv preprint arXiv:2512.17306 (2025)

Yang, W., Xia, Y., Huang, J., Lu, S., Chen, Q.G., Xu, Z., Luo, W., Zhang, K., Wan, Y., Zhang, L.: Deep but reliable: Advancing multi-turn reasoning for thinking with images. arXiv preprint arXiv:2512.17306 (2025)

work page arXiv 2025
[47]

National Science Review11(12), nwae403 (2024)

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review11(12), nwae403 (2024)

2024
[48]

Yu, X., Guan, D., Gu, Y.: Zoom-refine: Boosting high-resolution multimodal un- derstanding via localized zoom and self-refinement (2025),https://arxiv.org/ abs/2506.01663

work page arXiv 2025
[49]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Zhang, G., Geng, H., Yu, X., Yin, Z., Zhang, Z., Tan, Z., Zhou, H., Li, Z., Xue, X., Li, Y., et al.: The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

ACM Computing Surveys (2025)

Zhang,M.,Yang,Y.,Xie,R.,Dhingra,B.,Zhou,S.,Pei,J.:Generalizabilityoflarge language model-based agents: A comprehensive survey. ACM Computing Surveys (2025)

2025
[51]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain- of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

In: ECCV (2024) Mags-RL: Wearing Multimodal LLMs a Magnifying Glass 19

Zheng, M., Sun, L., Dong, J., Pan, J.: Smfanet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In: ECCV (2024) Mags-RL: Wearing Multimodal LLMs a Magnifying Glass 19

2024
[53]

Zhou, C., Wang, M., Ma, Y., Wu, C., Chen, W., Qian, Z., Liu, X., Zhang, Y., Wang, J., Xu, H., et al.: From perception to cognition: A survey of vision- languageinteractivereasoninginmultimodallargelanguagemodels.arXivpreprint arXiv:2509.25373 (2025)

work page arXiv 2025
[54]

arXiv preprint arXiv:2503.03987 (2025) A Prompt Degisn Details This section details the prompt design for training Mags-RL

Zhu, W., Li, X., Chen, X., Qiu, P., Vasa, V.K., Dong, X., Chen, Y., Lepore, N., Dumitrascu, O., Su, Y., et al.: Retinalgpt: A retinal clinical preference con- versational assistant powered by large vision-language models. arXiv preprint arXiv:2503.03987 (2025) A Prompt Degisn Details This section details the prompt design for training Mags-RL. Specificall...

work page arXiv 2025
[55]

All initial analysis must be inside <think>...</think>
[56]

The Zoom tool <zoom>...</zoom> must be nested INSIDE the <think>...</think> block
[57]

All updated reasoning must be inside <rethink>...</rethink>
[58]

Information Density

The final answer must be inside <answer>...</answer>. Fig.10: The prompt suffix in stage 1 First, think between <think> and </think>, using <zoom>[[x1, y1, x2, y2]]</zoom> if details are unclear. Then, after receiving system feedback, provide your final reasoning in <rethink>... </rethink> and the final answer in <answer>...</answer>. Fig.11: Modified sys...
[59]

Object Scale: How small are the key elements relative to the image size?
[60]

Visual Clutter: Is the scene crowded, chaotic, or clean?
[61]

zoom_score

Text/Detail Level: Is there fine print, tiny textures, or distant background details that are hard to see? Based on your analysis, provide a "zoom_score" from 1 to 10: - Score 1-3 (Simple): - Subject is large, centered, and clearly visible. - No zoom needed. - Score 4-7 (Medium): - A standard scene with multiple objects or moderate distance. - Main elemen...

[1] [1]

In: Proceedings of the AAAI conference on artificial intelligence

Acharya,M.,Kafle,K.,Kanan,C.:Tallyqa:Answeringcomplexcountingquestions. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 8076– 8084 (2019)

2019

[2] [2]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

In: Pro- ceedings ofthe 26thannualinternational conference on machine learning.pp

Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Pro- ceedings ofthe 26thannualinternational conference on machine learning.pp. 41–48 (2009)

2009

[7] [7]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M.G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.W.E., Levine, S., Lu, Y., Michalewski...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Win- ter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[9] [9]

Findings of the association for computational linguistics: ACL 2024 pp

Caffagni, D., Cocchi, F., Barsellotti, L., Moratelli, N., Sarto, S., Baraldi, L., Cornia, M., Cucchiara, R.: The revolution of multimodal large language models: A survey. Findings of the association for computational linguistics: ACL 2024 pp. 13590– 13618 (2024)

2024

[10] [10]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleash- ing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

arXiv preprint arXiv:2506.06366 (2025)

Chen, L., Zhang, Y., Feng, J., Chai, H., Zhang, H., Fan, B., Ma, Y., Zhang, S., Li, N., Liu, T., et al.: Ai agent behavioral science. arXiv preprint arXiv:2506.06366 (2025)

work page arXiv 2025

[12] [12]

arXiv preprint arXiv:2505.09655 (2025)

Chen, X., Zhu, W., Qiu, P., Dong, X., Wang, H., Wu, H., Li, H., Sotiras, A., Wang, Y., Razi, A.: Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero- like training of large language models. arXiv preprint arXiv:2505.09655 (2025)

work page arXiv 2025

[13] [13]

Chen, X., Zhu, W., Qiu, P., Wang, H., Li, H., Wu, H., Dong, X., Sotiras, A., Wang, Y., Razi, A.: Prompt-ot: An optimal transport regularization paradigm for knowledgepreservationinvision-languagemodeladaptation.In:Proceedingsofthe IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 667–676 (2026)

2026

[14] [14]

arXiv preprint arXiv:2512.24052 (2025)

Chen, Y., Zhu, W., Chen, X., Wang, Z., Li, X., Qiu, P., Wang, H., Dong, X., Xiong, Y., Schneider, A., et al.: Aha: Aligning large audio-language models for reasoning hallucinations via counterfactual hard negatives. arXiv preprint arXiv:2512.24052 (2025)

work page arXiv 2025

[15] [15]

Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning

Cheng, M., Ouyang, J., Yu, S., Yan, R., Luo, Y., Liu, Z., Wang, D., Liu, Q., Chen, E.: Agent-r1: Training powerful llm agents with end-to-end reinforcement learning. arXiv preprint arXiv:2511.14460 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Advances in neural information processing systems30(2017)

Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. Advances in neural information processing systems30(2017)

2017

[17] [17]

arXiv preprint arXiv:2508.01617 (2025)

Dong, X., Zhu, W., Chen, X., Wang, Z., Qiu, P., Tang, S., Li, X., Wang, Y.: Llada-medv: Exploring large language diffusion models for biomedical image un- derstanding. arXiv preprint arXiv:2508.01617 (2025)

work page arXiv 2025

[18] [18]

PaLM-E: An Embodied Multimodal Language Model

Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

GRIT: Teaching MLLMs to Think with Images

Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Narayanaraju, S.J., Guan, X., Wang, X.E.: Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) Mags-RL: Wearing Multimodal LLMs a Magnifying Glass 17

2019

[21] [21]

arXiv preprint arXiv:2512.16848 (2025)

Jiang, Y., Jiang, L., Teney, D., Moor, M., Brbic, M.: Meta-rl induces exploration in language agents. arXiv preprint arXiv:2512.16848 (2025)

work page arXiv 2025

[22] [22]

Advances in Neural Information Processing Systems36, 28541–28564 (2023)

Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023)

2023

[23] [23]

Perception, reason, think, and plan: A survey on large multimodal reasoning models.arXiv preprint arXiv:2505.04921, 2025

Li, Y., Liu, Z., Li, Z., Zhang, X., Xu, Z., Chen, X., Shi, H., Jiang, S., Wang, X., Wang, J., et al.: Perception, reason, think, and plan: A survey on large multimodal reasoning models. arXiv preprint arXiv:2505.04921 (2025)

work page arXiv 2025

[24] [24]

In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., Bai, X.: Monkey: Image resolution and text label are important things for large multi- modal models. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26763–26773 (2024)

2024

[25] [25]

Transactions of the Association for Computational Linguistics11, 635–651 (2023)

Liu, F., Emerson, G., Collier, N.: Visual spatial reasoning. Transactions of the Association for Computational Linguistics11, 635–651 (2023)

2023

[26] [26]

Advances in neural information processing sys- tems35, 27730–27744 (2022)

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing sys- tems35, 27730–27744 (2022)

2022

[27] [27]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos- 2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Qiu, P., Zhu, W., Kumar, S., Chen, X., Yang, J., Sun, X., Razi, A., Wang, Y., Soti- ras, A.: Multimodal variational autoencoder: A barycentric view. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 20060–20068 (2025)

2025

[29] [29]

In: Proceed- ings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining

Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In: Proceed- ings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 3505–3506 (2020)

2020

[30] [30]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Schmalfuss, J., Chang, N., VS, V., Shen, M., Bruhn, A., Alvarez, J.M.: Parc: A quantitative framework uncovering the symmetries within vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25081–25091 (2025)

2025

[31] [31]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[32] [32]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Shao, H., Hu, Y., Wang, L., Song, G., Waslander, S.L., Liu, Y., Li, H.: Lmdrive: Closed-loop end-to-end driving with large language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15120– 15130 (2024)

2024

[33] [33]

Advances in Neural Information Processing Systems37, 8612–8642 (2024)

Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Vi- sual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems37, 8612–8642 (2024)

2024

[34] [34]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

In: European conference on computer vision

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European conference on computer vision. pp. 256–274. Springer (2024) 18 X. Dong et al

2024

[36] [36]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Su, Z., Xia, P., Guo, H., Liu, Z., Ma, Y., Qu, X., Liu, J., Li, Y., Zeng, K., Yang, Z., et al.: Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Gemini: A Family of Highly Capable Multimodal Models

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Wang, W., Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Zhu, J., Zhu, X., Lu, L., Qiao, Y., et al.: Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

arXiv preprint arXiv:2401.06805 (2024)

Wang,Y.,Chen,W.,Han,X.,Lin,X.,Zhao,H.,Liu,Y.,Zhai,B.,Yuan,J.,You,Q., Yang, H.: Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805 (2024)

work page arXiv 2024

[40] [40]

Advances in neural information processing systems35, 24824–24837 (2022)

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022)

2022

[41] [41]

In: 2023 IEEE International Conference on Big Data (BigData)

Wu, J., Gan, W., Chen, Z., Wan, S., Yu, P.S.: Multimodal large language models: A survey. In: 2023 IEEE International Conference on Big Data (BigData). pp. 2247–2256. IEEE (2023)

2023

[42] [42]

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Wu, J., Guan, J., Feng, K., Liu, Q., Wu, S., Wang, L., Wu, W., Tan, T.: Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Grounded chain-of-thought for multimodal large language models.arXiv preprint arXiv:2503.12799, 2025

Wu, Q., Yang, X., Zhou, Y., Fang, C., Song, B., Sun, X., Ji, R.: Grounded chain- of-thought for multimodal large language models. arXiv preprint arXiv:2503.12799 (2025)

work page arXiv 2025

[44] [44]

IEEE Robotics and Automation Letters9(10), 8186–8193 (2024)

Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.Y.K., Li, Z., Zhao, H.: Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters9(10), 8186–8193 (2024)

2024

[45] [45]

arXiv preprint arXiv:2405.03162 (2024)

Yang, L., Xu, S., Sellergren, A., Kohlberger, T., Zhou, Y., Ktena, I., Kiraly, A., Ahmed, F., Hormozdiari, F., Jaroensri, T., et al.: Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162 (2024)

work page arXiv 2024

[46] [46]

arXiv preprint arXiv:2512.17306 (2025)

Yang, W., Xia, Y., Huang, J., Lu, S., Chen, Q.G., Xu, Z., Luo, W., Zhang, K., Wan, Y., Zhang, L.: Deep but reliable: Advancing multi-turn reasoning for thinking with images. arXiv preprint arXiv:2512.17306 (2025)

work page arXiv 2025

[47] [47]

National Science Review11(12), nwae403 (2024)

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review11(12), nwae403 (2024)

2024

[48] [48]

Yu, X., Guan, D., Gu, Y.: Zoom-refine: Boosting high-resolution multimodal un- derstanding via localized zoom and self-refinement (2025),https://arxiv.org/ abs/2506.01663

work page arXiv 2025

[49] [49]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Zhang, G., Geng, H., Yu, X., Yin, Z., Zhang, Z., Tan, Z., Zhou, H., Li, Z., Xue, X., Li, Y., et al.: The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

ACM Computing Surveys (2025)

Zhang,M.,Yang,Y.,Xie,R.,Dhingra,B.,Zhou,S.,Pei,J.:Generalizabilityoflarge language model-based agents: A comprehensive survey. ACM Computing Surveys (2025)

2025

[51] [51]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain- of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

In: ECCV (2024) Mags-RL: Wearing Multimodal LLMs a Magnifying Glass 19

Zheng, M., Sun, L., Dong, J., Pan, J.: Smfanet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In: ECCV (2024) Mags-RL: Wearing Multimodal LLMs a Magnifying Glass 19

2024

[53] [53]

Zhou, C., Wang, M., Ma, Y., Wu, C., Chen, W., Qian, Z., Liu, X., Zhang, Y., Wang, J., Xu, H., et al.: From perception to cognition: A survey of vision- languageinteractivereasoninginmultimodallargelanguagemodels.arXivpreprint arXiv:2509.25373 (2025)

work page arXiv 2025

[54] [54]

arXiv preprint arXiv:2503.03987 (2025) A Prompt Degisn Details This section details the prompt design for training Mags-RL

Zhu, W., Li, X., Chen, X., Qiu, P., Vasa, V.K., Dong, X., Chen, Y., Lepore, N., Dumitrascu, O., Su, Y., et al.: Retinalgpt: A retinal clinical preference con- versational assistant powered by large vision-language models. arXiv preprint arXiv:2503.03987 (2025) A Prompt Degisn Details This section details the prompt design for training Mags-RL. Specificall...

work page arXiv 2025

[55] [55]

All initial analysis must be inside <think>...</think>

[56] [56]

The Zoom tool <zoom>...</zoom> must be nested INSIDE the <think>...</think> block

[57] [57]

All updated reasoning must be inside <rethink>...</rethink>

[58] [58]

Information Density

The final answer must be inside <answer>...</answer>. Fig.10: The prompt suffix in stage 1 First, think between <think> and </think>, using <zoom>[[x1, y1, x2, y2]]</zoom> if details are unclear. Then, after receiving system feedback, provide your final reasoning in <rethink>... </rethink> and the final answer in <answer>...</answer>. Fig.11: Modified sys...

[59] [59]

Object Scale: How small are the key elements relative to the image size?

[60] [60]

Visual Clutter: Is the scene crowded, chaotic, or clean?

[61] [61]

zoom_score

Text/Detail Level: Is there fine print, tiny textures, or distant background details that are hard to see? Based on your analysis, provide a "zoom_score" from 1 to 10: - Score 1-3 (Simple): - Subject is large, centered, and clearly visible. - No zoom needed. - Score 4-7 (Medium): - A standard scene with multiple objects or moderate distance. - Main elemen...