DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Chao Shen; Chenxiao Zhao; Guohai Xu; Jack Hong; Le Yang; Michael Yang; Xing Yu; Ziwei Zheng

arxiv: 2505.14362 · v3 · submitted 2025-05-20 · 💻 cs.CV

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng , Michael Yang , Jack Hong , Chenxiao Zhao , Guohai Xu , Le Yang , Chao Shen , Xing Yu This is my paper

Pith reviewed 2026-05-11 14:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language modelsreinforcement learningactive perceptionmultimodal reasoningvisual groundingthinking with imageshallucination reduction

0 comments

The pith

Reinforcement learning lets vision-language models develop native image-based reasoning without pre-collected data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a vision-language model can acquire the capacity to think with images by using reinforcement learning to foster active perception. This process relies on the model's own grounding abilities and a custom data selection plus reward design rather than any initial supervised fine-tuning on reasoning examples. A sympathetic reader would care because the resulting behavior produces measurable gains on perception and reasoning tasks while also cutting hallucinations and aiding mathematical work. The training trace reveals the model shifting from broad visual exploration toward precise, efficient exploitation of image information. In short, the claim is that image-grounded reasoning can emerge as an intrinsic, reward-shaped skill instead of an externally supplied one.

Core claim

DeepEyes trains a vision-language model end-to-end with reinforcement learning so that it learns to think with images through active perception, using its intrinsic grounding capability rather than external tools or pre-collected reasoning data. A tailored data selection and reward strategy steers the model to strategically ground its reasoning in visual content. The outcome is significant gains on general perception and reasoning benchmarks together with better grounding, lower hallucination rates, and stronger mathematical reasoning. During training the model passes through distinct stages: initial exploratory perception gives way to efficient and accurate exploitation, accompanied by a多样化

What carries the argument

Active perception, the learned strategy by which the model decides when and how to ground its ongoing reasoning directly in visual information.

If this is right

Performance improves on perception and reasoning benchmarks without any pre-collected reasoning traces.
Grounding accuracy rises while hallucination rates fall, including on mathematical reasoning tasks.
The model exhibits an internal progression from exploratory to exploitative visual behavior.
Diverse thinking patterns appear that parallel human visual reasoning sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reinforcement-learning incentive structure could be tested on video or audio sequences to induce analogous active-perception loops.
If the approach scales, training pipelines for multimodal models may require far less curated reasoning data than current supervised routes.
Longer-horizon tasks could reveal whether the emergent perception strategies remain stable or require additional reward shaping.
Real-world deployment in dynamic environments would test whether the learned visual-grounding habits transfer beyond static benchmark images.

Load-bearing premise

The custom reward and data selection rules will steer the model toward genuine, useful visual grounding rather than superficial patterns that merely maximize the reward signal.

What would settle it

Run the same reinforcement learning loop with the visual-grounding reward terms removed or replaced by generic accuracy rewards; if benchmark gains and the reported evolution of perception behavior remain unchanged, the claim that active perception drives the improvements is falsified.

read the original abstract

Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning without requiring pre-collected reasoning data for cold-start supervised fine-tuning (SFT). Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DeepEyes claims RL can teach VLMs to think with images natively without SFT, but missing details leave the mechanism unverified.

read the letter

The main takeaway is that DeepEyes trains a VLM using reinforcement learning to learn active visual perception for reasoning, all without a supervised fine-tuning cold start on reasoning data. It uses the model's built-in grounding as an intrinsic part of the process, guided by specific data choices and rewards, and reports gains across perception, reasoning, grounding, hallucination reduction, and math tasks. They also describe training dynamics where the model moves from exploration to exploitation with varied thinking styles. This approach is new in combining end-to-end RL for this capability without relying on external grounding models or pre-made reasoning traces. It could simplify how we build models that integrate vision more deeply into thought processes, which matters for applications needing real visual reasoning. The paper does a decent job framing the problem of text-dominant reasoning in VLMs and positioning their method as a way to address it directly through incentives rather than imitation. However, the abstract provides no quantitative results, no ablation experiments, and no error analysis. This makes it difficult to judge the strength of the claims. The performance improvements could come from the data selection process alone rather than the RL-driven active perception. There is also no evidence presented that the visual grounding steps actually improve the final outputs in a causal way, as opposed to the model just producing outputs that fit the reward criteria. The risk that the model learns superficial behaviors, like inserting grounding tokens at regular intervals without using them meaningfully, is plausible given how RL can exploit reward signals in unexpected ways, especially without prior SFT to stabilize the policy. For a reader working on multimodal learning or RL for language models, this could spark ideas about intrinsic rewards and active perception. But anyone wanting to build on it would need the full details and experiments to replicate or extend the work. I think this paper deserves to go through peer review. The core idea is worth serious evaluation by referees who can assess the methods and results in depth, even though the abstract alone does not provide enough to fully endorse the findings.

Referee Report

3 major / 2 minor

Summary. The paper introduces DeepEyes, a vision-language model trained end-to-end via reinforcement learning to develop native 'thinking with images' capability through active perception. It claims this emerges without any cold-start supervised fine-tuning on pre-collected reasoning data, relying instead on tailored data selection and a custom reward strategy that leverages the model's intrinsic grounding. The approach reportedly yields significant gains on general perception and reasoning benchmarks, plus improvements in grounding, hallucination reduction, and mathematical reasoning, with observed behavioral evolution from exploration to exploitation and diverse human-like thinking patterns.

Significance. If the central claims hold under rigorous verification, the work would be moderately significant for multimodal AI research. It offers an empirical demonstration that RL can elicit integrated visual reasoning in VLMs without heavy reliance on SFT or external tools, potentially reducing data curation costs and enabling more autonomous active perception. The public code release is a clear strength for reproducibility.

major comments (3)

[Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.
[Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.
[Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.

minor comments (2)

[Abstract] The abstract and introduction use the phrase 'significant performance gains' without defining the term or providing supporting numbers; this should be replaced with concrete metrics or removed.
[Methods] Notation for the active perception loop (e.g., how visual grounding actions are interleaved with text reasoning) is introduced informally; a clear algorithmic pseudocode or diagram would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We appreciate the opportunity to clarify the presentation of our results, methods, and analyses. We address each major comment below and commit to revisions that will strengthen the manuscript.

read point-by-point responses

Referee: [Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.

Authors: We agree that explicit quantitative comparisons are necessary to substantiate the claims. In the revised manuscript we will add tables reporting baseline scores, absolute and relative performance deltas, error bars from multiple runs, and statistical significance tests. We will also include a discussion comparing the observed gains against what data curation alone can achieve, thereby clarifying the contribution of the RL objective. revision: yes
Referee: [Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.

Authors: We acknowledge that the reward formulation was presented at too high a level. The revised Methods section will contain the complete mathematical definition of the reward, explicitly listing each component (grounding accuracy, reasoning utility, format compliance) together with the weighting coefficients. This will enable readers to evaluate convergence behavior and rule out superficial reward hacking. revision: yes
Referee: [Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.

Authors: We agree that additional ablations and quantitative tracking would strengthen the mechanistic claims. The revision will include ablation experiments that compare full RL training against data-selection-only baselines, as well as plots of grounding utility and exploration/exploitation metrics across training steps. Full causal interventions (forcing or ablating visual thought steps) would require new controlled runs; we will therefore provide enhanced observational analysis and discuss the limits of the current evidence. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical RL training with external benchmarks

full rationale

The paper presents an empirical end-to-end RL method for training VLMs to perform active perception and 'think with images' without cold-start SFT. Claims rest on performance gains measured against external perception/reasoning benchmarks and observed behavioral evolution during training. No mathematical derivations, equations, or self-referential definitions are present that would reduce any result to its inputs by construction. The approach is self-contained against independent evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard RL assumptions and the pre-existing grounding capability of the base VLM; no new physical entities or ad-hoc constants are introduced.

pith-pipeline@v0.9.0 · 5496 in / 1029 out tokens · 41197 ms · 2026-05-11T14:37:27.553701+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
cs.CL 2026-05 accept novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
ETCHR: Editing To Clarify and Harness Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
cs.CV 2026-05 unverdicted novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
cs.CL 2026-05 conditional novelty 7.0

AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
cs.LG 2026-05 unverdicted novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-05 unverdicted novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
cs.CV 2026-05 unverdicted novelty 7.0

VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
cs.CV 2026-05 unverdicted novelty 7.0

UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
cs.AI 2026-05 conditional novelty 7.0

State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
cs.AI 2026-05 conditional novelty 7.0

A vision-language policy learns state-conditioned commitment depth to Pareto-dominate fixed-depth baselines on long-horizon puzzles, achieving up to 12.5 pp higher solve rate with 25% fewer actions.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
Act2See: Emergent Active Visual Perception for Video Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
cs.CV 2026-04 unverdicted novelty 7.0

Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
cs.AI 2026-04 conditional novelty 7.0

TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
cs.CV 2026-03 conditional novelty 7.0

OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
cs.CV 2026-03 unverdicted novelty 7.0

V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
Motion-o: Trajectory-Grounded Video Reasoning
cs.CV 2026-03 conditional novelty 7.0

Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing
eess.IV 2026-01 unverdicted novelty 7.0

Q-Probe introduces the first agentic IQA framework that scales to high resolutions using context-aware probing, a new Vista-Bench benchmark, and three-stage training to reach state-of-the-art performance across scales.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
cs.AI 2026-01 unverdicted novelty 7.0

Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
cs.CL 2026-01 unverdicted novelty 7.0

Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.
Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection
cs.AI 2025-12 unverdicted novelty 7.0

ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
cs.CV 2025-12 unverdicted novelty 7.0

DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
Training Multi-Image Vision Agents via End2End Reinforcement Learning
cs.CV 2025-12 unverdicted novelty 7.0

IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing too...
Latent Visual Reasoning
cs.CV 2025-09 unverdicted novelty 7.0

Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
cs.CV 2025-09 unverdicted novelty 7.0

HiDe is a training-free hierarchical decoupling method that separates key visual tokens from background interference in high-resolution MLLMs to achieve new state-of-the-art results on V*Bench, HRBench4K, and HRBench8K.
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
cs.AI 2025-09 unverdicted novelty 7.0

DeFacto trains multimodal models using counterfactual image variants and reinforcement learning rewards to improve both answer accuracy and evidence-answer consistency.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
cs.LG 2026-05 unverdicted novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and...
Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
cs.CV 2026-05 unverdicted novelty 6.0

A structured Zoom-then-Diagnose paradigm with uncertainty-aware GRPO rewards improves lesion localization by 39.3% on liver, breast, and thyroid ultrasound VQA datasets by encouraging caution under ambiguity.
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
cs.CV 2026-05 unverdicted novelty 6.0

Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.
Leveraging Latent Visual Reasoning in Silence
cs.CV 2026-05 conditional novelty 6.0

Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
cs.CV 2026-05 unverdicted novelty 6.0

VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
Venus-DeFakerOne: Unified Fake Image Detection & Localization
cs.CV 2026-05 unverdicted novelty 6.0

DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while pr...
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
cs.CV 2026-04 unverdicted novelty 6.0

Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
cs.CV 2026-04 unverdicted novelty 6.0

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
Visual Reasoning through Tool-supervised Reinforcement Learning
cs.CV 2026-04 unverdicted novelty 6.0

ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.
Towards Long-horizon Agentic Multimodal Search
cs.CV 2026-04 unverdicted novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
cs.CV 2026-04 unverdicted novelty 6.0

AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
cs.CV 2026-04 unverdicted novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI 2026-04 unverdicted novelty 6.0

CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
cs.CV 2026-03 unverdicted novelty 6.0

Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 unverdicted novelty 6.0

MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 conditional novelty 6.0

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 83 Pith papers · 20 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Per- ception tokens enhance visual reasoning in multimodal lan- guage models

Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. arXiv preprint arXiv:2412.03548,

work page arXiv
[5]

Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom supercharges large visual-language model.arXiv e-prints, pp. arXiv–2406, 2024a. Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasonin...

work page arXiv
[6]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

10 Published as a conference paper at ICLR 2026 Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024b. Zhe Chen, Jiannan Wu, Wenhai Wang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jia...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798,

work page 2014
[9]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli ´c, and Furu Wei. Imagine while reasoning in space: Multimodal visua...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer vision– ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pp. 740–755. Springer,

work page 2014
[11]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowled...

work page internal anchor Pith review arXiv
[12]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving

Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, and Junnyong Loo. Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving.arXiv preprint arXiv:2412.02025,

work page arXiv
[14]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365,

work page Pith review arXiv
[15]

s1: Simple test-time scaling

12 Published as a conference paper at ICLR 2026 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393,

work page Pith review arXiv 2026
[16]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536,

work page internal anchor Pith review arXiv
[17]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284,

work page internal anchor Pith review arXiv
[18]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song...

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Mome: Mixture of multimodal experts for generalist multimodal large language models.arXiv preprint arXiv:2407.12709, 2024b

Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, and Liqiang Nie. Mome: Mixture of multimodal experts for generalist multimodal large language models.arXiv preprint arXiv:2407.12709, 2024b. Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Lei Zhang, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, et al. Llava-mod: Making llava tiny via moe knowl...

work page arXiv
[20]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: In- centivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966,

work page internal anchor Pith review arXiv
[21]

Visual agents as fast and slow thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862,

work page arXiv
[22]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025a. Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973,

work page internal anchor Pith review arXiv
[24]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528,

work page internal anchor Pith review arXiv
[25]

Show-o turbo: Towards accelerated unified multimodal understanding and generation.arXiv preprint arXiv:2502.05415, 2025

Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, and Zhijie Deng. Show-o turbo: Towards accelerated unified multimodal understanding and generation.arXiv preprint arXiv:2502.05415,

work page arXiv
[26]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Li- juan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision).arXiv preprint arXiv:2309.17421, 9(1):1,

work page internal anchor Pith review arXiv
[28]

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024a. Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. ...

work page internal anchor Pith review arXiv
[29]

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 13040–13051, 2024b. Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang,...

work page internal anchor Pith review arXiv 2026
[30]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836,

work page arXiv
[32]

type": "function

A PROMPT A.1 SYSTEMPROMPT SYSTEM_PROMPT You are a helpful assistant. # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags:,→ <tools> { "type": "function", "function": { "name": "image_zoom_in_tool", "description": "Zoom in on a specific region of an image by cro...

work page 2026

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Per- ception tokens enhance visual reasoning in multimodal lan- guage models

Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. arXiv preprint arXiv:2412.03548,

work page arXiv

[5] [5]

Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom supercharges large visual-language model.arXiv e-prints, pp. arXiv–2406, 2024a. Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasonin...

work page arXiv

[6] [6]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

10 Published as a conference paper at ICLR 2026 Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024b. Zhe Chen, Jiannan Wu, Wenhai Wang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jia...

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798,

work page 2014

[9] [9]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli ´c, and Furu Wei. Imagine while reasoning in space: Multimodal visua...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer vision– ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pp. 740–755. Springer,

work page 2014

[11] [11]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowled...

work page internal anchor Pith review arXiv

[12] [12]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving

Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, and Junnyong Loo. Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving.arXiv preprint arXiv:2412.02025,

work page arXiv

[14] [14]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365,

work page Pith review arXiv

[15] [15]

s1: Simple test-time scaling

12 Published as a conference paper at ICLR 2026 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393,

work page Pith review arXiv 2026

[16] [16]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536,

work page internal anchor Pith review arXiv

[17] [17]

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284,

work page internal anchor Pith review arXiv

[18] [18]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song...

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Mome: Mixture of multimodal experts for generalist multimodal large language models.arXiv preprint arXiv:2407.12709, 2024b

Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, and Liqiang Nie. Mome: Mixture of multimodal experts for generalist multimodal large language models.arXiv preprint arXiv:2407.12709, 2024b. Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Lei Zhang, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, et al. Llava-mod: Making llava tiny via moe knowl...

work page arXiv

[20] [20]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: In- centivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966,

work page internal anchor Pith review arXiv

[21] [21]

Visual agents as fast and slow thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862,

work page arXiv

[22] [22]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025a. Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973,

work page internal anchor Pith review arXiv

[24] [24]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528,

work page internal anchor Pith review arXiv

[25] [25]

Show-o turbo: Towards accelerated unified multimodal understanding and generation.arXiv preprint arXiv:2502.05415, 2025

Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, and Zhijie Deng. Show-o turbo: Towards accelerated unified multimodal understanding and generation.arXiv preprint arXiv:2502.05415,

work page arXiv

[26] [26]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Li- juan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision).arXiv preprint arXiv:2309.17421, 9(1):1,

work page internal anchor Pith review arXiv

[28] [28]

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024a. Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. ...

work page internal anchor Pith review arXiv

[29] [29]

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 13040–13051, 2024b. Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang,...

work page internal anchor Pith review arXiv 2026

[30] [30]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836,

work page arXiv

[32] [32]

type": "function

A PROMPT A.1 SYSTEMPROMPT SYSTEM_PROMPT You are a helpful assistant. # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags:,→ <tools> { "type": "function", "function": { "name": "image_zoom_in_tool", "description": "Zoom in on a specific region of an image by cro...

work page 2026