DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Pith reviewed 2026-05-11 14:37 UTC · model grok-4.3
The pith
Reinforcement learning lets vision-language models develop native image-based reasoning without pre-collected data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepEyes trains a vision-language model end-to-end with reinforcement learning so that it learns to think with images through active perception, using its intrinsic grounding capability rather than external tools or pre-collected reasoning data. A tailored data selection and reward strategy steers the model to strategically ground its reasoning in visual content. The outcome is significant gains on general perception and reasoning benchmarks together with better grounding, lower hallucination rates, and stronger mathematical reasoning. During training the model passes through distinct stages: initial exploratory perception gives way to efficient and accurate exploitation, accompanied by a多样化
What carries the argument
Active perception, the learned strategy by which the model decides when and how to ground its ongoing reasoning directly in visual information.
If this is right
- Performance improves on perception and reasoning benchmarks without any pre-collected reasoning traces.
- Grounding accuracy rises while hallucination rates fall, including on mathematical reasoning tasks.
- The model exhibits an internal progression from exploratory to exploitative visual behavior.
- Diverse thinking patterns appear that parallel human visual reasoning sequences.
Where Pith is reading between the lines
- The same reinforcement-learning incentive structure could be tested on video or audio sequences to induce analogous active-perception loops.
- If the approach scales, training pipelines for multimodal models may require far less curated reasoning data than current supervised routes.
- Longer-horizon tasks could reveal whether the emergent perception strategies remain stable or require additional reward shaping.
- Real-world deployment in dynamic environments would test whether the learned visual-grounding habits transfer beyond static benchmark images.
Load-bearing premise
The custom reward and data selection rules will steer the model toward genuine, useful visual grounding rather than superficial patterns that merely maximize the reward signal.
What would settle it
Run the same reinforcement learning loop with the visual-grounding reward terms removed or replaced by generic accuracy rewards; if benchmark gains and the reported evolution of perception behavior remain unchanged, the claim that active perception drives the improvements is falsified.
read the original abstract
Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning without requiring pre-collected reasoning data for cold-start supervised fine-tuning (SFT). Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of active perception from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepEyes, a vision-language model trained end-to-end via reinforcement learning to develop native 'thinking with images' capability through active perception. It claims this emerges without any cold-start supervised fine-tuning on pre-collected reasoning data, relying instead on tailored data selection and a custom reward strategy that leverages the model's intrinsic grounding. The approach reportedly yields significant gains on general perception and reasoning benchmarks, plus improvements in grounding, hallucination reduction, and mathematical reasoning, with observed behavioral evolution from exploration to exploitation and diverse human-like thinking patterns.
Significance. If the central claims hold under rigorous verification, the work would be moderately significant for multimodal AI research. It offers an empirical demonstration that RL can elicit integrated visual reasoning in VLMs without heavy reliance on SFT or external tools, potentially reducing data curation costs and enabling more autonomous active perception. The public code release is a clear strength for reproducibility.
major comments (3)
- [Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.
- [Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.
- [Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.
minor comments (2)
- [Abstract] The abstract and introduction use the phrase 'significant performance gains' without defining the term or providing supporting numbers; this should be replaced with concrete metrics or removed.
- [Methods] Notation for the active perception loop (e.g., how visual grounding actions are interleaved with text reasoning) is introduced informally; a clear algorithmic pseudocode or diagram would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We appreciate the opportunity to clarify the presentation of our results, methods, and analyses. We address each major comment below and commit to revisions that will strengthen the manuscript.
read point-by-point responses
-
Referee: [Results] Results section (and any associated tables/figures reporting benchmark scores): The manuscript claims 'significant performance gains' on perception and reasoning benchmarks but provides no quantitative deltas, baseline comparisons, statistical significance tests, or error bars. Without these, it is impossible to evaluate whether the gains exceed what data curation alone would produce, which is load-bearing for the claim that the RL mechanism (rather than selection) drives the result.
Authors: We agree that explicit quantitative comparisons are necessary to substantiate the claims. In the revised manuscript we will add tables reporting baseline scores, absolute and relative performance deltas, error bars from multiple runs, and statistical significance tests. We will also include a discussion comparing the observed gains against what data curation alone can achieve, thereby clarifying the contribution of the RL objective. revision: yes
-
Referee: [Methods] Methods section on reward design and data selection: The reward strategy is described at a high level as 'tailored' to encourage active perception, but no explicit formulation (e.g., components for grounding accuracy, reasoning utility, or format compliance) or weighting is given. This prevents assessment of whether the policy converges to integrative visual thinking or to superficial high-reward patterns such as periodic token emission, directly undermining the 'natively emerges' and 'causal integration' claims.
Authors: We acknowledge that the reward formulation was presented at too high a level. The revised Methods section will contain the complete mathematical definition of the reward, explicitly listing each component (grounding accuracy, reasoning utility, format compliance) together with the weighting coefficients. This will enable readers to evaluate convergence behavior and rule out superficial reward hacking. revision: yes
-
Referee: [Analysis] Analysis or ablation subsection (if present): There are no reported ablations that isolate the contribution of the RL reward versus data selection, nor any causal intervention (e.g., forcing or removing visual thought steps and measuring downstream accuracy change). The observed 'evolution from exploration to exploitation' is presented observationally; without metrics tracking grounding utility over training or controlled experiments, the mechanism remains unverified.
Authors: We agree that additional ablations and quantitative tracking would strengthen the mechanistic claims. The revision will include ablation experiments that compare full RL training against data-selection-only baselines, as well as plots of grounding utility and exploration/exploitation metrics across training steps. Full causal interventions (forcing or ablating visual thought steps) would require new controlled runs; we will therefore provide enhanced observational analysis and discuss the limits of the current evidence. revision: partial
Circularity Check
No circularity: empirical RL training with external benchmarks
full rationale
The paper presents an empirical end-to-end RL method for training VLMs to perform active perception and 'think with images' without cold-start SFT. Claims rest on performance gains measured against external perception/reasoning benchmarks and observed behavioral evolution during training. No mathematical derivations, equations, or self-referential definitions are present that would reduce any result to its inputs by construction. The approach is self-contained against independent evaluation data.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
ETCHR: Editing To Clarify and Harness Reasoning
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
-
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
-
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
-
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
-
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
A vision-language policy learns state-conditioned commitment depth to Pareto-dominate fixed-depth baselines on long-horizon puzzles, achieving up to 12.5 pp higher solve rate with 25% fewer actions.
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
-
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
-
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
-
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
-
Motion-o: Trajectory-Grounded Video Reasoning
Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.
-
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
-
Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing
Q-Probe introduces the first agentic IQA framework that scales to high resolutions using context-aware probing, a new Vista-Bench benchmark, and three-stage training to reach state-of-the-art performance across scales.
-
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
-
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.
-
Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection
ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.
-
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
-
Training Multi-Image Vision Agents via End2End Reinforcement Learning
IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing too...
-
Latent Visual Reasoning
Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
-
HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
HiDe is a training-free hierarchical decoupling method that separates key visual tokens from background interference in high-resolution MLLMs to achieve new state-of-the-art results on V*Bench, HRBench4K, and HRBench8K.
-
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
DeFacto trains multimodal models using counterfactual image variants and reinforcement learning rewards to improve both answer accuracy and evidence-answer consistency.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and...
-
Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
A structured Zoom-then-Diagnose paradigm with uncertainty-aware GRPO rewards improves lesion localization by 39.3% on liver, breast, and thyroid ultrasound VQA datasets by encouraging caution under ambiguity.
-
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.
-
Leveraging Latent Visual Reasoning in Silence
Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
-
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
-
Venus-DeFakerOne: Unified Fake Image Detection & Localization
DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while pr...
-
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
-
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...
-
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
-
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
-
Visual Reasoning through Tool-supervised Reinforcement Learning
ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.
-
Towards Long-horizon Agentic Multimodal Search
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
-
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
-
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
-
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.
-
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
-
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Per- ception tokens enhance visual reasoning in multimodal lan- guage models
Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. arXiv preprint arXiv:2412.03548,
-
[5]
Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, and James Zou. Dragonfly: Multi-resolution zoom supercharges large visual-language model.arXiv e-prints, pp. arXiv–2406, 2024a. Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasonin...
-
[6]
10 Published as a conference paper at ICLR 2026 Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024b. Zhe Chen, Jiannan Wu, Wenhai Wang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[7]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jia...
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798,
work page 2014
-
[9]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli ´c, and Furu Wei. Imagine while reasoning in space: Multimodal visua...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer vision– ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pp. 740–755. Springer,
work page 2014
-
[11]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowled...
work page internal anchor Pith review arXiv
-
[12]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, and Junnyong Loo. Pkrd-cot: A unified chain-of-thought prompting for multi-modal large language models in autonomous driving.arXiv preprint arXiv:2412.02025,
-
[14]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2503.07365,
-
[15]
12 Published as a conference paper at ICLR 2026 Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393,
work page Pith review arXiv 2026
-
[16]
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536,
work page internal anchor Pith review arXiv
-
[17]
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284,
work page internal anchor Pith review arXiv
-
[18]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song...
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, and Liqiang Nie. Mome: Mixture of multimodal experts for generalist multimodal large language models.arXiv preprint arXiv:2407.12709, 2024b. Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Lei Zhang, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, et al. Llava-mod: Making llava tiny via moe knowl...
-
[20]
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: In- centivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966,
work page internal anchor Pith review arXiv
-
[21]
Visual agents as fast and slow thinkers
Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862,
-
[22]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025a. Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973,
work page internal anchor Pith review arXiv
-
[24]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528,
work page internal anchor Pith review arXiv
-
[25]
Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, and Zhijie Deng. Show-o turbo: Towards accelerated unified multimodal understanding and generation.arXiv preprint arXiv:2502.05415,
-
[26]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Li- juan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision).arXiv preprint arXiv:2309.17421, 9(1):1,
work page internal anchor Pith review arXiv
-
[28]
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024a. Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. ...
work page internal anchor Pith review arXiv
-
[29]
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 13040–13051, 2024b. Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang,...
work page internal anchor Pith review arXiv 2026
-
[30]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836,
-
[32]
A PROMPT A.1 SYSTEMPROMPT SYSTEM_PROMPT You are a helpful assistant. # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags:,→ <tools> { "type": "function", "function": { "name": "image_zoom_in_tool", "description": "Zoom in on a specific region of an image by cro...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.