ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

Anzhou Li; Canyu Zhao; Cheng Zou; Chunhua Shen; Hao Chen; Hao Zhong; Jingdong Chen; Ming Yang; Mingyu Liu; Muzhi Zhu

arxiv: 2505.21457 · v2 · pith:VNAKPWUDnew · submitted 2025-05-27 · 💻 cs.CV · cs.AI

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Mingyu Liu, Zheng Huang, Anzhou Li, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

keywords: perception, active, active-o3, mllms, capabilities, efficient, framework, further

Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. With the rise of Multimodal Large Language Models (MLLMs) as central planners in robotic systems, the lack of methods for equipping MLLMs with active perception has become a key gap. We first provide a systematic definition of MLLM-based active perception tasks and show that GPT-o3's zoom-in strategy can be viewed as a special case, though it suffers from low efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-o3, a reinforcement learning framework built on GRPO that equips MLLMs with active perception capabilities. Leveraging a modular sensing-action design and a dual-form reward, ACTIVE-o3 autonomously learns efficient and stable region selection strategies without explicit region-selection supervision. We further establish a comprehensive benchmark covering both open-world tasks, including small- and dense-object grounding, and domain-specific scenarios, including remote sensing, autonomous driving, and interactive segmentation. Experimental results demonstrate that ACTIVE-o3 significantly enhances active perception capabilities compared to baselines. Moreover, we show that our framework not only preserves the model's general understanding ability but can also serve as a proxy task for leveraging perception data, further improving performance on benchmarks such as RealWorldQA and MME.
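
The abstract states that ACTIVE-o3 is built on GRPO. As a rough illustration of the group-relative idea behind GRPO (not the authors' implementation), the sketch below normalizes the rewards of several sampled responses to the same prompt against their own group mean and standard deviation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for this example; the paper's dual-form reward is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four region-selection proposals on one image.
rewards = [0.2, 0.9, 0.4, 0.9]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate learned value function is needed to rank samples within a group.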

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
cs.CV 2026-04 unverdicted novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

Boosting Reasoning in Large Multimodal Models via Activation Replay
cs.CV 2025-11 unverdicted novelty 6.0

Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.

Perception-Aware Policy Optimization for Multimodal Reasoning
cs.CL 2025-07 unverdicted novelty 6.0

PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision
cs.CV 2026-05 unverdicted novelty 5.0

Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
cs.AI 2026-05 unverdicted novelty 5.0

DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without ...

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
cs.CV 2025-09 unverdicted novelty 5.0

Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.