S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
Visual agentic reinforcement fine-tuning
14 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 14
representative citing papers
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-horizon visual reasoning benchmarks.
ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.
TAPO corrects credit misassignment in RL for multimodal search agents by using tool parameter similarity to share advantages across equivalent actions.
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
When to Re-Commit learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
Generation-to-Understanding synergy lets multimodal models create self-generated visual edits as intermediate steps, improving performance on twelve benchmarks while revealing limits in task-aligned self-reflection.
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
SimpleSearch-VL improves Qwen3-VL multimodal agent baselines by 15.8-16 points on average using 7K total training examples and reaches parity with Gemini-3-Pro on the 30B variant.
citing papers explorer
- Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection
  ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.
- TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents
  TAPO corrects credit misassignment in RL for multimodal search agents by using tool parameter similarity to share advantages across equivalent actions.
- When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
  Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
- CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
  CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.