S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
hub Canonical reference
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Canonical reference. 91% of citing Pith papers cite this work as background.
abstract
Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.
LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object detection and visual grounding under rescaled intrinsics.
GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolution remote sensing benchmarks.
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks QFSD and AgriInsect.
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.
GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
citing papers explorer
-
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.