Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models · 2024 · arXiv 2406.09403

29 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

cs.CV · 2025-12-14 · unverdicted · novelty 7.0

DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

Latent Visual Reasoning

cs.CV · 2025-09-29 · unverdicted · novelty 7.0

Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.

PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

cs.CL · 2025-06-06 · conditional · novelty 7.0

PuzzleWorld benchmark reveals state-of-the-art AI models solve only 18% of complex puzzlehunt problems with 40% stepwise accuracy, matching novices but trailing enthusiasts, while fine-tuning on traces yields modest gains.

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

Visual-OPSD distills reasoning from a privileged visual-thought teacher to a text-only student using on-policy JSD, delivering +3.40pp accuracy gain and 14.3x speedup over the generative teacher on nine benchmarks.

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

IPT supervision improves spatial reasoning in VLMs on perspective taking, path tracing, and multiview counting tasks, often outperforming textual chain-of-thought while remaining consistent with observed inputs.

VESTA: Visual Exploration with Statistical Tool Agents

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

VESTA introduces dynamic tool creation for VLMs that outperforms static-tool and no-tool baselines on distribution fitting, time series, and astronomy tasks in the new DAWN benchmark.

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.

Visual Reasoning through Tool-supervised Reinforcement Learning

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

Mull-Tokens: Modality-Agnostic Latent Thinking

cs.CV · 2025-12-11 · unverdicted · novelty 6.0

Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

cs.CV · 2025-07-02 · unverdicted · novelty 6.0

Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

cs.CV · 2025-03-27 · unverdicted · novelty 6.0

CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

cs.CV · 2025-03-21 · conditional · novelty 6.0

Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

cs.CL · 2025-01-13 · unverdicted · novelty 6.0

MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

cs.CV · 2026-06-24 · unverdicted · novelty 5.0

V-Zero trains MLLMs for visual reasoning without answer labels by gating on-policy distillation trajectories using contrastive evidence from relevant versus negative image crops.

Latent Visual States for Efficient Multimodal Reasoning

cs.CV · 2026-06-23 · unverdicted · novelty 5.0

EVA generates adaptive Latent_slot tokens as internal visual thoughts, trained end-to-end with text tokens via D-GSPO on the EVA-230K dataset, claiming performance gains and better inference efficiency.

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

cs.AI · 2026-06-16 · unverdicted · novelty 5.0

MathVis-Fine proposes a dataset with fine-grained visual annotations and dependency ratings plus a progressive two-stage training paradigm to align visual supervision with sample-specific necessity in multimodal mathematical reasoning.

citing papers explorer

Showing 18 of 18 citing papers after filters.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

cs.CV · 2026-06-04 · unverdicted · none · ref 37

Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

cs.CL · 2026-05-30 · unverdicted · none · ref 2

Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

cs.CV · 2026-05-20 · unverdicted · none · ref 18

Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · none · ref 13

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

cs.CV · 2026-06-17 · unverdicted · none · ref 9

Visual-OPSD distills reasoning from a privileged visual-thought teacher to a text-only student using on-policy JSD, delivering +3.40pp accuracy gain and 14.3x speedup over the generative teacher on nine benchmarks.

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

cs.AI · 2026-06-02 · unverdicted · none · ref 17

IPT supervision improves spatial reasoning in VLMs on perspective taking, path tracing, and multiview counting tasks, often outperforming textual chain-of-thought while remaining consistent with observed inputs.

VESTA: Visual Exploration with Statistical Tool Agents

cs.AI · 2026-05-29 · unverdicted · none · ref 20

VESTA introduces dynamic tool creation for VLMs that outperforms static-tool and no-tool baselines on distribution fitting, time series, and astronomy tasks in the new DAWN benchmark.

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

cs.CV · 2026-05-27 · unverdicted · none · ref 5

SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

cs.CV · 2026-05-27 · unverdicted · none · ref 23

ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

cs.AI · 2026-05-11 · unverdicted · none · ref 22 · 3 links

Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.

Visual Reasoning through Tool-supervised Reinforcement Learning

cs.CV · 2026-04-21 · unverdicted · none · ref 4

ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

cs.CV · 2026-06-24 · unverdicted · none · ref 48

V-Zero trains MLLMs for visual reasoning without answer labels by gating on-policy distillation trajectories using contrastive evidence from relevant versus negative image crops.

Latent Visual States for Efficient Multimodal Reasoning

cs.CV · 2026-06-23 · unverdicted · none · ref 15

EVA generates adaptive Latent_slot tokens as internal visual thoughts, trained end-to-end with text tokens via D-GSPO on the EVA-230K dataset, claiming performance gains and better inference efficiency.

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

cs.AI · 2026-06-16 · unverdicted · none · ref 22

MathVis-Fine proposes a dataset with fine-grained visual annotations and dependency ratings plus a progressive two-stage training paradigm to align visual supervision with sample-specific necessity in multimodal mathematical reasoning.

ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

cs.RO · 2026-04-20 · unverdicted · none · ref 16

ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

cs.CV · 2026-04-10 · unverdicted · none · ref 19

MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

cs.CL · 2026-06-09 · unverdicted · none · ref 96

A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.

UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

cs.CV · 2026-05-05 · unverdicted · none · ref 44

UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
