Multi-step visual reasoning with visual tokens scaling and verification

Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang · 2025 · arXiv 2506.07235

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

VABench: A Comprehensive Benchmark for Audio-Video Generation

cs.CV · 2025-12-10 · unverdicted · novelty 7.0

VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.

Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

cs.CV · 2026-04-06

citing papers explorer

VABench: A Comprehensive Benchmark for Audio-Video Generation · cs.CV · 2025-12-10 · unverdicted · none · ref 2
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning · cs.CV · 2026-05-27 · unverdicted · none · ref 5
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles · cs.LG · 2026-05-21 · unverdicted · none · ref 5
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images · cs.CV · 2026-04-13 · unverdicted · none · ref 4
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models · cs.CV · 2026-04-06 · unreviewed · ref 9
