Vision-deepresearch benchmark: Rethinking visual and textual search for multimodal large language models, 2026

Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao · 2026 · arXiv 2602.02185

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

cs.AI · 2026-04-19 · unverdicted · novelty 7.0

SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

cs.CV · 2026-05-13 · unverdicted · novelty 5.0

ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.

citing papers explorer

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents · cs.CL · 2026-05-11 · unverdicted · none · ref 21
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents · cs.CL · 2026-05-11 · unverdicted · none · ref 16
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents · cs.AI · 2026-04-19 · unverdicted · none · ref 40
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents · cs.CL · 2026-04-07 · unverdicted · none · ref 36
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation · cs.CV · 2026-05-08 · unverdicted · none · ref 44
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence · cs.CV · 2026-05-13 · unverdicted · none · ref 30
