Vision-deepresearch benchmark: Rethinking visual and textual search for multimodal large language models
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (7)
roles: background (3)
polarities: background (3)
representative citing papers
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
REVERSE uses tool-grounded trajectories and process rewards on visual grounding, query utility, and evidence discrimination to train a 4B model that outperforms retrieval-augmented baselines on Im2GPS3k and YFCC4k.
SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
citing papers explorer
- TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
  TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
- SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
  SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
- EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
  EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
- REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization
  REVERSE uses tool-grounded trajectories and process rewards on visual grounding, query utility, and evidence discrimination to train a 4B model that outperforms retrieval-augmented baselines on Im2GPS3k and YFCC4k.
- SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
  SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
- ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
  ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
- Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents