arXiv preprint arXiv:2411.02937 , year=

Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset, self-adaptive planning agent , author= · 2024 · cs.CL · arXiv 2411.02937

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

open full Pith review browse 8 citing papers arXiv PDF

abstract

Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically predefined fixed retrieval processes, which causes two issues: (1) Non-adaptive Retrieval Queries. (2) Overloaded Retrieval Queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since the most required knowledge can be readily obtained with a standard two-step retrieval. To bridge the dataset gap, we first construct Dyn-VQA dataset, consisting of three types of "dynamic" questions, which require complex knowledge retrieval strategies variable in query, tool, and time: (1) Questions with rapidly changing answers. (2) Questions requiring multi-modal knowledge. (3) Multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate the human behavior in question solution which dynamically decomposes complex multimodal questions into sub-question chains with retrieval action. Extensive experiments prove the effectiveness of our OmniSearch, also provide direction for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch.

citation-role summary

background 2 baseline 1

citation-polarity summary

background 2 baseline 1

representative citing papers

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

cs.AI · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

MMAgent-R$^2$: Learning to Rerank and Reject for Agentic mRAG

cs.CV · 2026-07-08 · conditional · novelty 6.0

An agentic mRAG framework uses GRPO-trained visual reranking and active rejection to verify retrieved candidate entities, achieving state-of-the-art on three KB-VQA benchmarks.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 6.0

Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

cs.CL · 2025-05-28 · unverdicted · novelty 6.0

MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.

Agentic Reasoning for Large Language Models

cs.AI · 2026-01-18 · unverdicted · novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

cs.AI · 2025-04-28 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

citing papers explorer

Showing 8 of 8 citing papers.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees cs.CV · 2026-04-17 · unverdicted · none · ref 29
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain cs.AI · 2026-05-18 · unverdicted · none · ref 52 · 2 links
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces cs.CL · 2026-04-05 · unverdicted · none · ref 27
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
MMAgent-R$^2$: Learning to Rerank and Reject for Agentic mRAG cs.CV · 2026-07-08 · conditional · none · ref 16 · internal anchor
An agentic mRAG framework uses GRPO-trained visual reranking and active rejection to verify retrieved candidate entities, achieving state-of-the-art on three KB-VQA benchmarks.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory cs.CL · 2025-11-25 · unverdicted · none · ref 198
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation cs.CL · 2025-05-28 · unverdicted · none · ref 23
MoRE enables MLLMs to dynamically coordinate heterogeneous retrieval experts via Step-GRPO training, yielding over 7% average gains on open-domain QA benchmarks.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 183
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 9
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

arXiv preprint arXiv:2411.02937 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer