hub Canonical reference

Large multimodal agents: A survey

Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, Guanbin Li · 2024 · arXiv 2402.15116

Canonical reference. 83% of citing Pith papers cite this work as background.

12 Pith papers citing it

Background 83% of classified citations

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1

citation-polarity summary

background 5 baseline 1

representative citing papers

Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

cs.CV · 2026-03-02 · unverdicted · novelty 7.0

MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

cs.CV · 2026-02-26 · conditional · novelty 7.0

SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.

ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs

cs.RO · 2026-02-09 · unverdicted · novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

cs.AI · 2025-05-26 · unverdicted · novelty 7.0

A new benchmark dataset and evaluation framework for testing multimodal AI agents on real field work tasks derived from on-site data and worker interviews.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems

cs.AI · 2025-10-15 · unverdicted · novelty 6.0

Introduces host agent and task lifecycle models plus 30 temporal logic properties to enable formal verification of liveness, safety, completeness, and fairness in agentic AI systems.

Retrieval-Augmented Generation for Natural Language Processing: A Survey

cs.CL · 2024-07-18 · accept · novelty 6.0

The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.

Large Language Model-Brained GUI Agents: A Survey

cs.AI · 2024-11-27 · unverdicted · novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

LLM-Powered AI Agent Systems and Their Applications in Industry

cs.AI · 2025-05-22 · unverdicted · novelty 2.0

A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

citing papers explorer

Showing 12 of 12 citing papers.

Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 40
HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning cs.CV · 2026-04-18 · unverdicted · none · ref 49
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents cs.CV · 2026-03-02 · unverdicted · none · ref 37
MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses cs.CV · 2026-02-26 · conditional · none · ref 53
SUPERGLASSES is the first VQA benchmark built from actual smart glasses data, and SUPERLENS is an agent using automatic object detection, query decoupling, and multimodal search that outperforms GPT-4o by 2.19% on it.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs cs.RO · 2026-02-09 · unverdicted · none · ref 119
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks cs.AI · 2025-05-26 · unverdicted · none · ref 25
A new benchmark dataset and evaluation framework for testing multimodal AI agents on real field work tasks derived from on-site data and worker interviews.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? cs.AI · 2024-07-01 · accept · none · ref 101
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems cs.AI · 2025-10-15 · unverdicted · none · ref 27
Introduces host agent and task lifecycle models plus 30 temporal logic properties to enable formal verification of liveness, safety, completeness, and fairness in agentic AI systems.
Retrieval-Augmented Generation for Natural Language Processing: A Survey cs.CL · 2024-07-18 · accept · none · ref 184
The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.
Large Language Model-Brained GUI Agents: A Survey cs.AI · 2024-11-27 · unverdicted · none · ref 60
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 298
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
LLM-Powered AI Agent Systems and Their Applications in Industry cs.AI · 2025-05-22 · unverdicted · none · ref 4
A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

Large multimodal agents: A survey

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer