The MemLens benchmark shows that long-context LVLMs lose accuracy as length increases while memory agents lose visual fidelity; multi-session reasoning stays below 30% for most systems, and neither approach solves the task alone.
M2A: Multimodal memory agent with dual-layer hybrid memory for long-term personalized interactions, 2026
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers
TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.
POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM backbones.
MemEye benchmark evaluates multimodal memory on visual granularity and evidence synthesis, finding that 13 methods across 4 VLMs struggle with fine details and temporal state changes.
citing papers explorer
- Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM backbones.