MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
Vgent: Graph-based retrieval- reasoning-augmented generation for long video understanding
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 2polarities
background 2representative citing papers
VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.
citing papers explorer
-
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
-
GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing
GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.