R3-Streaming uses cascaded control with age-aware memory forgetting and TB-GRPO reinforcement learning to reach SOTA scores of 57.92 on OVO-Bench and 76.36 on StreamingBench with 95-96% fewer visual tokens.
hub
Streammem: Query-agnostic kv cache memory for stream- ing video understanding.arXiv preprint arXiv:2508.15717
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
FlowNar achieves bounded memory and 3x higher throughput for streaming narration on Ego4D, EgoExo4D, and EpicKitchens100 by combining dynamic historical context removal with a Cross Linear Attentive Memory module.
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.
OmniMem achieves 2-4% higher accuracy than training-free baselines on long video benchmarks for audio-visual LLMs by using modality-aware KV cache allocation and perturbation-aware state selection, with further gains from budget-aware fine-tuning.
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.
citing papers explorer
-
OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs
OmniMem achieves 2-4% higher accuracy than training-free baselines on long video benchmarks for audio-visual LLMs by using modality-aware KV cache allocation and perturbation-aware state selection, with further gains from budget-aware fine-tuning.