LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

11 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent developments in Video Large Language Models (Video LLMs) have enabled models to process hour-long videos and exhibit exceptional performance. Nonetheless, the Key-Value (KV) cache expands linearly over time, leading to substantial memory overhead and response delay, which are critical challenges in various real-world online applications, such as Deepseek services, autonomous driving, and robotics. To mitigate these issues, we propose LiveVLM, a training-free and query-agnostic framework specifically designed for online video understanding and real-time interaction. LiveVLM employs a Vision Sink Bucketing (VSB) mechanism to process video streams in real time, retain long-term video details, and eliminate redundant KVs. This mechanism uses vision-to-vision attention scores as its metric and seeks to maximize the coverage of contextual information during compression. Because a KV cache compressed in a query-agnostic manner inevitably retains information that is irrelevant to a specific query, LiveVLM incorporates a Position-agnostic KV Retrieval (PaR) mechanism to reduce interference from redundant context. The key point of PaR lies in decoupling positional embeddings to enhance the similarity between key tensors, thereby supporting efficient retrieval at the granularity of pages. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to achieve state-of-the-art accuracy among both training-free query-agnostic methods and training-based online models.
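The two ideas named in the abstract, score-based KV compression and page-level retrieval, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`keep_top_attended`, `page_summaries`, `retrieve_pages`), the plain top-k selection rule, the mean-pooled page summary, and the cosine similarity measure are all assumptions chosen for illustration, and the keys are assumed to be stored without positional embeddings applied.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def keep_top_attended(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring tokens, in original order.

    `scores` stands in for a per-token attention score (hypothetical input).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def page_summaries(keys: Sequence[Vector], page_size: int) -> List[List[float]]:
    """Mean-pool consecutive keys into one summary vector per page."""
    summaries = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        dim = len(page[0])
        summaries.append([sum(k[d] for k in page) / len(page) for d in range(dim)])
    return summaries


def retrieve_pages(query: Vector, keys: Sequence[Vector],
                   page_size: int, top_k: int) -> List[int]:
    """Return indices of the `top_k` pages most similar to the query.

    Assumes `keys` carry no positional embedding, so similarity reflects
    content only (an illustrative stand-in for position-agnostic keys).
    """
    summaries = page_summaries(keys, page_size)
    ranked = sorted(range(len(summaries)),
                    key=lambda p: cosine(query, summaries[p]), reverse=True)
    return sorted(ranked[:top_k])


if __name__ == "__main__":
    print(keep_top_attended([0.1, 0.9, 0.3, 0.8], budget=2))  # [1, 3]
    keys = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
    print(retrieve_pages([0, 1], keys, page_size=2, top_k=2))  # [1, 2]
```

Scoring whole pages rather than individual keys keeps the number of similarity computations proportional to the page count, which is the practical appeal of page-granularity retrieval in this sketch.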


citation-role summary: background 4

citation-polarity summary: still indexing

years: 2026 (11)


representative citing papers

Online Reasoning Video Object Segmentation

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.

Linear Scaling Video VLMs for Long Video Understanding

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

StateKV is an inference-time technique that replaces quadratic self-attention prefill in video VLMs with a fixed-capacity importance-based recurrent state, keeping accuracy near full attention on long-video benchmarks without retraining.
