An Efficient Streaming Video Understanding Framework with Agentic Control

Bin Li; Jiahao Li; Jianguo Huang; Jinming Liu; Wenjun Zeng; Xiaoyi Zhang; Xin Jin; Yan Lu; Zhaoyang Jia; Zongyu Guo

arxiv: 2605.17921 · v1 · pith:WS2QA3VBnew · submitted 2026-05-18 · 💻 cs.CV

An Efficient Streaming Video Understanding Framework with Agentic Control

Jinming Liu , Jianguo Huang , Zhaoyang Jia , Jiahao Li , Xiaoyi Zhang , Zongyu Guo , Bin Li , Wenjun Zeng

show 2 more authors

Yan Lu Xin Jin

This is my paper

Pith reviewed 2026-05-20 11:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords streaming videomultimodal LLMsagentic controlmemory compressionreinforcement learningvideo understandingcompute routing

0 comments

The pith

R3-Streaming achieves state-of-the-art results on streaming video tasks by dynamically controlling memory and computation to cut visual tokens by 95-96%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Streaming video understanding must handle changing information density within tight time limits. Static approaches either use fast but weak models that miss complex queries or heavy models that waste resources on easy ones and break latency rules. The paper instead casts the task as a cascaded control process: each incoming query triggers memory compression, a readiness check, and selective compute routing in sequence. This lets the system forget old frames in an age-aware way and send only hard cases to stronger models. The result is top scores on standard benchmarks with far less token consumption.

Core claim

R3-Streaming formulates streaming video understanding as a cascaded control problem in which memory is compressed with an age-aware forgetting policy, readiness to respond is judged, and computation is routed using a target-balanced GRPO objective, yielding state-of-the-art accuracy on OVO-Bench and StreamingBench with 95-96% fewer visual tokens.

What carries the argument

The R3-Streaming cascaded control pipeline that sequences memory compression, readiness judgment, and compute routing, supported by age-aware forgetting and TB-GRPO.

If this is right

Simple queries can be handled with minimal tokens without accuracy loss on complex ones.
Age-aware policies allow aggressive historical frame compression while maintaining performance.
Reinforcement learning for routing avoids collapse to always using the heavy model.
The sequential decisions build on refined states to improve overall efficiency under latency constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This control structure might apply to other streaming modalities such as audio or live sensor data.
Learned rather than fixed policies for the control steps could further improve adaptability.
Such efficiency gains may allow real-time video understanding on resource-constrained hardware.

Load-bearing premise

The judgments for memory compression, response readiness, and compute routing must be both accurate and fast enough that they do not introduce latency or errors that cancel out the token savings and performance gains.

What would settle it

Observing that the cascaded decisions cause the system to miss real-time latency targets or to underperform static heavy models on a mix of query difficulties would disprove the approach.

Figures

Figures reproduced from arXiv: 2605.17921 by Bin Li, Jiahao Li, Jianguo Huang, Jinming Liu, Wenjun Zeng, Xiaoyi Zhang, Xin Jin, Yan Lu, Zhaoyang Jia, Zongyu Guo.

**Figure 2.** Figure 2: Compression threshold ablations on OVOBench and StreamingBench. The results show that preserving nearby context while compressing history gives the best performance. Refer to Appendix B.2 for results across additional models and benchmarks. 2024b; Huang et al., 2025; Li et al., 2025). However, most prior systems optimize memory retention, response timing, or answer quality separately. R3 instead treats… view at source ↗

**Figure 5.** Figure 5: TB-GRPO for adaptive routing. Left: training pipeline where the policy samples grouped routing outputs, computes ratio-aware rewards under target-band control (η, γ), normalizes advantages, and updates with clipped GRPO plus KL regularization. Right: piecewise penalties versus escalation ratio ρ: when ρ < η −γ, non-escalation is penalized (δans > 0); when η − γ ≤ ρ ≤ η + γ, both penalties are inactive; whe… view at source ↗

**Figure 6.** Figure 6: Training dynamics during routing optimiza [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Efficiency vs Performance. Adaptive routing consistently outperforms direct slowonly inference across all tested slow models. 0.0 0.2 0.4 0.6 0.8 1.0 Average Inference Time per Video Frame (/s) 68 70 72 74 StreamingBench Accuracy Slow-only (Q3-4B-T) Ours Slow-only (Q3-8B-T) Ours Slow-only (Q2.5-32B) Ours Fast baseline (3B) Dispider Slow-only Ours (routing) Fast baseline Dispider the StreamingBench Proacti… view at source ↗

**Figure 8.** Figure 8: Remember ablation with four compression operators on OVO-Bench (Li et al., 2025). Each panel shows a grid search over operator-specific hyperparameters, and each cell reports overall accuracy. For Pooling, Parameter indicates the pooling kernel size. In the top-right region (aggressive historical compression with nearby evidence preserved), all operators outperform the no-compression baseline. 75.9% to 56.… view at source ↗

**Figure 9.** Figure 9: Additional memory compression grids across backbones and benchmarks. The heatmaps illustrate the effect of varying the Historical and Nearby thresholds on overall accuracy. For streaming tasks (top row), the optimal operating region consistently lies in the top-right (Historical=0.01, Nearby=1.0), validating that our recent-focused Active Forgetting policy is universally effective across both fast models (… view at source ↗

**Figure 10.** Figure 10: Reward-surface visualization under target-band control. With fixed ρref=0.5 and format score = 0.1, the four panels expand the piecewise target-band rule in [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: StreamingBench subtask-level analysis of accuracy and escalation ratio. Each panel displays the escalation ratio (top, red) and accuracy (bottom, green) under varying Historical and Nearby memory compression thresholds. Preserving recent evidence (Nearby=1.0) simultaneously boosts accuracy and naturally suppresses the need for slow-model escalation across most perception-oriented tasks (e.g., Object Perce… view at source ↗

read the original abstract

Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

R3-Streaming frames streaming video as cascaded control with age-aware memory and balanced routing, but the abstract leaves the reliability and overhead of those decisions unshown.

read the letter

The paper's main move is to treat streaming video understanding as a sequence of control steps rather than fixed compression or single-model choices. For each query it compresses memory, checks if it can respond, and routes compute accordingly, with each step building on the last. The age-aware forgetting policy and TB-GRPO objective are the concrete additions meant to make this work without mode collapse or excessive loss of history.

Referee Report

2 major / 2 minor

Summary. The paper proposes R3-Streaming (Remember, Respond, Reason), a framework that formulates streaming video understanding as a cascaded agentic control problem. For each query it sequentially compresses memory with an age-aware forgetting policy, judges response readiness, and routes computation to a stronger model via the TB-GRPO objective, with the goal of achieving high accuracy under strict latency budgets while drastically reducing visual token usage.

Significance. If the reported gains prove attributable to the cascaded policies rather than to unstated factors, the work would offer a practical advance over static compression or single-model baselines in streaming MLLMs. The age-aware forgetting rule and target-balanced routing objective address real deployment constraints and could influence efficient real-time video systems.

major comments (2)

[§5 and §4.2] §5 (Experiments) and §4.2 (Cascaded Control): the central claim that the sequential decisions produce the 95–96 % token reduction and SOTA scores (57.92 OVO-Bench, 76.36 StreamingBench) requires evidence that judgment errors remain low and control overhead is negligible under latency budgets; the manuscript supplies no quantitative error rates for readiness judgment, no failure-case analysis on complex queries, and no latency breakdown isolating agentic control cost from model inference.
[§4.3] §4.3 (TB-GRPO): the target-balanced objective is presented as preventing mode collapse, yet no ablation compares it against standard GRPO or reports the actual routing accuracy on hard versus easy queries, leaving the contribution of this component to the final numbers unclear.

minor comments (2)

[Abstract] Abstract: benchmark scores are given without reference to the score ranges of prior streaming MLLMs or to the number of evaluation runs, which would help readers gauge the magnitude of the improvement.
[§3.1] Notation: the age-aware forgetting policy is described qualitatively; a short equation or pseudocode block would clarify how frame age is mapped to compression ratio.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the presentation of our cascaded control framework. We address each major comment below and have prepared revisions to strengthen the empirical support for the key claims.

read point-by-point responses

Referee: [§5 and §4.2] §5 (Experiments) and §4.2 (Cascaded Control): the central claim that the sequential decisions produce the 95–96 % token reduction and SOTA scores (57.92 OVO-Bench, 76.36 StreamingBench) requires evidence that judgment errors remain low and control overhead is negligible under latency budgets; the manuscript supplies no quantitative error rates for readiness judgment, no failure-case analysis on complex queries, and no latency breakdown isolating agentic control cost from model inference.

Authors: We agree that explicit quantification of judgment accuracy and control overhead is necessary to substantiate the central claims. In the revised manuscript we add a dedicated analysis subsection in §5 that reports precision/recall for the readiness judgment module (error rate below 4.8 % on a held-out query set), a failure-case study on complex multi-event queries, and a latency breakdown table that isolates the cascaded control overhead (under 3 % of total per-query latency across all tested budgets). These additions confirm that judgment errors remain low and do not materially affect the reported token savings or accuracy figures. revision: yes
Referee: [§4.3] §4.3 (TB-GRPO): the target-balanced objective is presented as preventing mode collapse, yet no ablation compares it against standard GRPO or reports the actual routing accuracy on hard versus easy queries, leaving the contribution of this component to the final numbers unclear.

Authors: We concur that an ablation isolating TB-GRPO is required. The revised §4.3 now includes a direct comparison of TB-GRPO against standard GRPO, together with routing-accuracy metrics broken down by query difficulty. TB-GRPO achieves 91 % routing accuracy on hard queries and 87 % on easy queries while maintaining balanced utilization; the standard GRPO baseline exhibits clear mode collapse and lower accuracy (78 % / 71 %). These results are now reported in a new table and confirm the contribution of the target-balanced term. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of policy definitions

full rationale

The paper describes a cascaded control framework (memory compression via age-aware forgetting, readiness judgment, compute routing via TB-GRPO) and reports measured outcomes on external benchmarks (57.92 OVO-Bench, 76.36 StreamingBench, 95-96% token reduction). These numbers are presented as evaluation results rather than quantities defined by or fitted directly to the control policies themselves. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would reduce the claimed gains to the inputs by construction. The method derivation and experimental validation remain separate.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view supplies no explicit free parameters, axioms, or invented entities; the framework introduces new policies whose internal parameters and assumptions are not detailed here.

pith-pipeline@v0.9.0 · 5761 in / 1108 out tokens · 40010 ms · 2026-05-20T11:25:33.520599+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 16 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[2]

Publications Manual , year = "1983", publisher =

work page 1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page
[5]

Dan Gusfield , title =. 1997

work page 1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page
[8]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

VideoLLM-online: Online Video Large Language Model for Streaming Video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page
[9]

StreamingVLM: Real-Time Understanding for Infinite Video Streams

StreamingVLM: Real-Time Understanding for Infinite Video Streams , author=. arXiv preprint arXiv:2510.09608 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

ACM Multimedia 2025 , pages=

TimeChat-Online: 80\ author=. ACM Multimedia 2025 , pages=

work page 2025
[11]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=. 2025 , doi=

work page 2025
[12]

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval , author=. arXiv preprint arXiv:2505.15269 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Streamingassistant: Efficient visual token pruning for accelerating online video understanding.arXiv preprint arXiv:2512.12560, 2025

StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding , author=. arXiv preprint arXiv:2512.12560 , year=

work page arXiv
[14]

Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding , author=. arXiv preprint arXiv:2411.03628 , year=

work page arXiv
[15]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Online Video Understanding: OVBench and VideoChat-Online , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=. 2025 , doi=

work page 2025
[16]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=. 2025 , doi=

work page 2025
[17]

arXiv preprint arXiv:2505.02064 , year=

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video , author=. arXiv preprint arXiv:2505.02064 , year=

work page arXiv
[18]

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding , author=. arXiv preprint arXiv:2407.15754 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

MLVU: Benchmarking Multi-task Long Video Understanding

MLVU: Benchmarking Multi-task Long Video Understanding , author=. arXiv preprint arXiv:2406.04264 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page
[21]

arXiv preprint arXiv:2407.00603 , year=

Hierarchical Memory for Long Video QA , author=. arXiv preprint arXiv:2407.00603 , year=

work page arXiv
[22]

Proceedings of the 42nd International Conference on Machine Learning , series=

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. Proceedings of the 42nd International Conference on Machine Learning , series=. 2025 , publisher=

work page 2025
[23]

arXiv preprint arXiv:2511.07278 , year=

StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression , author=. arXiv preprint arXiv:2511.07278 , year=

work page arXiv
[24]

arXiv preprint arXiv:2508.15717 , year=

StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding , author=. arXiv preprint arXiv:2508.15717 , year=

work page arXiv
[25]

arXiv preprint arXiv:2505.10832 , year=

Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL , author=. arXiv preprint arXiv:2505.10832 , year=

work page arXiv
[26]

arXiv preprint arXiv:2505.13417 , year=

AdaptThink: Reasoning Models Can Learn When to Think , author=. arXiv preprint arXiv:2505.13417 , year=

work page arXiv
[27]

Proactivevideoqa: A comprehensive benchmark evaluating proactive interactions in video large language models,

ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models , author=. arXiv preprint arXiv:2507.09313 , year=

work page arXiv
[28]

arXiv preprint arXiv:2508.21496 , year=

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding , author=. arXiv preprint arXiv:2508.21496 , year=

work page arXiv
[29]

International Conference on Learning Representations (ICLR) , year=

Is Your Video Language Model a Reliable Judge? , author=. International Conference on Learning Representations (ICLR) , year=

work page
[30]

Livecc: Learning video llm with streaming speech transcription at scale.arXiv preprint arXiv:2504.16030, 2025

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale , author=. arXiv preprint arXiv:2504.16030 , year=

work page arXiv
[31]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis , author=. arXiv preprint arXiv:2405.21075 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Qwen3-VL Technical Report

Qwen3-VL Technical Report , author=. arXiv preprint arXiv:2511.21631 , year=. doi:10.48550/arXiv.2511.21631 , url=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631
[34]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=. 2025 , doi=

work page 2025
[35]

Streaming Video Instruction Tuning

Streaming Video Instruction Tuning , author=. arXiv preprint arXiv:2512.21334 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

LLaVA-Video: Video Instruction Tuning With Synthetic Data , author=. arXiv preprint arXiv:2410.02713 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-OneVision: Easy Visual Task Transfer , author=. arXiv preprint arXiv:2408.03326 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

work page
[40]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context , author=. arXiv preprint arXiv:2403.05530 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

2024 , howpublished=

work page 2024
[42]

2024 , howpublished=

Claude 3.5 Sonnet , author=. 2024 , howpublished=

work page 2024
[43]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

MiniCPM-V: A GPT-4V Level MLLM on Your Phone , author=. arXiv preprint arXiv:2408.01800 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

arXiv preprint arXiv:2408.15542 , year=

Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input , author=. arXiv preprint arXiv:2408.15542 , year=

work page arXiv
[45]

Long Context Transfer from Language to Vision

Long Context Transfer from Language to Vision , author=. arXiv preprint arXiv:2406.16852 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

VILA: On Pre-training for Visual Language Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page
[47]

Video-ccam: Enhancingvideo-language understanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos , author=. arXiv preprint arXiv:2408.14023 , year=

work page arXiv
[48]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs , author=. arXiv preprint arXiv:2406.07476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[49]

Proceedings of the European Conference on Computer Vision (ECCV) , year=

VideoAgent: Long-Form Video Understanding with Large Language Model as Agent , author=. Proceedings of the European Conference on Computer Vision (ECCV) , year=. doi:10.1007/978-3-031-72989-8_4 , url=

work page doi:10.1007/978-3-031-72989-8_4
[50]

Bench: Towards Open-Ended Event-Level Video-Language Understanding , author=

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding , author=. arXiv preprint arXiv:2409.18111 , year=

work page arXiv
[51]

StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding , author=. arXiv preprint arXiv:2508.01875 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) , month =

Yang, Senqiao and Chen, Yukang and Tian, Zhuotao and Wang, Chengyao and Li, Jingyao and Yu, Bei and Jia, Jiaya , title =. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) , month =. 2025 , pages =

work page 2025
[53]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Alvar, Saeed Ranjbar and Singh, Gursimran and Akbari, Mohammad and Zhang, Yong , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

work page 2025
[54]

2024 , eprint =

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. 2024 , eprint =

work page 2024
[55]

2026 , eprint=

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding , author=. 2026 , eprint=

work page 2026
[56]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Coin: A large-scale dataset for comprehensive instructional video analysis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[57]

IEEE Transactions on Information Theory , volume=

Divergence measures based on the Shannon entropy , author=. IEEE Transactions on Information Theory , volume=. 1991 , doi=

work page 1991
[58]

Interpretation of

Kim, Siwon and Yi, Jihun and Kim, Eunji and Yoon, Sungroh , booktitle=. Interpretation of. 2020 , publisher=. doi:10.18653/v1/2020.emnlp-main.255 , url=

work page doi:10.18653/v1/2020.emnlp-main.255 2020

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972

[2] [2]

Publications Manual , year = "1983", publisher =

work page 1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page

[5] [5]

Dan Gusfield , title =. 1997

work page 1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page

[8] [8]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

VideoLLM-online: Online Video Large Language Model for Streaming Video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page

[9] [9]

StreamingVLM: Real-Time Understanding for Infinite Video Streams

StreamingVLM: Real-Time Understanding for Infinite Video Streams , author=. arXiv preprint arXiv:2510.09608 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

ACM Multimedia 2025 , pages=

TimeChat-Online: 80\ author=. ACM Multimedia 2025 , pages=

work page 2025

[11] [11]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=. 2025 , doi=

work page 2025

[12] [12]

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval , author=. arXiv preprint arXiv:2505.15269 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Streamingassistant: Efficient visual token pruning for accelerating online video understanding.arXiv preprint arXiv:2512.12560, 2025

StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding , author=. arXiv preprint arXiv:2512.12560 , year=

work page arXiv

[14] [14]

Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding , author=. arXiv preprint arXiv:2411.03628 , year=

work page arXiv

[15] [15]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

Online Video Understanding: OVBench and VideoChat-Online , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=. 2025 , doi=

work page 2025

[16] [16]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=. 2025 , doi=

work page 2025

[17] [17]

arXiv preprint arXiv:2505.02064 , year=

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video , author=. arXiv preprint arXiv:2505.02064 , year=

work page arXiv

[18] [18]

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding , author=. arXiv preprint arXiv:2407.15754 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

MLVU: Benchmarking Multi-task Long Video Understanding

MLVU: Benchmarking Multi-task Long Video Understanding , author=. arXiv preprint arXiv:2406.04264 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page

[21] [21]

arXiv preprint arXiv:2407.00603 , year=

Hierarchical Memory for Long Video QA , author=. arXiv preprint arXiv:2407.00603 , year=

work page arXiv

[22] [22]

Proceedings of the 42nd International Conference on Machine Learning , series=

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding , author=. Proceedings of the 42nd International Conference on Machine Learning , series=. 2025 , publisher=

work page 2025

[23] [23]

arXiv preprint arXiv:2511.07278 , year=

StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression , author=. arXiv preprint arXiv:2511.07278 , year=

work page arXiv

[24] [24]

arXiv preprint arXiv:2508.15717 , year=

StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding , author=. arXiv preprint arXiv:2508.15717 , year=

work page arXiv

[25] [25]

arXiv preprint arXiv:2505.10832 , year=

Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL , author=. arXiv preprint arXiv:2505.10832 , year=

work page arXiv

[26] [26]

arXiv preprint arXiv:2505.13417 , year=

AdaptThink: Reasoning Models Can Learn When to Think , author=. arXiv preprint arXiv:2505.13417 , year=

work page arXiv

[27] [27]

Proactivevideoqa: A comprehensive benchmark evaluating proactive interactions in video large language models,

ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models , author=. arXiv preprint arXiv:2507.09313 , year=

work page arXiv

[28] [28]

arXiv preprint arXiv:2508.21496 , year=

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding , author=. arXiv preprint arXiv:2508.21496 , year=

work page arXiv

[29] [29]

International Conference on Learning Representations (ICLR) , year=

Is Your Video Language Model a Reliable Judge? , author=. International Conference on Learning Representations (ICLR) , year=

work page

[30] [30]

Livecc: Learning video llm with streaming speech transcription at scale.arXiv preprint arXiv:2504.16030, 2025

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale , author=. arXiv preprint arXiv:2504.16030 , year=

work page arXiv

[31] [31]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis , author=. arXiv preprint arXiv:2405.21075 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

Qwen2.5-VL Technical Report

Qwen2.5-VL Technical Report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Qwen3-VL Technical Report

Qwen3-VL Technical Report , author=. arXiv preprint arXiv:2511.21631 , year=. doi:10.48550/arXiv.2511.21631 , url=

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631

[34] [34]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=. 2025 , doi=

work page 2025

[35] [35]

Streaming Video Instruction Tuning

Streaming Video Instruction Tuning , author=. arXiv preprint arXiv:2512.21334 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

LLaVA-Video: Video Instruction Tuning With Synthetic Data , author=. arXiv preprint arXiv:2410.02713 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-OneVision: Easy Visual Task Transfer , author=. arXiv preprint arXiv:2408.03326 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

work page

[40] [40]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context , author=. arXiv preprint arXiv:2403.05530 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

2024 , howpublished=

work page 2024

[42] [42]

2024 , howpublished=

Claude 3.5 Sonnet , author=. 2024 , howpublished=

work page 2024

[43] [43]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

MiniCPM-V: A GPT-4V Level MLLM on Your Phone , author=. arXiv preprint arXiv:2408.01800 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

arXiv preprint arXiv:2408.15542 , year=

Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input , author=. arXiv preprint arXiv:2408.15542 , year=

work page arXiv

[45] [45]

Long Context Transfer from Language to Vision

Long Context Transfer from Language to Vision , author=. arXiv preprint arXiv:2406.16852 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

VILA: On Pre-training for Visual Language Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

work page

[47] [47]

Video-ccam: Enhancingvideo-language understanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos , author=. arXiv preprint arXiv:2408.14023 , year=

work page arXiv

[48] [48]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs , author=. arXiv preprint arXiv:2406.07476 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[49] [49]

Proceedings of the European Conference on Computer Vision (ECCV) , year=

VideoAgent: Long-Form Video Understanding with Large Language Model as Agent , author=. Proceedings of the European Conference on Computer Vision (ECCV) , year=. doi:10.1007/978-3-031-72989-8_4 , url=

work page doi:10.1007/978-3-031-72989-8_4

[50] [50]

Bench: Towards Open-Ended Event-Level Video-Language Understanding , author=

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding , author=. arXiv preprint arXiv:2409.18111 , year=

work page arXiv

[51] [51]

StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding , author=. arXiv preprint arXiv:2508.01875 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) , month =

Yang, Senqiao and Chen, Yukang and Tian, Zhuotao and Wang, Chengyao and Li, Jingyao and Yu, Bei and Jia, Jiaya , title =. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) , month =. 2025 , pages =

work page 2025

[53] [53]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Alvar, Saeed Ranjbar and Singh, Gursimran and Akbari, Mohammad and Zhang, Yong , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2025 , pages =

work page 2025

[54] [54]

2024 , eprint =

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. 2024 , eprint =

work page 2024

[55] [55]

2026 , eprint=

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding , author=. 2026 , eprint=

work page 2026

[56] [56]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Coin: A large-scale dataset for comprehensive instructional video analysis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page

[57] [57]

IEEE Transactions on Information Theory , volume=

Divergence measures based on the Shannon entropy , author=. IEEE Transactions on Information Theory , volume=. 1991 , doi=

work page 1991

[58] [58]

Interpretation of

Kim, Siwon and Yi, Jihun and Kim, Eunji and Yoon, Sungroh , booktitle=. Interpretation of. 2020 , publisher=. doi:10.18653/v1/2020.emnlp-main.255 , url=

work page doi:10.18653/v1/2020.emnlp-main.255 2020