pith. machine review for the scientific record.

arxiv: 2604.09000 · v2 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords agent memory compression · streaming video understanding · memory graph · vision agents · video understanding · memory retrieval · long-term memory · memory pruning

The pith

StreamMeCo compresses vision agent memory graphs by 70 percent to speed retrieval while holding or improving video understanding accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the storage and computation costs that arise when vision agents keep memories of long streaming videos. It builds a memory graph from past observations and uses the graph's connectivity to decide which memory nodes to keep or discard. Isolated nodes, which carry no connectivity signal, are thinned by an edge-free minmax sampling step, while connected nodes undergo weight-based pruning that removes less central information. A separate time-decay rule at retrieval further offsets any accuracy loss from the compression. On robot, web, and long-video benchmarks the method delivers faster memory access and a modest average accuracy gain even after substantial reduction in stored memory.
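Read this way, the eviction step admits a short sketch. Everything below is an assumed reading, not the authors' code: the farthest-point interpretation of "edge-free minmax sampling", the total-edge-weight centrality criterion for pruning, and the `keep_ratio` knob are all illustrative choices.

```python
import numpy as np

def compress_memory_graph(embeddings, adjacency, keep_ratio=0.3):
    """Sketch of connectivity-based eviction (assumed reading of the paper).

    embeddings: (n, d) array of node embeddings
    adjacency:  (n, n) symmetric edge-weight matrix (0 = no edge)
    keep_ratio: fraction of nodes retained (0.3 ~ 70% compression)
    """
    degree = adjacency.sum(axis=1)
    isolated = np.where(degree == 0)[0]
    connected = np.where(degree > 0)[0]

    k_iso = max(1, int(round(keep_ratio * len(isolated)))) if len(isolated) else 0
    k_con = max(1, int(round(keep_ratio * len(connected)))) if len(connected) else 0

    kept = []
    if k_iso:
        # Edge-free minmax (farthest-point) sampling: greedily keep the
        # isolated node farthest from everything already kept.
        pts = embeddings[isolated]
        chosen = [0]
        dists = np.linalg.norm(pts - pts[0], axis=1)
        for _ in range(k_iso - 1):
            nxt = int(dists.argmax())
            chosen.append(nxt)
            dists = np.minimum(dists, np.linalg.norm(pts - pts[nxt], axis=1))
        kept.extend(isolated[chosen])
    if k_con:
        # Edge-aware weight pruning: keep the highest total-edge-weight
        # (most central) connected nodes, evicting the rest.
        order = connected[np.argsort(degree[connected])[::-1]]
        kept.extend(order[:k_con])
    return np.sort(np.array(kept))
```

Farthest-point selection is a standard way to keep a diverse subset when no edge structure is available, which is one plausible reading of "minmax" here; the real criterion may differ.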

Core claim

StreamMeCo is an efficient Stream Agent Memory Compression framework that evicts redundant memory nodes based on the connectivity of the memory graph. Isolated nodes are removed via edge-free minmax sampling and connected nodes via edge-aware weight pruning, while a time-decay memory retrieval mechanism compensates for any resulting performance degradation. On the M3-Bench-robot, M3-Bench-web and Video-MME-Long datasets this approach sustains accuracy under 70 percent memory graph compression and produces a 1.87 times speedup in memory retrieval together with a 1.0 percent average accuracy improvement.

What carries the argument

Memory graph connectivity that identifies redundant nodes for eviction, implemented through edge-free minmax sampling on isolated nodes and edge-aware weight pruning on connected nodes, together with time-decay retrieval to preserve task performance.
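The time-decay retrieval that carries the compensation step can be sketched similarly. The cosine-similarity scoring and the exact decay form are assumptions: the material above only indicates an exponential decay with coefficient λ, compared against a linear variant.

```python
import numpy as np

def time_decay_retrieve(query, node_embs, node_times, now, lam=0.1, top_k=5):
    """Hedged sketch of time-decay memory retrieval: relevance is cosine
    similarity discounted exponentially by memory age. The exact scoring
    function in the paper may differ."""
    q = query / np.linalg.norm(query)
    m = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    sim = m @ q                        # cosine similarity per memory node
    age = now - np.asarray(node_times)
    score = sim * np.exp(-lam * age)   # older memories are down-weighted
    return np.argsort(score)[::-1][:top_k]
```

With this form, two equally similar memories are ranked by recency, which matches the intuition that a compressed store should favor fresher observations when evidence is otherwise tied.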

If this is right

  • Agents can maintain memory of much longer video streams without exhausting available storage or compute.
  • Memory retrieval operations complete faster, supporting lower-latency responses during streaming tasks.
  • Removal of certain nodes can reduce noise and produce small accuracy gains rather than losses.
  • The same compression pipeline works across robot navigation, web interaction, and extended video benchmarks.
  • Overall system costs for storage and repeated retrieval drop substantially for continuous video input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Graph connectivity rules for eviction could transfer to memory management in non-video agents that store sequences of observations or actions.
  • Making the compression ratio adjust automatically to video length or scene complexity would address cases where fixed 70 percent reduction is too aggressive.
  • Pairing the method with low-precision storage formats might multiply the memory savings beyond the reported compression level.
  • Experiments on videos with higher levels of irrelevant detail would reveal whether connectivity alone continues to separate signal from noise.
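The low-precision pairing suggested above is easy to quantify. The arithmetic below is purely illustrative and not from the paper: evicting 70 percent of nodes and then storing the survivors' embeddings in float16 compounds the two savings.

```python
import numpy as np

# Hypothetical sizes: 300 memory nodes with 768-dim float32 embeddings.
rng = np.random.default_rng(0)
embs = rng.standard_normal((300, 768)).astype(np.float32)

kept = embs[:90]                   # 70% of nodes evicted (illustrative)
half = kept.astype(np.float16)     # 2 bytes per value instead of 4

ratio = embs.nbytes / half.nbytes  # combined storage reduction
err = np.abs(half.astype(np.float32) - kept).max()
```

The combined factor here is about 6.7× (roughly 3.3× from eviction times 2× from precision), with worst-case float16 rounding error on the order of 1e-3 for unit-scale values; whether that precision loss affects retrieval quality is an empirical question.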

Load-bearing premise

The memory graph's connectivity accurately identifies which nodes are redundant without removing information that remains necessary for correct video understanding.

What would settle it

If a new long streaming video dataset shows that 70 percent compression under StreamMeCo produces a consistent accuracy drop larger than 2 percent even after applying the time-decay retrieval, the claim that graph connectivity safely guides eviction would be contradicted.

Figures

Figures reproduced from arXiv: 2604.09000 by Haowen Xu, Jiayi Zhu, Junxian Li, Junxi Wang, Linfeng Zhang, Te Sun, Xuming Hu, Zhiyu Li, Zichen Wen.

Figure 1
Figure 1. An example from M3-Bench-robot. (a) Composition of the memory graph. (b) Memory retrieval time as a function of the number of text nodes, showing that retrieval time becomes unacceptable as memory nodes accumulate. view at source ↗
Figure 2
Figure 2. The overview of M3-Agent. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. The overview of StreamMeCo. Efficient memory graph compression is achieved via the EM-sampling… [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Impact of the TMR Mechanism on M3-Bench-robot and M3-Bench-web at different compression ratios. view at source ↗
Figure 5
Figure 5. Time Efficiency Analysis on M3-Bench-Robot, M3-Bench-Web and Video-MME-Long. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Impact of cluster ratio a and balance coefficient b on M3-Bench-robot and M3-Bench-web. view at source ↗
Figure 7
Figure 7. Numbers of different types of nodes and their… [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Impact of memory decay coefficient λ on M3-Bench-robot. view at source ↗
Figure 9
Figure 9. Impact of the TMR Mechanism on Video-MME-Long at different compression ratios. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
read the original abstract

Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87× speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces StreamMeCo, a framework for compressing long-term agent memory graphs in streaming video understanding. It evicts nodes via edge-free minmax sampling on isolated nodes and edge-aware weight pruning on connected components (targeting 70% compression), while adding a time-decay retrieval mechanism to offset potential accuracy loss. On M3-Bench-robot, M3-Bench-web, and Video-MME-Long, it reports 1.87× memory retrieval speedup and 1.0% average accuracy gain.

Significance. If the graph-based eviction reliably preserves task-critical information and the accuracy lift is attributable to the method rather than the time-decay component alone, the work could enable scalable long-horizon video agents by cutting storage and retrieval costs. Code release at the cited GitHub repository is a clear strength for reproducibility and follow-up work.

major comments (2)
  1. [Method description (abstract and §3)] The central claim that connectivity-based eviction (minmax sampling on isolates and weight pruning on connected components) safely removes 70% of nodes without losing critical video-understanding information is load-bearing but unsupported. No analysis, ablation, or correlation study is provided showing that low-connectivity or low-weight nodes are informationally redundant rather than task-critical.
  2. [Experiments (abstract and §4)] Experimental evaluation: the reported 1.0% average accuracy improvement and 1.87× speedup lack any description of baselines, number of runs, error bars, statistical significance, or ablations that isolate the contribution of graph pruning versus the separately introduced time-decay retrieval. This prevents attribution of gains to the compression step and makes the counter-intuitive accuracy lift under aggressive compression difficult to evaluate.
minor comments (2)
  1. [Abstract] The abstract introduces terms such as 'edge-free minmax sampling' and 'edge-aware weight pruning' without a concise definition or high-level intuition, which would aid readers unfamiliar with the memory-graph construction.
  2. [Method] No mention of how the memory graph is initially constructed (node embeddings, edge weighting criteria) or any hyper-parameters controlling the 70% target compression rate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will make substantial revisions to strengthen the methodological justification and experimental reporting in the manuscript.

read point-by-point responses
  1. Referee: [Method description (abstract and §3)] The central claim that connectivity-based eviction (minmax sampling on isolates and weight pruning on connected components) safely removes 70% of nodes without losing critical video-understanding information is load-bearing but unsupported. No analysis, ablation, or correlation study is provided showing that low-connectivity or low-weight nodes are informationally redundant rather than task-critical.

    Authors: We agree that the current manuscript lacks explicit supporting analysis for the eviction criteria. In the revised version, we will expand Section 3 with a new correlation analysis between node connectivity/weight and task relevance (measured via removal impact on downstream accuracy), plus an ablation comparing eviction of low- versus high-connectivity nodes. This will provide direct evidence that the pruned nodes are redundant for video-understanding tasks. revision: yes

  2. Referee: [Experiments (abstract and §4)] Experimental evaluation: the reported 1.0% average accuracy improvement and 1.87× speedup lack any description of baselines, number of runs, error bars, statistical significance, or ablations that isolate the contribution of graph pruning versus the separately introduced time-decay retrieval. This prevents attribution of gains to the compression step and makes the counter-intuitive accuracy lift under aggressive compression difficult to evaluate.

    Authors: We acknowledge these gaps in experimental rigor. We will revise Section 4 to: (i) fully describe all baselines, (ii) report results over multiple runs (minimum 5 seeds) with means, standard deviations, error bars, and statistical significance tests, and (iii) add ablations that isolate graph pruning from time-decay retrieval (including a no-time-decay variant). These changes will allow clear attribution of the observed accuracy gain, which we hypothesize arises from noise reduction via removal of low-relevance nodes. revision: yes

Circularity Check

0 steps flagged

Empirical compression framework with no circular derivation

full rationale

The paper proposes StreamMeCo as an empirical method: it constructs a memory graph, applies connectivity-based eviction (edge-free minmax sampling on isolates and edge-aware weight pruning on connected components), and adds a time-decay retrieval mechanism. Performance is measured directly on three external benchmarks (M3-Bench-robot, M3-Bench-web, Video-MME-Long) under 70% compression, reporting 1.87× retrieval speedup and +1.0% average accuracy. No mathematical derivation, first-principles prediction, or fitted parameter is presented whose output is definitionally equivalent to its input. No self-citations are used to justify uniqueness, ansatz, or load-bearing premises. The claims rest on experimental outcomes rather than on any reduction to the method's own definitions or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim relies on the domain assumption of graph-structured memory and introduces a new framework without additional free parameters explicitly mentioned in the abstract.

axioms (1)
  • domain assumption Agent memory for video understanding can be effectively modeled as a graph where node connectivity indicates redundancy.
    This underpins the choice of edge-free minmax sampling for isolated nodes and edge-aware pruning for connected nodes.

pith-pipeline@v0.9.0 · 5499 in / 1188 out tokens · 50200 ms · 2026-05-10T18:15:48.339385+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors

    cs.CL 2026-04 unverdicted novelty 5.0

    A hybrid graph-based training-free framework for LLM context compression matches strong baselines and shows larger gains on long-document benchmarks.

Reference graph

Works this paper leans on

9 extracted references · 7 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418

    Videollm-online: Online video large language model for streaming video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418. Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang

  2. [2]

    Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025

    Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269. Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025a. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh...

  3. [3]

    In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2803–2813

    Learning musical representations for music performance question answering. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2803–2813. Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, and Jiang Gui. 2025c. Temporal working memory: Query-guided segment refinement for enhance...

  4. [4]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232. Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. 2025. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision...

  5. [5]

    Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu

    Spark: Strategic policy-aware exploration via dynamic branching for long-horizon agentic learning. arXiv preprint arXiv:2601.20209. Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu

  6. [6]

    arXiv preprint arXiv:2501.13468

    Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv preprint arXiv:2501.13468. Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, and 1 others. 2025a. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215. Ruyi Xu, Guangxuan Xiao, Yukang Chen...

  7. [7]

    Memgen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704, 2025

    Memgen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704. Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin

  8. [8]

    Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085. Jiaquan Zhang, Qigan Sun, Chaoning Zhang, Xudong Wang, Zhenzhen Huang, Yitian Zhou, Pengcheng Zheng, Chi lok Andy Tai, Sung-Ho Bae, Zeyu Ma, Caiyan Qin, Jinyu Guo, Yang Yang, and Hengtao Shen. 2026a. Tda-rc: Task-driven alignment for knowledge-b...

  9. [9]

    E Experiments on Other Graph-Based Memory Frameworks. Our method can be readily adapted to other graph-based Agent Memory frameworks

    Therefore, the information loss introduced by compression is theoretically minimal. E Experiments on Other Graph-Based Memory Frameworks. Our method can be readily adapted to other graph-based Agent Memory frameworks. Specifically, we adapt it to the Mem0 (graph) (Chhikara et al., 2025b) framework, whose memory paradigm includes entity types (e.g., pers...