pith. sign in

arxiv: 2607.00692 · v1 · pith:HFJIU373new · submitted 2026-07-01 · 💻 cs.AI

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

Pith reviewed 2026-07-02 12:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords self-governing contextLLM agentscontext pruninglong-horizon taskstoken efficiencycontext managementagent memoryrecoverable objects
0
0 comments X

The pith

Self-GC indexes agent context as recoverable objects and uses a planner to prune 43.95% of prefix tokens while keeping 84.85% of future continuations unaffected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-horizon LLM agents build up structured context from tools, files, plans and constraints that heuristics either discard blindly or summarize at the cost of exact details. Self-GC converts these elements into indexed objects, routes decisions through a side-channel planner that suggests fold, mask or prune actions, and applies a harness to keep sidecars recoverable and commits safe. On a 33-session hard benchmark it removes nearly 44% of early tokens with 85% of later turns continuing unchanged, beating heuristic baselines whose no-impact rates range from 55% to 70%. The same pattern holds on a 332-session production suite and yields 10-15% daytime token savings in live deployment. The shift is from reactive text cleanup to explicit lifecycle control over context objects.

Core claim

Self-GC turns user turns, tool spans, and skill state into indexed objects; asks a side-channel planner to propose fold, mask, and prune actions; and lets the harness enforce recoverable sidecars, safe commit boundaries, and cache-aware commit. On the 33-session Hard Set this prunes 43.95% of prefix tokens while leaving 84.85% of future continuations unaffected, outperforming heuristic baselines whose no-impact rates range from 54.55% to 69.70%. On the 332-session production-derived suite the three planner backbones reach no-impact rates of 91.27% to 94.58% while baselines stay at 77.71% to 87.46%. In production an online account-level split reduces daytime average input tokens by 10% to 15%

What carries the argument

Side-channel planner that proposes fold, mask, and prune actions on indexed context objects, with harness enforcement of recoverability and commit safety.

If this is right

  • Agents can run longer sessions before hitting context limits because prefix tokens are reduced without breaking most continuations.
  • Production deployments see consistent 10-15% lower average input token counts during daytime operation.
  • Planner-based decisions outperform chronological pruning and tool-output masking on both hard and production-derived session sets.
  • Context objects remain recoverable through sidecars, allowing later restoration if a prune proves too aggressive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same object-indexing approach could be applied to multi-agent systems where one agent's pruned context must stay legible to others.
  • If the planner backbone improves, the method might support even longer horizons by combining pruning with selective summarization of non-critical spans.
  • Production token savings suggest the technique could be inserted into existing agent runtimes with only modest changes to the harness layer.

Load-bearing premise

That an unaffected future continuation reliably shows all critical context information for task completion has been retained.

What would settle it

Run the 33-session hard set with human annotations of every essential locator, evidence item, and constraint; measure whether Self-GC prunes any annotated item at a materially higher rate than the baselines while still reporting the 84.85% no-impact figure.

Figures

Figures reproduced from arXiv: 2607.00692 by Chenpeng Cao, Hongjin Meng, Jiawei Zhu, Xin Yin, Xubin Hao.

Figure 1
Figure 1. Figure 1: Object-level context management on a shared agent [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Self-GC runtime framework. The context-engine hook exposes indexed context objects to a side [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Lifecycle actions over indexed objects. Fold moves [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Compression and preservation trade-off on the Hard Set and Production Suite. Each point reports mean pruning rate [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Online average input tokens for the Self-GC group and control group during daytime production traffic, May 25 to 30, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Planning, rehearsal, and commit sequence used by Self-GC. Planning runs in a side channel, while accepted plans are [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Context-object data model used by Self-GC. User-turn and tool-span objects carry stable identifiers, lifecycle state, [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
read the original abstract

Long-horizon LLM agents accumulate tool results, files, plans, and user constraints that are too structured to be treated as a disposable text suffix. Current systems mostly rely on in-run heuristics such as chronological pruning and tool-output masking, or on final self-summary near a context limit. Heuristics are cheap but blind to future dependencies; summaries preserve narrative state but often hide exact evidence, locators, and editable artifacts. We present Self-GC, where GC denotes self-governing context while deliberately echoing garbage collection: the system does not merely reclaim unused tokens, but governs the lifecycle of agent context objects. Self-GC turns user turns, tool spans, and skill state into indexed objects; asks a side-channel planner to propose fold, mask, and prune actions; and lets the harness enforce recoverable sidecars, safe commit boundaries, and cache-aware commit. On a 33-session Hard Set, Self-GC prunes 43.95% of prefix tokens while leaving 84.85% of future continuations unaffected, compared with no-impact rates of 54.55% to 69.70% for heuristic baselines. On a 332-session production-derived suite, three planner backbones reach no-impact rates of 91.27% to 94.58%, while baselines remain at 77.71% to 87.46%. In production, an online account-level split reduces daytime average input tokens by 10% to 15%, with peak reductions near 20%. These results point to context management as runtime lifecycle control over indexed, recoverable objects rather than post hoc text cleanup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Self-GC, a context management framework for long-horizon LLM agents that models user turns, tool spans, and skill state as indexed objects. A side-channel planner proposes fold, mask, and prune actions, which a harness enforces via recoverable sidecars, safe commit boundaries, and cache-aware commits. On a 33-session Hard Set the system prunes 43.95% of prefix tokens while leaving 84.85% of future continuations unaffected (versus 54.55–69.70% for heuristic baselines); on a 332-session production-derived suite three planner backbones reach 91.27–94.58% no-impact rates (baselines 77.71–87.46%); production deployment yields 10–15% average daytime token reduction.

Significance. If the no-impact metric reliably indicates preservation of task-critical information, the work offers a structured alternative to heuristic pruning or late-stage summarization and demonstrates measurable efficiency gains both in controlled suites and live production. The emphasis on recoverable objects and planner-driven lifecycle control is a concrete step beyond post-hoc text cleanup.

major comments (2)
  1. [§4] §4 (Evaluation): the definition and computation of the 'future continuations unaffected' (no-impact) metric are not provided. The manuscript must specify (a) the exact procedure for generating and comparing continuations, (b) how the 33 Hard Set sessions were selected, and (c) any controls or ablations for delayed dependencies or task-success correlation; without these the 84.85% figure cannot be assessed as evidence that critical context is preserved.
  2. [§3.2] §3.2 (Planner and Harness): the side-channel planner and recoverable-sidecar mechanisms are central to the claimed advantage over heuristics, yet no ablation isolates their contribution from the underlying LLM backbone or from the commit-boundary logic. A controlled removal of either component on the Hard Set would be required to substantiate that the reported deltas are attributable to the proposed architecture rather than to planner quality alone.
minor comments (2)
  1. [§1] The abstract and §1 use 'GC' to echo garbage collection; a brief sentence clarifying the analogy (and its limits) would aid readers unfamiliar with the metaphor.
  2. Table captions and axis labels for the production token-reduction results should explicitly state the time window and account-level split used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation metric and the need for component ablations. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and experiments.

read point-by-point responses
  1. Referee: [§4] §4 (Evaluation): the definition and computation of the 'future continuations unaffected' (no-impact) metric are not provided. The manuscript must specify (a) the exact procedure for generating and comparing continuations, (b) how the 33 Hard Set sessions were selected, and (c) any controls or ablations for delayed dependencies or task-success correlation; without these the 84.85% figure cannot be assessed as evidence that critical context is preserved.

    Authors: We agree that the no-impact metric requires a precise, reproducible definition to support the reported 84.85% figure. The manuscript currently states the outcome but omits the procedural details. In the revised §4 we will add: (a) the exact procedure, consisting of replaying each Hard Set session from the pruned context, generating the next agent continuation with the same backbone, and scoring it as unaffected if it matches the original continuation on task-critical elements (via both automated embedding similarity above a threshold and manual review for dependency preservation); (b) the selection process for the 33 sessions, drawn from production logs as the longest-horizon traces containing at least three cross-turn dependencies; (c) controls consisting of explicit checks that no pruned object is referenced in the subsequent 10 turns and a correlation analysis between no-impact rate and end-to-end task success. These additions will be included in the next version. revision: yes

  2. Referee: [§3.2] §3.2 (Planner and Harness): the side-channel planner and recoverable-sidecar mechanisms are central to the claimed advantage over heuristics, yet no ablation isolates their contribution from the underlying LLM backbone or from the commit-boundary logic. A controlled removal of either component on the Hard Set would be required to substantiate that the reported deltas are attributable to the proposed architecture rather than to planner quality alone.

    Authors: We acknowledge that the current results compare different planner backbones but do not isolate the side-channel planner or the recoverable-sidecar enforcement from the commit-boundary logic. In the revised manuscript we will add a controlled ablation on the Hard Set that (i) disables the planner entirely and falls back to the strongest heuristic baseline, and (ii) retains the planner but removes recoverable sidecars (forcing immediate non-recoverable commits). The resulting no-impact and token-reduction deltas will be reported to quantify the incremental contribution of each mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical system evaluation

full rationale

The manuscript describes an agent context-management system (Self-GC) and evaluates it via direct measurements on two fixed test collections (33-session Hard Set, 332-session production suite). Reported figures are pruning percentages and continuation-impact rates obtained by running the system and baselines on those sets; no equations, fitted parameters, or first-principles derivations are present that could reduce to their own inputs. Self-citations, if any, are not load-bearing for any claimed derivation. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper introduces new architectural components whose effectiveness is demonstrated empirically but whose design assumptions are not independently verified in the abstract.

axioms (1)
  • domain assumption It is possible to represent agent context elements as indexed objects that can be independently acted upon without breaking dependencies.
    This underpins the entire object-based management approach described.
invented entities (2)
  • side-channel planner no independent evidence
    purpose: Proposes fold, mask, and prune actions on context objects.
    A new component introduced to enable the governance of context lifecycle.
  • recoverable sidecars no independent evidence
    purpose: Enforce safe commit boundaries and cache-aware commits for context changes.
    Part of the system harness for safe management.

pith-pipeline@v0.9.1-grok · 5831 in / 1345 out tokens · 39783 ms · 2026-07-02T12:47:25.716780+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 31 canonical work pages · 15 internal anchors

  1. [1]

    Xu, Bo and Zhang, Danny and Arunachalam, Rohit , year =

  2. [2]

    Cassano, Federico and Rush, Sasha , year =

  3. [5]

    Ye, Rui and Zhang, Zhongwang and Li, Kuan and Yin, Huifeng and Tao, Zhengwei and Zhao, Yida and Su, Liangcai and Zhang, Liwen and Qiao, Zile and Wang, Xinyu and Xie, Pengjun and Huang, Fei and Zhou, Jingren and Chen, Siheng and Jiang, Yong , year =

  4. [7]

    and Xu, Ruifeng and Wong, Kam-Fai , year=

    Du, Yiming and Wang, Bingbing and He, Yang and Liang, Bin and Wang, Baojun and Li, Zhongyang and Gui, Lin and Pan, Jeff Z. and Xu, Ruifeng and Wong, Kam-Fai , year=. MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40...

  5. [8]

    ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory , volume=

    Tan, Cassandra Hui-Ming and Subagdja, Budhitama and Tan, Ah-Hwee , year=. ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i30.39773 , number=

  6. [9]

    MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents , volume=

    Dai, Renke and Hu, Hebin and Zhang, Jiahui and Kang, Yilin and Tan, Ah-Hwee , year=. MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i25.39205 , number=

  7. [10]

    (2026, February 13)

    Zhong, Wanjun and Guo, Lianghong and Gao, Qiqi and Ye, He and Wang, Yanlin , year=. MemoryBank: Enhancing Large Language Models with Long-Term Memory , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v38i17.29946 , number=

  8. [11]

    ExpeL: LLM Agents Are Experiential Learners , volume=

    Zhao, Andrew and Huang, Daniel and Xu, Quentin and Lin, Matthieu and Liu, Yong-Jin and Huang, Gao , year=. ExpeL: LLM Agents Are Experiential Learners , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v38i17.29936 , number=

  9. [12]

    AutoTool: Efficient Tool Selection for Large Language Model Agents , volume=

    Jia, Jingyi and Li, Qinbin , year=. AutoTool: Efficient Tool Selection for Large Language Model Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i37.40389 , number=

  10. [13]

    MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools , volume=

    Guo, Zikang and Xu, Benfeng and Zhu, Chiwei and Hong, Wentao and Wang, Xiaorui and Mao, Zhendong , year=. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i37.40347 , number=

  11. [14]

    SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning , volume=

    Liao, Huanxuan and Xu, Yixing and He, Shizhu and Li, Guanchen and Yin, Xuanwu and Li, Dong and Barsoum, Emad and Zhao, Jun and Liu, Kang , year=. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i38.40466 , number=

  12. [15]

    2024 , eprint=

    MemGPT: Towards LLMs as Operating Systems , author=. 2024 , eprint=

  13. [16]

    2023 , eprint=

    Augmenting Language Models with Long-Term Memory , author=. 2023 , eprint=

  14. [17]

    2025 , eprint=

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory , author=. 2025 , eprint=

  15. [18]

    2025 , eprint=

    A-MEM: Agentic Memory for LLM Agents , author=. 2025 , eprint=

  16. [19]

    2024 , eprint=

    HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models , author=. 2024 , eprint=

  17. [20]

    2026 , eprint=

    REMem: Reasoning with Episodic Memory in Language Agent , author=. 2026 , eprint=

  18. [21]

    2026 , eprint=

    TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents , author=. 2026 , eprint=

  19. [22]

    2020 , eprint=

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author=. 2020 , eprint=

  20. [23]

    2026 , eprint=

    SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents , author=. 2026 , eprint=

  21. [24]

    2026 , eprint=

    ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents , author=. 2026 , eprint=

  22. [25]

    2025 , eprint=

    ACON: Optimizing Context Compression for Long-horizon LLM Agents , author=. 2025 , eprint=

  23. [26]

    2023 , eprint=

    H _2 O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models , author=. 2023 , eprint=

  24. [27]

    2023 , eprint=

    Efficient Streaming Language Models with Attention Sinks , author=. 2023 , eprint=

  25. [28]

    2024 , eprint=

    SnapKV: LLM Knows What You are Looking for Before Generation , author=. 2024 , eprint=

  26. [29]

    2024 , eprint=

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling , author=. 2024 , eprint=

  27. [30]

    2023 , eprint=

    AgentBench: Evaluating LLMs as Agents , author=. 2023 , eprint=

  28. [31]

    2024 , eprint=

    WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. 2024 , eprint=

  29. [32]

    2023 , eprint=

    JudgeLM: Fine-tuned Large Language Models are Scalable Judges , author=. 2023 , eprint=

  30. [33]

    2023 , eprint=

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment , author=. 2023 , eprint=

  31. [34]

    Cai, Z.; Zhang, Y.; Gao, B.; Liu, Y.; Li, Y.; Liu, T.; Lu, K.; Xiong, W.; Dong, Y.; Hu, J.; and Xiao, W. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069

  32. [35]

    Cassano, F.; and Rush, S. 2026. Training Composer for Longer Horizons . Cursor research blog, accessed 2026-06-04

  33. [36]

    Chhikara, P.; Khant, D.; Aryan, S.; Singh, T.; and Yadav, D. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413

  34. [37]

    Chroma . 2026. Chroma Context-1: Training a Self-Editing Search Agent . Research blog, accessed 2026-06-04

  35. [38]

    Dai, R.; Hu, H.; Zhang, J.; Kang, Y.; and Tan, A.-H. 2026. MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25): 20676–20683

  36. [39]

    DeepSeek-AI . 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models . arXiv:2512.02556

  37. [40]

    Z.; Xu, R.; and Wong, K.-F

    Du, Y.; Wang, B.; He, Y.; Liang, B.; Wang, B.; Li, Z.; Gui, L.; Pan, J. Z.; Xu, R.; and Wong, K.-F. 2026. MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36): 30584–30592

  38. [41]

    Guo, Z.; Xu, B.; Zhu, C.; Hong, W.; Wang, X.; and Mao, Z. 2026. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37): 30888–30896

  39. [42]

    J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y

    Gutiérrez, B. J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y. 2024. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv:2405.14831

  40. [43]

    Jia, J.; and Li, Q. 2026. AutoTool: Efficient Tool Selection for Large Language Model Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37): 31265–31273

  41. [44]

    ACON: Optimizing Context Compression for Long-horizon LLM Agents

    Kang, M.; Chen, W.-N.; Han, D.; Inan, H. A.; Wutschitz, L.; Chen, Y.; Sim, R.; and Rajmohan, S. 2025. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv:2510.00615

  42. [45]

    Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; tau Yih, W.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401

  43. [46]

    Li, K.; Yu, X.; Ni, Z.; Zeng, Y.; Xu, Y.; Zhang, Z.; Li, X.; Sang, J.; Duan, X.; Wang, X.; Liu, C.; and Tan, J. 2026. TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents. arXiv:2601.02845

  44. [47]

    Li, Y.; Huang, Y.; Yang, B.; Venkitesh, B.; Locatelli, A.; Ye, H.; Cai, T.; Lewis, P.; and Chen, D. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. arXiv:2404.14469

  45. [48]

    Liao, H.; Xu, Y.; He, S.; Li, G.; Yin, X.; Li, D.; Barsoum, E.; Zhao, J.; and Liu, K. 2026. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38): 31961–31969

  46. [49]

    Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; Zhang, S.; Deng, X.; Zeng, A.; Du, Z.; Zhang, C.; Shen, S.; Zhang, T.; Su, Y.; Sun, H.; Huang, M.; Dong, Y.; and Tang, J. 2023 a . AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688

  47. [50]

    Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; and Zhu, C. 2023 b . G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634

  48. [51]

    Lumer, E.; Nizar, F.; Jangiti, A.; Frank, K.; Gulati, A.; Phadate, M.; and Subbiah, V. K. 2026. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks . arXiv:2601.06007

  49. [52]

    MemGPT: Towards LLMs as Operating Systems

    Packer, C.; Wooders, S.; Lin, K.; Fang, V.; Patil, S. G.; Stoica, I.; and Gonzalez, J. E. 2024. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560

  50. [53]

    Shi, Y.; Qian, Y.; Zhang, H.; Shen, B.; and Gu, X. 2025. LongCodeZip: Compress Long Context for Code Language Models . ASE 2025 Research Paper, arXiv:2510.00446

  51. [54]

    P.; Gao, X.; Gutiérrez, B

    Shu, Y.; Jonnalagedda, S. P.; Gao, X.; Gutiérrez, B. J.; Qi, W.; Das, K.; Sun, H.; and Su, Y. 2026. REMem: Reasoning with Episodic Memory in Language Agent. arXiv:2602.13530

  52. [55]

    H.-M.; Subagdja, B.; and Tan, A.-H

    Tan, C. H.-M.; Subagdja, B.; and Tan, A.-H. 2026. ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30): 25753–25760

  53. [56]

    Wang, W.; Dong, L.; Cheng, H.; Liu, X.; Yan, X.; Gao, J.; and Wei, F. 2023. Augmenting Language Models with Long-Term Memory. arXiv:2306.07174

  54. [57]

    Wang, Y.; Shi, Y.; Yang, M.; Zhang, R.; He, S.; Lian, H.; Chen, Y.; Ye, S.; Cai, K.; and Gu, X. 2026. SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents. arXiv:2601.16746

  55. [58]

    Wu, Y.; Zheng, Y.; Xu, T.; Zhang, Z.; Yu, Y.; Zhu, J.; Ma, C.; Lin, B.; Dong, B.; Zhu, H.; Huang, R.; and Yu, G. 2026. ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents. arXiv:2604.01664

  56. [59]

    Xiao, G.; Tian, Y.; Chen, B.; Han, S.; and Lewis, M. 2023. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453

  57. [60]

    Xiao, Y.-A.; Gao, P.; Peng, C.; and Xiong, Y. 2025. Reducing Cost of LLM Agents with Trajectory Reduction . FSE 2026 Research Paper, arXiv:2509.23586

  58. [61]

    Xu, B.; Zhang, D.; and Arunachalam, R. 2026. From Model to Agent: Equipping the Responses API with a Computer Environment . Engineering blog, accessed 2026-06-04

  59. [62]

    Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; and Zhang, Y. 2025. A-MEM: Agentic Memory for LLM Agents. arXiv:2502.12110

  60. [63]

    Ye, R.; Zhang, Z.; Li, K.; Yin, H.; Tao, Z.; Zhao, Y.; Su, L.; Zhang, L.; Qiao, Z.; Wang, X.; Xie, P.; Huang, F.; Zhou, J.; Chen, S.; and Jiang, Y. 2026. AgentFold: Long-Horizon Web Agents with Proactive Context Folding . ICLR 2026 Poster

  61. [64]

    Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Ré, C.; Barrett, C.; Wang, Z.; and Chen, B. 2023. H _2 O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv:2306.14048

  62. [65]

    Zhao, A.; Huang, D.; Xu, Q.; Lin, M.; Liu, Y.-J.; and Huang, G. 2024. ExpeL: LLM Agents Are Experiential Learners. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17): 19632–19642

  63. [66]

    Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024. MemoryBank: Enhancing Large Language Models with Long-Term Memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17): 19724–19731

  64. [67]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Zhou, S.; Xu, F. F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; Alon, U.; and Neubig, G. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854

  65. [68]

    Zhu, L.; Wang, X.; and Wang, X. 2023. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv:2310.17631