Self-GC: Self-Governing Context for Long-Horizon LLM Agents

Chenpeng Cao; Hongjin Meng; Jiawei Zhu; Xin Yin; Xubin Hao

arxiv: 2607.00692 · v1 · pith:HFJIU373new · submitted 2026-07-01 · 💻 cs.AI

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

Xubin Hao , Hongjin Meng , Xin Yin , Jiawei Zhu , Chenpeng Cao This is my paper

Pith reviewed 2026-07-02 12:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords self-governing contextLLM agentscontext pruninglong-horizon taskstoken efficiencycontext managementagent memoryrecoverable objects

0 comments

The pith

Self-GC indexes agent context as recoverable objects and uses a planner to prune 43.95% of prefix tokens while keeping 84.85% of future continuations unaffected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-horizon LLM agents build up structured context from tools, files, plans and constraints that heuristics either discard blindly or summarize at the cost of exact details. Self-GC converts these elements into indexed objects, routes decisions through a side-channel planner that suggests fold, mask or prune actions, and applies a harness to keep sidecars recoverable and commits safe. On a 33-session hard benchmark it removes nearly 44% of early tokens with 85% of later turns continuing unchanged, beating heuristic baselines whose no-impact rates range from 55% to 70%. The same pattern holds on a 332-session production suite and yields 10-15% daytime token savings in live deployment. The shift is from reactive text cleanup to explicit lifecycle control over context objects.

Core claim

Self-GC turns user turns, tool spans, and skill state into indexed objects; asks a side-channel planner to propose fold, mask, and prune actions; and lets the harness enforce recoverable sidecars, safe commit boundaries, and cache-aware commit. On the 33-session Hard Set this prunes 43.95% of prefix tokens while leaving 84.85% of future continuations unaffected, outperforming heuristic baselines whose no-impact rates range from 54.55% to 69.70%. On the 332-session production-derived suite the three planner backbones reach no-impact rates of 91.27% to 94.58% while baselines stay at 77.71% to 87.46%. In production an online account-level split reduces daytime average input tokens by 10% to 15%

What carries the argument

Side-channel planner that proposes fold, mask, and prune actions on indexed context objects, with harness enforcement of recoverability and commit safety.

If this is right

Agents can run longer sessions before hitting context limits because prefix tokens are reduced without breaking most continuations.
Production deployments see consistent 10-15% lower average input token counts during daytime operation.
Planner-based decisions outperform chronological pruning and tool-output masking on both hard and production-derived session sets.
Context objects remain recoverable through sidecars, allowing later restoration if a prune proves too aggressive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same object-indexing approach could be applied to multi-agent systems where one agent's pruned context must stay legible to others.
If the planner backbone improves, the method might support even longer horizons by combining pruning with selective summarization of non-critical spans.
Production token savings suggest the technique could be inserted into existing agent runtimes with only modest changes to the harness layer.

Load-bearing premise

That an unaffected future continuation reliably shows all critical context information for task completion has been retained.

What would settle it

Run the 33-session hard set with human annotations of every essential locator, evidence item, and constraint; measure whether Self-GC prunes any annotated item at a materially higher rate than the baselines while still reporting the 84.85% no-impact figure.

Figures

Figures reproduced from arXiv: 2607.00692 by Chenpeng Cao, Hongjin Meng, Jiawei Zhu, Xin Yin, Xubin Hao.

**Figure 2.** Figure 2: Overview of the Self-GC runtime framework. The context-engine hook exposes indexed context objects to a side [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Lifecycle actions over indexed objects. Fold moves [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Compression and preservation trade-off on the Hard Set and Production Suite. Each point reports mean pruning rate [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Online average input tokens for the Self-GC group and control group during daytime production traffic, May 25 to 30, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Planning, rehearsal, and commit sequence used by Self-GC. Planning runs in a side channel, while accepted plans are [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Context-object data model used by Self-GC. User-turn and tool-span objects carry stable identifiers, lifecycle state, [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

read the original abstract

Long-horizon LLM agents accumulate tool results, files, plans, and user constraints that are too structured to be treated as a disposable text suffix. Current systems mostly rely on in-run heuristics such as chronological pruning and tool-output masking, or on final self-summary near a context limit. Heuristics are cheap but blind to future dependencies; summaries preserve narrative state but often hide exact evidence, locators, and editable artifacts. We present Self-GC, where GC denotes self-governing context while deliberately echoing garbage collection: the system does not merely reclaim unused tokens, but governs the lifecycle of agent context objects. Self-GC turns user turns, tool spans, and skill state into indexed objects; asks a side-channel planner to propose fold, mask, and prune actions; and lets the harness enforce recoverable sidecars, safe commit boundaries, and cache-aware commit. On a 33-session Hard Set, Self-GC prunes 43.95% of prefix tokens while leaving 84.85% of future continuations unaffected, compared with no-impact rates of 54.55% to 69.70% for heuristic baselines. On a 332-session production-derived suite, three planner backbones reach no-impact rates of 91.27% to 94.58%, while baselines remain at 77.71% to 87.46%. In production, an online account-level split reduces daytime average input tokens by 10% to 15%, with peak reductions near 20%. These results point to context management as runtime lifecycle control over indexed, recoverable objects rather than post hoc text cleanup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Self-GC gives a concrete object-based system for pruning agent context with reported gains over heuristics, but the key metric needs more transparent validation.

read the letter

The main takeaway is that Self-GC models context as indexed objects, routes pruning decisions through a side-channel planner, and adds recoverable sidecars plus commit rules. This is a step past simple chronological cuts or end-of-run summaries.

What the paper actually does is turn user turns, tool outputs, and state into manageable objects and then measure how much prefix can be dropped while future model continuations stay the same. On the 33-session hard set it reports 43.95% token reduction at 84.85% no-impact, beating the heuristic baselines. The larger 332-session suite and the production split showing 10-15% daytime token cuts are the most practical numbers.

The evaluation is empirical and the deltas are presented clearly against stated baselines. That gives readers something concrete to test against their own agent runs.

The soft spot is the no-impact metric itself. The abstract does not spell out how continuations are compared, whether the metric correlates with actual task completion, or how the 33 sessions were sampled. If delayed dependencies or rare but critical facts are the real failure mode, the 84.85% figure could overstate safety. The hard set is also small enough that one or two edge cases could move the numbers.

This paper is for engineers and researchers who already run long-horizon agents and need a more structured alternative to ad-hoc pruning. It is worth sending to peer review so the metric definition, session selection, and any ablation on dependency types can be checked directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Self-GC, a context management framework for long-horizon LLM agents that models user turns, tool spans, and skill state as indexed objects. A side-channel planner proposes fold, mask, and prune actions, which a harness enforces via recoverable sidecars, safe commit boundaries, and cache-aware commits. On a 33-session Hard Set the system prunes 43.95% of prefix tokens while leaving 84.85% of future continuations unaffected (versus 54.55–69.70% for heuristic baselines); on a 332-session production-derived suite three planner backbones reach 91.27–94.58% no-impact rates (baselines 77.71–87.46%); production deployment yields 10–15% average daytime token reduction.

Significance. If the no-impact metric reliably indicates preservation of task-critical information, the work offers a structured alternative to heuristic pruning or late-stage summarization and demonstrates measurable efficiency gains both in controlled suites and live production. The emphasis on recoverable objects and planner-driven lifecycle control is a concrete step beyond post-hoc text cleanup.

major comments (2)

[§4] §4 (Evaluation): the definition and computation of the 'future continuations unaffected' (no-impact) metric are not provided. The manuscript must specify (a) the exact procedure for generating and comparing continuations, (b) how the 33 Hard Set sessions were selected, and (c) any controls or ablations for delayed dependencies or task-success correlation; without these the 84.85% figure cannot be assessed as evidence that critical context is preserved.
[§3.2] §3.2 (Planner and Harness): the side-channel planner and recoverable-sidecar mechanisms are central to the claimed advantage over heuristics, yet no ablation isolates their contribution from the underlying LLM backbone or from the commit-boundary logic. A controlled removal of either component on the Hard Set would be required to substantiate that the reported deltas are attributable to the proposed architecture rather than to planner quality alone.

minor comments (2)

[§1] The abstract and §1 use 'GC' to echo garbage collection; a brief sentence clarifying the analogy (and its limits) would aid readers unfamiliar with the metaphor.
Table captions and axis labels for the production token-reduction results should explicitly state the time window and account-level split used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation metric and the need for component ablations. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and experiments.

read point-by-point responses

Referee: [§4] §4 (Evaluation): the definition and computation of the 'future continuations unaffected' (no-impact) metric are not provided. The manuscript must specify (a) the exact procedure for generating and comparing continuations, (b) how the 33 Hard Set sessions were selected, and (c) any controls or ablations for delayed dependencies or task-success correlation; without these the 84.85% figure cannot be assessed as evidence that critical context is preserved.

Authors: We agree that the no-impact metric requires a precise, reproducible definition to support the reported 84.85% figure. The manuscript currently states the outcome but omits the procedural details. In the revised §4 we will add: (a) the exact procedure, consisting of replaying each Hard Set session from the pruned context, generating the next agent continuation with the same backbone, and scoring it as unaffected if it matches the original continuation on task-critical elements (via both automated embedding similarity above a threshold and manual review for dependency preservation); (b) the selection process for the 33 sessions, drawn from production logs as the longest-horizon traces containing at least three cross-turn dependencies; (c) controls consisting of explicit checks that no pruned object is referenced in the subsequent 10 turns and a correlation analysis between no-impact rate and end-to-end task success. These additions will be included in the next version. revision: yes
Referee: [§3.2] §3.2 (Planner and Harness): the side-channel planner and recoverable-sidecar mechanisms are central to the claimed advantage over heuristics, yet no ablation isolates their contribution from the underlying LLM backbone or from the commit-boundary logic. A controlled removal of either component on the Hard Set would be required to substantiate that the reported deltas are attributable to the proposed architecture rather than to planner quality alone.

Authors: We acknowledge that the current results compare different planner backbones but do not isolate the side-channel planner or the recoverable-sidecar enforcement from the commit-boundary logic. In the revised manuscript we will add a controlled ablation on the Hard Set that (i) disables the planner entirely and falls back to the strongest heuristic baseline, and (ii) retains the planner but removes recoverable sidecars (forcing immediate non-recoverable commits). The resulting no-impact and token-reduction deltas will be reported to quantify the incremental contribution of each mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical system evaluation

full rationale

The manuscript describes an agent context-management system (Self-GC) and evaluates it via direct measurements on two fixed test collections (33-session Hard Set, 332-session production suite). Reported figures are pruning percentages and continuation-impact rates obtained by running the system and baselines on those sets; no equations, fitted parameters, or first-principles derivations are present that could reduce to their own inputs. Self-citations, if any, are not load-bearing for any claimed derivation. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper introduces new architectural components whose effectiveness is demonstrated empirically but whose design assumptions are not independently verified in the abstract.

axioms (1)

domain assumption It is possible to represent agent context elements as indexed objects that can be independently acted upon without breaking dependencies.
This underpins the entire object-based management approach described.

invented entities (2)

side-channel planner no independent evidence
purpose: Proposes fold, mask, and prune actions on context objects.
A new component introduced to enable the governance of context lifecycle.
recoverable sidecars no independent evidence
purpose: Enforce safe commit boundaries and cache-aware commits for context changes.
Part of the system harness for safe management.

pith-pipeline@v0.9.1-grok · 5831 in / 1345 out tokens · 39783 ms · 2026-07-02T12:47:25.716780+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 31 canonical work pages · 15 internal anchors

[1]

Xu, Bo and Zhang, Danny and Arunachalam, Rohit , year =
[2]

Cassano, Federico and Rush, Sasha , year =
[5]

Ye, Rui and Zhang, Zhongwang and Li, Kuan and Yin, Huifeng and Tao, Zhengwei and Zhao, Yida and Su, Liangcai and Zhang, Liwen and Qiao, Zile and Wang, Xinyu and Xie, Pengjun and Huang, Fei and Zhou, Jingren and Chen, Siheng and Jiang, Yong , year =
[7]

and Xu, Ruifeng and Wong, Kam-Fai , year=

Du, Yiming and Wang, Bingbing and He, Yang and Liang, Bin and Wang, Baojun and Li, Zhongyang and Gui, Lin and Pan, Jeff Z. and Xu, Ruifeng and Wong, Kam-Fai , year=. MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40...

work page doi:10.1609/aaai.v40i36.40313
[8]

ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory , volume=

Tan, Cassandra Hui-Ming and Subagdja, Budhitama and Tan, Ah-Hwee , year=. ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i30.39773 , number=

work page doi:10.1609/aaai.v40i30.39773
[9]

MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents , volume=

Dai, Renke and Hu, Hebin and Zhang, Jiahui and Kang, Yilin and Tan, Ah-Hwee , year=. MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i25.39205 , number=

work page doi:10.1609/aaai.v40i25.39205
[10]

(2026, February 13)

Zhong, Wanjun and Guo, Lianghong and Gao, Qiqi and Ye, He and Wang, Yanlin , year=. MemoryBank: Enhancing Large Language Models with Long-Term Memory , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v38i17.29946 , number=

work page doi:10.1609/aaai.v38i17.29946
[11]

ExpeL: LLM Agents Are Experiential Learners , volume=

Zhao, Andrew and Huang, Daniel and Xu, Quentin and Lin, Matthieu and Liu, Yong-Jin and Huang, Gao , year=. ExpeL: LLM Agents Are Experiential Learners , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v38i17.29936 , number=

work page doi:10.1609/aaai.v38i17.29936
[12]

AutoTool: Efficient Tool Selection for Large Language Model Agents , volume=

Jia, Jingyi and Li, Qinbin , year=. AutoTool: Efficient Tool Selection for Large Language Model Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i37.40389 , number=

work page doi:10.1609/aaai.v40i37.40389
[13]

MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools , volume=

Guo, Zikang and Xu, Benfeng and Zhu, Chiwei and Hong, Wentao and Wang, Xiaorui and Mao, Zhendong , year=. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i37.40347 , number=

work page doi:10.1609/aaai.v40i37.40347
[14]

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning , volume=

Liao, Huanxuan and Xu, Yixing and He, Shizhu and Li, Guanchen and Yin, Xuanwu and Li, Dong and Barsoum, Emad and Zhao, Jun and Liu, Kang , year=. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i38.40466 , number=

work page doi:10.1609/aaai.v40i38.40466
[15]

2024 , eprint=

MemGPT: Towards LLMs as Operating Systems , author=. 2024 , eprint=

2024
[16]

2023 , eprint=

Augmenting Language Models with Long-Term Memory , author=. 2023 , eprint=

2023
[17]

2025 , eprint=

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory , author=. 2025 , eprint=

2025
[18]

2025 , eprint=

A-MEM: Agentic Memory for LLM Agents , author=. 2025 , eprint=

2025
[19]

2024 , eprint=

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models , author=. 2024 , eprint=

2024
[20]

2026 , eprint=

REMem: Reasoning with Episodic Memory in Language Agent , author=. 2026 , eprint=

2026
[21]

2026 , eprint=

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents , author=. 2026 , eprint=

2026
[22]

2020 , eprint=

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author=. 2020 , eprint=

2020
[23]

2026 , eprint=

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents , author=. 2026 , eprint=

2026
[24]

2026 , eprint=

ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents , author=. 2026 , eprint=

2026
[25]

2025 , eprint=

ACON: Optimizing Context Compression for Long-horizon LLM Agents , author=. 2025 , eprint=

2025
[26]

2023 , eprint=

H _2 O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models , author=. 2023 , eprint=

2023
[27]

2023 , eprint=

Efficient Streaming Language Models with Attention Sinks , author=. 2023 , eprint=

2023
[28]

2024 , eprint=

SnapKV: LLM Knows What You are Looking for Before Generation , author=. 2024 , eprint=

2024
[29]

2024 , eprint=

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling , author=. 2024 , eprint=

2024
[30]

2023 , eprint=

AgentBench: Evaluating LLMs as Agents , author=. 2023 , eprint=

2023
[31]

2024 , eprint=

WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. 2024 , eprint=

2024
[32]

2023 , eprint=

JudgeLM: Fine-tuned Large Language Models are Scalable Judges , author=. 2023 , eprint=

2023
[33]

2023 , eprint=

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment , author=. 2023 , eprint=

2023
[34]

Cai, Z.; Zhang, Y.; Gao, B.; Liu, Y.; Li, Y.; Liu, T.; Lu, K.; Xiong, W.; Dong, Y.; Hu, J.; and Xiao, W. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Cassano, F.; and Rush, S. 2026. Training Composer for Longer Horizons . Cursor research blog, accessed 2026-06-04

2026
[36]

Chhikara, P.; Khant, D.; Aryan, S.; Singh, T.; and Yadav, D. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Chroma . 2026. Chroma Context-1: Training a Self-Editing Search Agent . Research blog, accessed 2026-06-04

2026
[38]

Dai, R.; Hu, H.; Zhang, J.; Kang, Y.; and Tan, A.-H. 2026. MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25): 20676–20683

2026
[39]

DeepSeek-AI . 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models . arXiv:2512.02556

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Z.; Xu, R.; and Wong, K.-F

Du, Y.; Wang, B.; He, Y.; Liang, B.; Wang, B.; Li, Z.; Gui, L.; Pan, J. Z.; Xu, R.; and Wong, K.-F. 2026. MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36): 30584–30592

2026
[41]

Guo, Z.; Xu, B.; Zhu, C.; Hong, W.; Wang, X.; and Mao, Z. 2026. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37): 30888–30896

2026
[42]

J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y

Gutiérrez, B. J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y. 2024. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv:2405.14831

work page arXiv 2024
[43]

Jia, J.; and Li, Q. 2026. AutoTool: Efficient Tool Selection for Large Language Model Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37): 31265–31273

2026
[44]

ACON: Optimizing Context Compression for Long-horizon LLM Agents

Kang, M.; Chen, W.-N.; Han, D.; Inan, H. A.; Wutschitz, L.; Chen, Y.; Sim, R.; and Rajmohan, S. 2025. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv:2510.00615

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; tau Yih, W.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401

work page internal anchor Pith review Pith/arXiv arXiv 2020
[46]

Li, K.; Yu, X.; Ni, Z.; Zeng, Y.; Xu, Y.; Zhang, Z.; Li, X.; Sang, J.; Duan, X.; Wang, X.; Liu, C.; and Tan, J. 2026. TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents. arXiv:2601.02845

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

Li, Y.; Huang, Y.; Yang, B.; Venkitesh, B.; Locatelli, A.; Ye, H.; Cai, T.; Lewis, P.; and Chen, D. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. arXiv:2404.14469

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Liao, H.; Xu, Y.; He, S.; Li, G.; Yin, X.; Li, D.; Barsoum, E.; Zhao, J.; and Liu, K. 2026. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38): 31961–31969

2026
[49]

Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; Zhang, S.; Deng, X.; Zeng, A.; Du, Z.; Zhang, C.; Shen, S.; Zhang, T.; Su, Y.; Sun, H.; Huang, M.; Dong, Y.; and Tang, J. 2023 a . AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; and Zhu, C. 2023 b . G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Lumer, E.; Nizar, F.; Jangiti, A.; Frank, K.; Gulati, A.; Phadate, M.; and Subbiah, V. K. 2026. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks . arXiv:2601.06007

work page arXiv 2026
[52]

MemGPT: Towards LLMs as Operating Systems

Packer, C.; Wooders, S.; Lin, K.; Fang, V.; Patil, S. G.; Stoica, I.; and Gonzalez, J. E. 2024. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Shi, Y.; Qian, Y.; Zhang, H.; Shen, B.; and Gu, X. 2025. LongCodeZip: Compress Long Context for Code Language Models . ASE 2025 Research Paper, arXiv:2510.00446

work page arXiv 2025
[54]

P.; Gao, X.; Gutiérrez, B

Shu, Y.; Jonnalagedda, S. P.; Gao, X.; Gutiérrez, B. J.; Qi, W.; Das, K.; Sun, H.; and Su, Y. 2026. REMem: Reasoning with Episodic Memory in Language Agent. arXiv:2602.13530

work page arXiv 2026
[55]

H.-M.; Subagdja, B.; and Tan, A.-H

Tan, C. H.-M.; Subagdja, B.; and Tan, A.-H. 2026. ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30): 25753–25760

2026
[56]

Wang, W.; Dong, L.; Cheng, H.; Liu, X.; Yan, X.; Gao, J.; and Wei, F. 2023. Augmenting Language Models with Long-Term Memory. arXiv:2306.07174

work page arXiv 2023
[57]

Wang, Y.; Shi, Y.; Yang, M.; Zhang, R.; He, S.; Lian, H.; Chen, Y.; Ye, S.; Cai, K.; and Gu, X. 2026. SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents. arXiv:2601.16746

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

Wu, Y.; Zheng, Y.; Xu, T.; Zhang, Z.; Yu, Y.; Zhu, J.; Ma, C.; Lin, B.; Dong, B.; Zhu, H.; Huang, R.; and Yu, G. 2026. ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents. arXiv:2604.01664

work page arXiv 2026
[59]

Xiao, G.; Tian, Y.; Chen, B.; Han, S.; and Lewis, M. 2023. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

Xiao, Y.-A.; Gao, P.; Peng, C.; and Xiong, Y. 2025. Reducing Cost of LLM Agents with Trajectory Reduction . FSE 2026 Research Paper, arXiv:2509.23586

work page arXiv 2025
[61]

Xu, B.; Zhang, D.; and Arunachalam, R. 2026. From Model to Agent: Equipping the Responses API with a Computer Environment . Engineering blog, accessed 2026-06-04

2026
[62]

Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; and Zhang, Y. 2025. A-MEM: Agentic Memory for LLM Agents. arXiv:2502.12110

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Ye, R.; Zhang, Z.; Li, K.; Yin, H.; Tao, Z.; Zhao, Y.; Su, L.; Zhang, L.; Qiao, Z.; Wang, X.; Xie, P.; Huang, F.; Zhou, J.; Chen, S.; and Jiang, Y. 2026. AgentFold: Long-Horizon Web Agents with Proactive Context Folding . ICLR 2026 Poster

2026
[64]

Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Ré, C.; Barrett, C.; Wang, Z.; and Chen, B. 2023. H _2 O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv:2306.14048

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Zhao, A.; Huang, D.; Xu, Q.; Lin, M.; Liu, Y.-J.; and Huang, G. 2024. ExpeL: LLM Agents Are Experiential Learners. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17): 19632–19642

2024
[66]

Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024. MemoryBank: Enhancing Large Language Models with Long-Term Memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17): 19724–19731

2024
[67]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S.; Xu, F. F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; Alon, U.; and Neubig, G. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854

work page internal anchor Pith review Pith/arXiv arXiv 2024
[68]

Zhu, L.; Wang, X.; and Wang, X. 2023. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv:2310.17631

work page arXiv 2023

[1] [1]

Xu, Bo and Zhang, Danny and Arunachalam, Rohit , year =

[2] [2]

Cassano, Federico and Rush, Sasha , year =

[3] [5]

Ye, Rui and Zhang, Zhongwang and Li, Kuan and Yin, Huifeng and Tao, Zhengwei and Zhao, Yida and Su, Liangcai and Zhang, Liwen and Qiao, Zile and Wang, Xinyu and Xie, Pengjun and Huang, Fei and Zhou, Jingren and Chen, Siheng and Jiang, Yong , year =

[4] [7]

and Xu, Ruifeng and Wong, Kam-Fai , year=

Du, Yiming and Wang, Bingbing and He, Yang and Liang, Bin and Wang, Baojun and Li, Zhongyang and Gui, Lin and Pan, Jeff Z. and Xu, Ruifeng and Wong, Kam-Fai , year=. MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40...

work page doi:10.1609/aaai.v40i36.40313

[5] [8]

ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory , volume=

Tan, Cassandra Hui-Ming and Subagdja, Budhitama and Tan, Ah-Hwee , year=. ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i30.39773 , number=

work page doi:10.1609/aaai.v40i30.39773

[6] [9]

MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents , volume=

Dai, Renke and Hu, Hebin and Zhang, Jiahui and Kang, Yilin and Tan, Ah-Hwee , year=. MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i25.39205 , number=

work page doi:10.1609/aaai.v40i25.39205

[7] [10]

(2026, February 13)

Zhong, Wanjun and Guo, Lianghong and Gao, Qiqi and Ye, He and Wang, Yanlin , year=. MemoryBank: Enhancing Large Language Models with Long-Term Memory , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v38i17.29946 , number=

work page doi:10.1609/aaai.v38i17.29946

[8] [11]

ExpeL: LLM Agents Are Experiential Learners , volume=

Zhao, Andrew and Huang, Daniel and Xu, Quentin and Lin, Matthieu and Liu, Yong-Jin and Huang, Gao , year=. ExpeL: LLM Agents Are Experiential Learners , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v38i17.29936 , number=

work page doi:10.1609/aaai.v38i17.29936

[9] [12]

AutoTool: Efficient Tool Selection for Large Language Model Agents , volume=

Jia, Jingyi and Li, Qinbin , year=. AutoTool: Efficient Tool Selection for Large Language Model Agents , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i37.40389 , number=

work page doi:10.1609/aaai.v40i37.40389

[10] [13]

MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools , volume=

Guo, Zikang and Xu, Benfeng and Zhu, Chiwei and Hong, Wentao and Wang, Xiaorui and Mao, Zhendong , year=. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i37.40347 , number=

work page doi:10.1609/aaai.v40i37.40347

[11] [14]

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning , volume=

Liao, Huanxuan and Xu, Yixing and He, Shizhu and Li, Guanchen and Yin, Xuanwu and Li, Dong and Barsoum, Emad and Zhao, Jun and Liu, Kang , year=. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , publisher=. doi:10.1609/aaai.v40i38.40466 , number=

work page doi:10.1609/aaai.v40i38.40466

[12] [15]

2024 , eprint=

MemGPT: Towards LLMs as Operating Systems , author=. 2024 , eprint=

2024

[13] [16]

2023 , eprint=

Augmenting Language Models with Long-Term Memory , author=. 2023 , eprint=

2023

[14] [17]

2025 , eprint=

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory , author=. 2025 , eprint=

2025

[15] [18]

2025 , eprint=

A-MEM: Agentic Memory for LLM Agents , author=. 2025 , eprint=

2025

[16] [19]

2024 , eprint=

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models , author=. 2024 , eprint=

2024

[17] [20]

2026 , eprint=

REMem: Reasoning with Episodic Memory in Language Agent , author=. 2026 , eprint=

2026

[18] [21]

2026 , eprint=

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents , author=. 2026 , eprint=

2026

[19] [22]

2020 , eprint=

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author=. 2020 , eprint=

2020

[20] [23]

2026 , eprint=

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents , author=. 2026 , eprint=

2026

[21] [24]

2026 , eprint=

ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents , author=. 2026 , eprint=

2026

[22] [25]

2025 , eprint=

ACON: Optimizing Context Compression for Long-horizon LLM Agents , author=. 2025 , eprint=

2025

[23] [26]

2023 , eprint=

H _2 O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models , author=. 2023 , eprint=

2023

[24] [27]

2023 , eprint=

Efficient Streaming Language Models with Attention Sinks , author=. 2023 , eprint=

2023

[25] [28]

2024 , eprint=

SnapKV: LLM Knows What You are Looking for Before Generation , author=. 2024 , eprint=

2024

[26] [29]

2024 , eprint=

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling , author=. 2024 , eprint=

2024

[27] [30]

2023 , eprint=

AgentBench: Evaluating LLMs as Agents , author=. 2023 , eprint=

2023

[28] [31]

2024 , eprint=

WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. 2024 , eprint=

2024

[29] [32]

2023 , eprint=

JudgeLM: Fine-tuned Large Language Models are Scalable Judges , author=. 2023 , eprint=

2023

[30] [33]

2023 , eprint=

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment , author=. 2023 , eprint=

2023

[31] [34]

Cai, Z.; Zhang, Y.; Gao, B.; Liu, Y.; Li, Y.; Liu, T.; Lu, K.; Xiong, W.; Dong, Y.; Hu, J.; and Xiao, W. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [35]

Cassano, F.; and Rush, S. 2026. Training Composer for Longer Horizons . Cursor research blog, accessed 2026-06-04

2026

[33] [36]

Chhikara, P.; Khant, D.; Aryan, S.; Singh, T.; and Yadav, D. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [37]

Chroma . 2026. Chroma Context-1: Training a Self-Editing Search Agent . Research blog, accessed 2026-06-04

2026

[35] [38]

Dai, R.; Hu, H.; Zhang, J.; Kang, Y.; and Tan, A.-H. 2026. MemoryART: Enhancing LLMs via Multi-Memory Models with Adaptive Resonance Theory for Healthcare Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(25): 20676–20683

2026

[36] [39]

DeepSeek-AI . 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models . arXiv:2512.02556

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [40]

Z.; Xu, R.; and Wong, K.-F

Du, Y.; Wang, B.; He, Y.; Liang, B.; Wang, B.; Li, Z.; Gui, L.; Pan, J. Z.; Xu, R.; and Wong, K.-F. 2026. MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36): 30584–30592

2026

[38] [41]

Guo, Z.; Xu, B.; Zhu, C.; Hong, W.; Wang, X.; and Mao, Z. 2026. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37): 30888–30896

2026

[39] [42]

J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y

Gutiérrez, B. J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y. 2024. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv:2405.14831

work page arXiv 2024

[40] [43]

Jia, J.; and Li, Q. 2026. AutoTool: Efficient Tool Selection for Large Language Model Agents. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37): 31265–31273

2026

[41] [44]

ACON: Optimizing Context Compression for Long-horizon LLM Agents

Kang, M.; Chen, W.-N.; Han, D.; Inan, H. A.; Wutschitz, L.; Chen, Y.; Sim, R.; and Rajmohan, S. 2025. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv:2510.00615

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [45]

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; tau Yih, W.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401

work page internal anchor Pith review Pith/arXiv arXiv 2020

[43] [46]

Li, K.; Yu, X.; Ni, Z.; Zeng, Y.; Xu, Y.; Zhang, Z.; Li, X.; Sang, J.; Duan, X.; Wang, X.; Liu, C.; and Tan, J. 2026. TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents. arXiv:2601.02845

work page internal anchor Pith review Pith/arXiv arXiv 2026

[44] [47]

Li, Y.; Huang, Y.; Yang, B.; Venkitesh, B.; Locatelli, A.; Ye, H.; Cai, T.; Lewis, P.; and Chen, D. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. arXiv:2404.14469

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [48]

Liao, H.; Xu, Y.; He, S.; Li, G.; Yin, X.; Li, D.; Barsoum, E.; Zhao, J.; and Liu, K. 2026. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38): 31961–31969

2026

[46] [49]

Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; Zhang, S.; Deng, X.; Zeng, A.; Du, Z.; Zhang, C.; Shen, S.; Zhang, T.; Su, Y.; Sun, H.; Huang, M.; Dong, Y.; and Tang, J. 2023 a . AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [50]

Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; and Zhu, C. 2023 b . G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv:2303.16634

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [51]

Lumer, E.; Nizar, F.; Jangiti, A.; Frank, K.; Gulati, A.; Phadate, M.; and Subbiah, V. K. 2026. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks . arXiv:2601.06007

work page arXiv 2026

[49] [52]

MemGPT: Towards LLMs as Operating Systems

Packer, C.; Wooders, S.; Lin, K.; Fang, V.; Patil, S. G.; Stoica, I.; and Gonzalez, J. E. 2024. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [53]

Shi, Y.; Qian, Y.; Zhang, H.; Shen, B.; and Gu, X. 2025. LongCodeZip: Compress Long Context for Code Language Models . ASE 2025 Research Paper, arXiv:2510.00446

work page arXiv 2025

[51] [54]

P.; Gao, X.; Gutiérrez, B

Shu, Y.; Jonnalagedda, S. P.; Gao, X.; Gutiérrez, B. J.; Qi, W.; Das, K.; Sun, H.; and Su, Y. 2026. REMem: Reasoning with Episodic Memory in Language Agent. arXiv:2602.13530

work page arXiv 2026

[52] [55]

H.-M.; Subagdja, B.; and Tan, A.-H

Tan, C. H.-M.; Subagdja, B.; and Tan, A.-H. 2026. ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory. Proceedings of the AAAI Conference on Artificial Intelligence, 40(30): 25753–25760

2026

[53] [56]

Wang, W.; Dong, L.; Cheng, H.; Liu, X.; Yan, X.; Gao, J.; and Wei, F. 2023. Augmenting Language Models with Long-Term Memory. arXiv:2306.07174

work page arXiv 2023

[54] [57]

Wang, Y.; Shi, Y.; Yang, M.; Zhang, R.; He, S.; Lian, H.; Chen, Y.; Ye, S.; Cai, K.; and Gu, X. 2026. SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents. arXiv:2601.16746

work page internal anchor Pith review Pith/arXiv arXiv 2026

[55] [58]

Wu, Y.; Zheng, Y.; Xu, T.; Zhang, Z.; Yu, Y.; Zhu, J.; Ma, C.; Lin, B.; Dong, B.; Zhu, H.; Huang, R.; and Yu, G. 2026. ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents. arXiv:2604.01664

work page arXiv 2026

[56] [59]

Xiao, G.; Tian, Y.; Chen, B.; Han, S.; and Lewis, M. 2023. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [60]

Xiao, Y.-A.; Gao, P.; Peng, C.; and Xiong, Y. 2025. Reducing Cost of LLM Agents with Trajectory Reduction . FSE 2026 Research Paper, arXiv:2509.23586

work page arXiv 2025

[58] [61]

Xu, B.; Zhang, D.; and Arunachalam, R. 2026. From Model to Agent: Equipping the Responses API with a Computer Environment . Engineering blog, accessed 2026-06-04

2026

[59] [62]

Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; and Zhang, Y. 2025. A-MEM: Agentic Memory for LLM Agents. arXiv:2502.12110

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [63]

Ye, R.; Zhang, Z.; Li, K.; Yin, H.; Tao, Z.; Zhao, Y.; Su, L.; Zhang, L.; Qiao, Z.; Wang, X.; Xie, P.; Huang, F.; Zhou, J.; Chen, S.; and Jiang, Y. 2026. AgentFold: Long-Horizon Web Agents with Proactive Context Folding . ICLR 2026 Poster

2026

[61] [64]

Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Ré, C.; Barrett, C.; Wang, Z.; and Chen, B. 2023. H _2 O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv:2306.14048

work page internal anchor Pith review Pith/arXiv arXiv 2023

[62] [65]

Zhao, A.; Huang, D.; Xu, Q.; Lin, M.; Liu, Y.-J.; and Huang, G. 2024. ExpeL: LLM Agents Are Experiential Learners. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17): 19632–19642

2024

[63] [66]

Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024. MemoryBank: Enhancing Large Language Models with Long-Term Memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17): 19724–19731

2024

[64] [67]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S.; Xu, F. F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; Alon, U.; and Neubig, G. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [68]

Zhu, L.; Wang, X.; and Wang, X. 2023. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv:2310.17631

work page arXiv 2023