
arxiv: 2605.15315 · v1 · submitted 2026-05-14 · 💻 cs.AI · cs.CL

Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

Pith reviewed 2026-05-19 16:10 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: context pruning, coding agents, LLM agents, multi-rubric reasoning, CRF, AST analysis, token efficiency, context compression

The pith

LaMR decomposes code relevance into separate semantic and dependency models to prune agent context without losing performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that single-objective sequence labelers create a bottleneck for coding agent context pruning because one transition matrix cannot capture both contiguous semantic spans and sparse structural dependencies. LaMR solves this by modeling each dimension with its own CRF, using a query-conditioned mixture-of-experts gate to fuse emissions, and deriving supervision labels from AST analysis to denoise the original binary labels. This structured approach lets the model discard distracting code while preserving task-critical information. A sympathetic reader would care because coding agents spend most of their token budget on repository files, so better pruning directly cuts cost and latency. Experiments show the method frequently matches or exceeds full-context baselines on real benchmarks.

Core claim

LaMR is a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. Multi-rubric labels are derived from the existing training corpus via AST-based program analysis, which simultaneously denoises the teacher's binary labels.
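The gating step can be illustrated with a minimal sketch. It assumes the gate produces one logit per rubric and that fusion is a convex combination of the per-rubric emission scores; the paper's actual fusion operator may differ, and the function names and toy numbers here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_emissions(gate_logits, rubric_emissions):
    """Fuse per-rubric emissions with query-conditioned gate weights.

    gate_logits: one logit per rubric (assumed to come from a gating
        network applied to the query; that network is not modeled here).
    rubric_emissions: rubric_emissions[r][t][y] is the score of label y
        for line t under rubric r.
    Returns the gate weights and the fused emissions fused[t][y].
    """
    weights = softmax(gate_logits)
    n_lines = len(rubric_emissions[0])
    n_labels = len(rubric_emissions[0][0])
    fused = [
        [
            sum(w * em[t][y] for w, em in zip(weights, rubric_emissions))
            for y in range(n_labels)
        ]
        for t in range(n_lines)
    ]
    return weights, fused


# Toy emissions for two lines and two labels (0 = prune, 1 = keep).
semantic = [[0.0, 2.0], [1.0, 0.0]]
dependency = [[2.0, 0.0], [0.0, 1.0]]
weights, fused = fuse_emissions([0.0, 0.0], [semantic, dependency])
```

With equal gate logits both rubrics receive weight 0.5; a query that shifts the logits toward one rubric moves the fused scores toward that rubric's emissions before the final CRF decodes them.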

What carries the argument

The LaMR framework, which uses two separate CRFs for semantic evidence and dependency support together with a mixture-of-experts gate that fuses their emissions before a final decision CRF.
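To see why the transition matrix matters, the following sketch runs standard Viterbi decoding for a two-label linear-chain CRF under two different transition priors. The emission and transition values are made-up toy numbers, not parameters from the paper.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path for a linear-chain CRF.

    emissions[t][y]: score of label y at position t.
    transitions[a][b]: score of moving from label a to label b.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for y in range(n_labels):
            best_prev = max(
                range(n_labels), key=lambda a: score[a] + transitions[a][y]
            )
            ptrs.append(best_prev)
            new_score.append(
                score[best_prev] + transitions[best_prev][y] + emissions[t][y]
            )
        score = new_score
        backptr.append(ptrs)
    # Backtrack from the best final label.
    y = max(range(n_labels), key=lambda a: score[a])
    path = [y]
    for ptrs in reversed(backptr):
        y = ptrs[y]
        path.append(y)
    return path[::-1]


# Labels: 0 = prune, 1 = keep. Line 2 has weak evidence for pruning.
emissions = [[0, 2], [0, 2], [1, 0], [0, 2], [0, 2]]
neutral = [[0, 0], [0, 0]]       # no preference for staying in a label
sticky = [[1.5, 0], [0, 1.5]]    # rewards staying in the same label

sparse_path = viterbi(emissions, neutral)
contiguous_path = viterbi(emissions, sticky)
```

Under the neutral prior the middle line is pruned, which suits sparse structural lines; under the sticky prior the gap is filled and the span stays contiguous. A single shared matrix has to pick one of these behaviors, which is the bottleneck the per-rubric CRFs are meant to avoid.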

If this is right

  • LaMR wins 12 of 16 head-to-head multi-turn comparisons on the four evaluated benchmarks.
  • The method saves up to 31 percent more tokens than prior pruners on multi-turn agent tasks.
  • Exact Match improves by up to 3.5 points on single-turn tasks while performance remains competitive with full context.
  • Context denoising from the multi-rubric approach frequently raises task accuracy above the unpruned baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of relevance dimensions could be tested on long-document question answering or retrieval-augmented generation outside code.
  • Adding further rubrics for aspects such as runtime behavior or security properties might extend the framework without new human labels.
  • Lower token budgets from pruning could allow agents to maintain longer interaction histories within fixed compute limits.
  • The observed outperformance over full context suggests that selective context may become a general principle for noisy long-context agent tasks.

Load-bearing premise

That AST-based program analysis can reliably generate labels for the two relevance dimensions and denoise the original binary labels without systematic bias or missing key patterns.
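A rough sketch of what an AST-derived dependency label could look like, using Python's standard `ast` module. The heuristic (mark top-level imports, assignments, and definitions whose names the target function reads) and the names `dependency_support_lines` and `SOURCE` are assumptions for illustration, not the paper's labeling procedure.

```python
import ast


def dependency_support_lines(source, target_func):
    """Return 1-based line numbers of top-level statements that bind a
    name read inside `target_func` (a simple, hypothetical heuristic)."""
    tree = ast.parse(source)
    target = next(
        n for n in tree.body
        if isinstance(n, ast.FunctionDef) and n.name == target_func
    )
    # Names that the target function loads (reads).
    used = {
        n.id for n in ast.walk(target)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }
    support = set()
    for node in tree.body:
        if node is target:
            continue
        bound = set()
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            bound = {(a.asname or a.name).split(".")[0] for a in node.names}
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            bound = {node.name}
        elif isinstance(node, ast.Assign):
            bound = {t.id for t in node.targets if isinstance(t, ast.Name)}
        if bound & used:
            support.update(range(node.lineno, node.end_lineno + 1))
    return sorted(support)


SOURCE = """import os
import json

LIMIT = 10

def parse_config(path):
    # mentions load_settings but is never called by it
    return json.load(open(path))

def load_settings(name):
    full = os.path.join("conf", name)
    return full[:LIMIT]
"""

support = dependency_support_lines(SOURCE, "load_settings")
```

Here the heuristic marks `import os` and `LIMIT = 10` as dependency support for `load_settings`, while `parse_config` is left unmarked even though its comment mentions the target. Purely lexical relevance of that kind is exactly what a structural analysis cannot see, which is why the premise is load-bearing.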

What would settle it

A controlled experiment on a new benchmark in which LaMR-pruned contexts produce consistently lower task success rates than the corresponding full contexts would falsify the claim that the pruning preserves or improves performance.

Figures

Figures reproduced from arXiv: 2605.15315 by Ana S. Carreon-Rascon, Feiyang Cai, Feng Luo, Huayu Li, Jingjing Wang, Wenhui Zhu, Xiwen Chen, Xuanzhao Dong, ZhengXiao He.

Figure 1: Effect of LaMR across two backbone models and three benchmarks.
Figure 2: A single-objective pruner collapses semantic and structural relevance into one score,
Figure 3: The LaMR workflow. Operating as an agentic middleware, it intercepts file reads, routes features through parallel latent rubrics, and returns a syntactically pruned context via the loop.

Base training corpus. We build on the training set constructed by SWE-Pruner [4]: code snippets sampled from high-quality GitHub repositories are paired with task-oriented queries synthesized by a teacher LLM. Each example…
read the original abstract

LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.
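Once a keep-or-prune decision exists for every line, turning it into a pruned file view is mechanical. The sketch below assumes a per-line 0/1 mask and collapses each pruned run into a single placeholder comment; the placeholder format and the function name `apply_keep_mask` are hypothetical and not taken from the paper.

```python
def apply_keep_mask(lines, keep, marker="# ... ({n} lines pruned)"):
    """Keep lines whose mask entry is truthy and replace each run of
    pruned lines with one placeholder line."""
    if len(lines) != len(keep):
        raise ValueError("lines and keep must have the same length")
    out, run = [], 0
    for line, k in zip(lines, keep):
        if k:
            if run:
                out.append(marker.format(n=run))
                run = 0
            out.append(line)
        else:
            run += 1
    if run:
        out.append(marker.format(n=run))
    return out


lines = ["import os", "x = 1", "y = 2", "print(os.getcwd())"]
pruned = apply_keep_mask(lines, [1, 0, 0, 1])
```

Leaving a placeholder rather than silently deleting lines tells the agent that content was elided, so it can re-read the file if the pruned region turns out to matter.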

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LaMR (Latent Multi-Rubric), a structured pruning framework for LLM coding agents. It decomposes code relevance into two dimensions—semantic evidence and dependency support—each modeled by a dedicated CRF with its own transition dynamics. A mixture-of-experts gating network weights the per-rubric emissions based on the query, and a final CRF produces the keep/prune decisions. Multi-rubric labels are derived via AST-based program analysis from the existing training corpus to supervise the rubrics and denoise the original binary teacher labels. On four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA), LaMR wins 12 of 16 head-to-head multi-turn comparisons, saves up to 31% more tokens on multi-turn tasks, and improves Exact Match by up to +3.5 on single-turn tasks, often matching or outperforming unpruned full-context baselines.

Significance. If the central claims hold after addressing supervision validation, LaMR could meaningfully advance efficient context management for repository-scale coding agents by replacing monolithic relevance scoring with interpretable, dimension-specific structured models. The approach of deriving multi-rubric supervision from existing AST analysis without new annotations is a practical strength that could generalize to other structured prediction tasks in code.

major comments (2)
  1. [Section 3.2] Section 3.2 (Multi-Rubric Label Derivation): The central claim that the two-rubric decomposition plus denoising outperforms single-CRF baselines rests on the assumption that AST-based program analysis yields reliable per-dimension supervision. However, the manuscript provides no quantitative validation or error analysis showing that syntactic dependencies and structural spans align with query-conditioned semantic relevance; lexical matches to query keywords in non-called helpers, for example, would be invisible to the AST and could systematically misalign the separate CRF transition matrices and MoE gating. This makes it unclear whether reported token savings and Exact-Match gains are attributable to the multi-rubric architecture or to the particular denoising heuristic.
  2. [Section 4.2] Section 4.2 (Main Results, Table 2): The reported 12/16 head-to-head wins and up to +3.5 Exact-Match improvement are presented without ablation isolating the contribution of the dual-CRF structure and query-conditioned gating from the effect of label denoising alone. A single-CRF baseline trained on the same denoised labels would be required to establish that the architectural decomposition itself is load-bearing for the gains.
minor comments (2)
  1. [Figure 2] Figure 2: The diagram of the fused final CRF would benefit from explicit arrows showing how the MoE-weighted emissions are concatenated before the final transition matrix is applied.
  2. [Section 4.1] Section 4.1: The description of the four benchmarks could include a brief note on average context lengths and typical repository sizes to contextualize the token-saving claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional evidence would strengthen the claims regarding label quality and architectural contributions. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (Multi-Rubric Label Derivation): The central claim that the two-rubric decomposition plus denoising outperforms single-CRF baselines rests on the assumption that AST-based program analysis yields reliable per-dimension supervision. However, the manuscript provides no quantitative validation or error analysis showing that syntactic dependencies and structural spans align with query-conditioned semantic relevance; lexical matches to query keywords in non-called helpers, for example, would be invisible to the AST and could systematically misalign the separate CRF transition matrices and MoE gating. This makes it unclear whether reported token savings and Exact-Match gains are attributable to the multi-rubric architecture or to the particular denoising heuristic.

    Authors: We agree that explicit quantitative validation of the AST-derived multi-rubric labels would provide stronger support for the supervision strategy. The current manuscript relies on the established reliability of AST analysis for capturing dependencies and structural spans, combined with the observation that LaMR frequently matches or exceeds full-context baselines, as indirect evidence that the labels are effective for denoising. However, we acknowledge the potential for misalignment in cases such as lexical matches outside called functions. In the revised manuscript we will add a dedicated error analysis subsection in Section 3.2, including a small-scale manual inspection of label alignment on sampled queries and discussion of edge cases. revision: yes

  2. Referee: [Section 4.2] Section 4.2 (Main Results, Table 2): The reported 12/16 head-to-head wins and up to +3.5 Exact-Match improvement are presented without ablation isolating the contribution of the dual-CRF structure and query-conditioned gating from the effect of label denoising alone. A single-CRF baseline trained on the same denoised labels would be required to establish that the architectural decomposition itself is load-bearing for the gains.

    Authors: We concur that an ablation isolating the dual-CRF plus MoE gating from the denoising effect alone is necessary to attribute gains specifically to the multi-rubric architecture. The existing experiments compare against full-context and other pruners but do not report a single-CRF model trained on the identical denoised labels. We will add this baseline to the revised Table 2 (or a new ablation table) and update the discussion in Section 4.2 to quantify the incremental benefit of the structured multi-rubric decomposition. revision: yes
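The label-alignment check promised in the first response could be as simple as comparing AST-derived labels against a manually inspected sample. The sketch below computes precision, recall, and Cohen's kappa for binary per-line labels; the example labels are invented for illustration and are not results from the paper.

```python
def label_agreement(derived, manual):
    """Compare derived binary labels with manual binary labels.

    Returns precision and recall of the derived labels (treating the
    manual labels as reference) and Cohen's kappa for chance-corrected
    agreement.
    """
    if len(derived) != len(manual):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(derived, manual))
    tp = sum(1 for d, m in pairs if d == 1 and m == 1)
    fp = sum(1 for d, m in pairs if d == 1 and m == 0)
    fn = sum(1 for d, m in pairs if d == 0 and m == 1)
    tn = sum(1 for d, m in pairs if d == 0 and m == 0)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return {"precision": precision, "recall": recall, "kappa": kappa}


# Hypothetical labels for six lines (1 = keep, 0 = prune).
derived_labels = [1, 1, 0, 0, 1, 0]
manual_labels = [1, 0, 0, 0, 1, 1]
report = label_agreement(derived_labels, manual_labels)
```

Reporting kappa alongside precision and recall guards against inflated agreement when most lines are pruned under both labelings.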

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core supervision derives multi-rubric labels for semantic evidence and dependency support directly from AST-based program analysis applied to the existing training corpus; this process is external and independent of the model's parameters, outputs, or fitted quantities. The LaMR architecture (separate CRFs per rubric, query-conditioned MoE gating, and final fused CRF) is then trained on these AST-derived labels to produce keep-or-prune decisions, with the original teacher binary labels denoised as a byproduct of the same AST analysis. Downstream performance claims (token savings, Exact Match gains, head-to-head wins on SWE-Bench etc.) are evaluated empirically on held-out benchmarks rather than being forced by construction from the inputs. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the described chain; the method remains falsifiable through the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the ability of AST analysis to produce useful supervisory signals for the two rubrics and on the assumption that heterogeneous retention patterns are best captured by separate transition dynamics.

axioms (1)
  • Domain assumption: AST-based program analysis produces reliable multi-rubric labels that denoise binary teacher labels without new biases.
    Invoked to enable supervision at no extra annotation cost.

pith-pipeline@v0.9.0 · 5845 in / 1321 out tokens · 51397 ms · 2026-05-19T16:10:33.628957+00:00 · methodology



Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 6 internal anchors

  1. [1]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  2. [2]

    OpenHands: An Open Platform for AI Soft...

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...

  3. [3]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based Software Engineering Agents. 2024

  4. [4]

    Swe-pruner: Self-adaptive context pruning for coding agents

    Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, and Xiaodong Gu. Swe-pruner: Self-adaptive context pruning for coding agents, 2026

  5. [5]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023

  6. [6]

    Llmlingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  7. [7]

    Compressing context to enhance inference efficiency of large language models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  8. [8]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1):32, 2023

  9. [9]

    Longcodezip: Compress long context for code language models

    Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Longcodezip: Compress long context for code language models. arXiv preprint arXiv:2510.00446, 2025

  10. [10]

    Conditional random fields as recurrent neural networks

    Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1529–1537, 2015

  11. [11]

    UniXcoder: Unified Cross-Modal Pre-training for Code Representation

    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850, 2022

  12. [12]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

  13. [13]

    Swe-bench: Can language models resolve real-world github issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. Swe-bench: Can language models resolve real-world github issues? In ICLR, 2024

  14. [14]

    SWE-QA: Can Language Models Answer Repository-level Code Questions?

    Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. Swe-qa: Can language models answer repository-level code questions? arXiv preprint arXiv:2509.14635, 2025

  15. [15]

    Longcoder: A long-range pre-trained language model for code completion

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. In International Conference on Machine Learning, pages 12098–12107. PMLR, 2023

  16. [16]

    Longcodebench: Evaluating coding llms at 1m context windows

    Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. Longcodebench: Evaluating coding llms at 1m context windows. arXiv preprint arXiv:2505.07897, 2025

  17. [17]

    Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics ACL 2024, pages 963–981, 2024

  18. [18]

    Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  19. [19]

    Learning to compress prompts with gist tokens

    Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36:19327–19352, 2023

  20. [20]

    Repocoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023

  21. [21]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  22. [22]

    Diet code is healthy: Simplifying programs for pre-trained models of code

    Zhaowei Zhang, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Diet code is healthy: Simplifying programs for pre-trained models of code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1073–1084, 2022

  23. [23]

    Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

    Lei Zhang, Yunshui Li, Jiaming Li, Xiaobo Xia, Jiaxi Yang, Run Luo, Minzheng Wang, Longze Chen, Junhao Liu, and Min Yang. Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs, 2024

  24. [24]

    Natural is the best: Model-agnostic code simplification for pre-trained large language models

    Yan Wang, Xiaoning Li, Tien N Nguyen, Shaohua Wang, Chao Ni, and Ling Ding. Natural is the best: Model-agnostic code simplification for pre-trained large language models. Proceedings of the ACM on Software Engineering, 1(FSE):586–608, 2024

  25. [25]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023

  26. [26]

    Deep code search

    Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, pages 933–944, 2018

  27. [27]

    Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling

    Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, et al. Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling. arXiv preprint arXiv:2510.17314, 2025

  28. [28]

    Interpretable preferences via multi-objective reward modeling and mixture-of-experts

    Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10582–10592, 2024

  29. [29]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs Get Lost In Multi-Turn Conversation. 2025

  30. [30]

    Scaling llm multi-turn rl with end-to-end summarization-based context management

    Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, and Jiecao Chen. Scaling llm multi-turn rl with end-to-end summarization-based context management. arXiv preprint arXiv:2510.06727, 2025

  31. [31]

    Cursor – the ai code editor

    Cursor. Cursor – the ai code editor, 2025

  32. [32]

    Claude code: Built for developers

    Anthropic. Claude code: Built for developers, 2025

  33. [33]

    Scaling long-horizon llm agent via context-folding

    Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon llm agent via context-folding. arXiv preprint arXiv:2510.11967, 2025

  34. [34]

    The complexity trap: Simple observation masking is as efficient as llm summarization for agent context management

    Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov. The complexity trap: Simple observation masking is as efficient as llm summarization for agent context management. arXiv preprint arXiv:2508.21433, 2025

  35. [35]

    Acon: Optimizing context compression for long-horizon llm agents

    Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents. arXiv preprint arXiv:2510.00615, 2025

  36. [36]

    Agentfold: Long-horizon web agents with proactive context management

    Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context management. arXiv preprint arXiv:2510.24699, 2025

  37. [37]

    Compass: Enhancing agent long-horizon reasoning with evolving context

    Guangya Wan, Mingyang Ling, Xiaoqi Ren, Rujun Han, Sheng Li, and Zizhao Zhang. Compass: Enhancing agent long-horizon reasoning with evolving context. arXiv preprint arXiv:2510.08790, 2025

A Related Work

Prompt and code-context compression. Token-level pruning methods such as LLMLingua [6, 18], Selective-Context [7], and gist-token distillation [19] com...