pith. sign in

arxiv: 2606.01326 · v1 · pith:MAD2ZZSOnew · submitted 2026-05-31 · 💻 cs.SE

Reducing Token Usage of State-in-Context Agents using Minification

Pith reviewed 2026-06-28 16:39 UTC · model grok-4.3

classification 💻 cs.SE
keywords state-in-context agentscode minificationtoken reductionSWE-bench Verifiedsoftware engineering agentsinput transformationsefficiency optimization
0
0 comments X

The pith

Minifying source code in state-in-context agents cuts average input token use by 42 percent while dropping resolution rate by 12 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates the DirectSolve state-in-context agent and tests it on the full SWE-bench Verified benchmark. It identifies source code as the main driver of token consumption and applies a series of minification transformations that remove or shorten non-essential lexical elements while attempting to keep program semantics intact. Experiments show the transformations deliver a 42 percent reduction in tokens at the cost of a 12-point fall in the fraction of tasks solved. A reader would care because token limits currently constrain how much real code an agent can examine in one pass.

Core claim

By integrating code minification transformations into the state-in-context agent, average input token usage falls by 42 percent and resolution rate on SWE-bench Verified declines by 12 percentage points from the unminified baseline, while still retaining a substantial fraction of the original performance.

What carries the argument

Code minification transformations that shorten or remove non-essential lexical elements from source code while preserving program semantics.

If this is right

  • Agents can handle larger code contexts inside fixed token budgets.
  • The same transformations can be added to other agent pipelines to lower inference cost.
  • The observed retention of most baseline performance shows that the minification steps are largely semantics-preserving for the tasks tested.
  • Full-benchmark numbers supply a concrete reference point for measuring future token-reduction methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Selective application of only the least disruptive minification steps could narrow the performance gap.
  • The same token-saving pattern may appear in other LLM-based coding systems that ingest full source files.
  • The results point to a broader design choice between context richness and cost that future agent work will need to navigate.

Load-bearing premise

The chosen minification steps keep enough of the original program meaning that the agent's reasoning and repair steps stay effective enough to produce the reported performance level.

What would settle it

Re-running the identical agent and benchmark with and without each minification step while logging whether the performance drop correlates with specific semantic changes introduced by the transformations.

Figures

Figures reproduced from arXiv: 2606.01326 by J\"urgen Cito, Nicolas Hrubec.

Figure 1
Figure 1. Figure 1: Distribution of Ranking and Repair Tokens. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance vs. per-issue cost by transformation with GPT-4.1 on a 100-sample SWE-bench Verified subset. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Resolved instances vs. average input tokens with stacked code minification transformations for GPT-4.1 and GPT-5- [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparing instance-level resolutions on SWE-bench Verified when no transformations (top) and all transformations [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

This paper presents a replication and extension of the recently introduced state-in-context agent framework. We independently re-implement the DirectSolve variant and evaluate it on the SWE-bench Verified benchmark. We report end-to-end full-benchmark results using GPT-5-mini and run selected ablations with GPT-4.1. In addition, we investigate a complementary research question: What is the impact of token-reducing input transformation strategies on the performance of software engineering agents? Based on a preliminary prompt analysis, we identify source code as the dominant contributor to token consumption. We therefore apply a series of code minification techniques that remove or shorten non-essential lexical elements while preserving program semantics. The proposed transformations are integrated into the agent and systematically evaluated. Experiments show that minification reduces average input token usage by 42% with a 12 percentage-point drop in resolution rate. These findings demonstrate that lightweight source code transformations can yield substantial efficiency gains while retaining a substantial fraction of the baseline performance, indicating a promising path toward more cost-effective agents. The full implementation is publicly available on GitHub: https://github.com/ipa-lab/minified-state-in-context-agent

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper re-implements the DirectSolve variant of the state-in-context agent framework and evaluates it end-to-end on SWE-bench Verified using GPT-5-mini (with selected ablations on GPT-4.1). It additionally proposes a set of code minification transformations applied to source code in the agent inputs, claiming these reduce average input token usage by 42% while incurring only a 12 percentage-point drop in resolution rate.

Significance. If the central empirical result holds, the work shows that lightweight, semantics-preserving input transformations can deliver substantial efficiency gains for state-in-context software engineering agents. The public release of the full implementation is a clear strength that supports reproducibility and follow-on work.

major comments (1)
  1. [Minification techniques (following the preliminary prompt analysis)] The description of the minification techniques asserts that they 'remove or shorten non-essential lexical elements while preserving program semantics,' yet supplies no supporting evidence (no test-suite equivalence checks, no behavioral comparison of agent outputs on original vs. minified code, and no formal argument). This is load-bearing for the headline claim, because the observed 12pp resolution drop could be explained by corrupted task inputs rather than by the intended efficiency mechanism.
minor comments (2)
  1. The reported end-to-end numbers lack any information on variance, statistical significance, or confidence intervals.
  2. No details are given on the fidelity of the re-implementation relative to the original DirectSolve work.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the constructive review and recommendation. We address the major comment below.

read point-by-point responses
  1. Referee: [Minification techniques (following the preliminary prompt analysis)] The description of the minification techniques asserts that they 'remove or shorten non-essential lexical elements while preserving program semantics,' yet supplies no supporting evidence (no test-suite equivalence checks, no behavioral comparison of agent outputs on original vs. minified code, and no formal argument). This is load-bearing for the headline claim, because the observed 12pp resolution drop could be explained by corrupted task inputs rather than by the intended efficiency mechanism.

    Authors: We agree that the manuscript would be strengthened by explicit evidence for semantic preservation. The original submission described the transformations (comment removal, whitespace normalization, and safe identifier shortening) as standard lexical minifications but did not include verification. In the revised version we will add: (1) test-suite equivalence results on a sample of 20 SWE-bench Verified repositories showing that minified code passes the same tests as the original, and (2) a behavioral comparison of agent outputs on a small subset of tasks using both original and minified inputs. These empirical checks will support that the 12pp drop stems from the efficiency mechanism rather than input corruption. A full formal argument for arbitrary programs is outside the paper's scope. revision: yes

Circularity Check

0 steps flagged

Empirical replication study with no derivation chain or fitted predictions

full rationale

The paper is a replication and measurement study: it re-implements an agent, applies a fixed set of code minification transformations, and reports direct experimental outcomes (token counts and resolution rates) on SWE-bench Verified. No equations, parameters, or predictions are defined; the results are raw benchmark measurements. No self-citation load-bearing steps, ansatzes, or renamings appear. The semantic-preservation assumption is an unvalidated modeling choice but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical replication and extension study. It relies on standard assumptions of benchmark validity and semantic preservation under minification but introduces no new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5729 in / 1069 out tokens · 33466 ms · 2026-06-28T16:39:21.541947+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe- bench-verified/ Accessed: 2025-10-04

  2. [2]

    Gheorghe Comanici et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabil- ities.CoRRabs/2507.06261 (2025). arXiv:2507.06261 doi:10.48550/ARXIV.2507. 06261

  3. [3]

    2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. 2025.Context Rot: How Increasing Input Tokens Impacts LLM Performance. Technical Report. Chroma. https: //research.trychroma.com/context-rot

  4. [4]

    Lastras, Pavan Kapanipathi, and Tatsunori Hashimoto

    Mingjian Jiang, Yangjun Ruan, Luis A. Lastras, Pavan Kapanipathi, and Tatsunori Hashimoto. 2025. Putting It All into Context: Simplifying Agents with LCLMs. CoRRabs/2505.08120 (2025). arXiv:2505.08120 doi:10.48550/ARXIV.2505.08120

  5. [5]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=VTF8yNQM66

  6. [6]

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. 2025. LLMs Get Lost In Multi-Turn Conversation.CoRRabs/2505.06120 (2025). arXiv:2505.06120 doi:10.48550/ARXIV.2505.06120

  7. [7]

    Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov. 2025. The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management.CoRRabs/2508.21433 (2025). arXiv:2508.21433 doi:10.48550/ARXIV.2508.21433

  8. [8]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Trans. Assoc. Comput. Linguistics12 (2024), 157–173. doi:10. 1162/TACL_A_00638

  9. [9]

    MDN. [n. d.]. Minification — MDN Web Docs. https://developer.mozilla.org/en- US/docs/Glossary/Minification Accessed: 2025-10-16

  10. [10]

    OpenAI. [n. d.]. OpenAI API Pricing. https://platform.openai.com/docs/pricing. [accessed 2025-12-04]

  11. [11]

    OpenAI. 2025. Introducing GPT-5. https://openai.com. Accessed: 2025-09-11

  12. [12]

    OpenAI. 2025. tiktoken: fast BPE tokenizer for use with OpenAI’s models. https: //github.com/openai/tiktoken. [accessed 2025-12-04]

  13. [13]

    OpenAI and Josh Achiam et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv.org/abs/2303.08774

  14. [14]

    Dangfeng Pan, Zhensu Sun, Cenyuan Zhang, David Lo, and Xiaoning Du. 2025. The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget.CoRRabs/2508.13666 (2025). arXiv:2508.13666 doi:10.48550/ARXIV. 2508.13666

  15. [15]

    Chi, Nathanael Schärli, and Denny Zhou

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large Language Models Can Be Easily Distracted by Irrelevant Context. InInternational Conference on Machine Learn- ing, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Kraus...

  16. [16]

    Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, and Eric Karl Oermann. 2025. Medical large language models are easily distracted. arXiv:2504.01201 [cs.CL] https://arxiv.org/abs/2504.01201

  17. [17]

    Solved Issues

    You Wang, Michael Pradel, and Zhongxin Liu. 2025. Are "Solved Issues" in SWE- bench Really Solved Correctly? An Empirical Study.CoRRabs/2503.15223 (2025). arXiv:2503.15223 doi:10.48550/ARXIV.2503.15223

  18. [18]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based Software Engineering Agents.CoRR abs/2407.01489 (2024). arXiv:2407.01489 doi:10.48550/ARXIV.2407.01489

  19. [19]

    Yuan-An Xiao, Pengfei Gao, Chao Peng, and Yingfei Xiong. 2025. Improving the Efficiency of LLM Agent Systems through Trajectory Reduction.CoRR abs/2509.23586 (2025). arXiv:2509.23586 doi:10.48550/ARXIV.2509.23586

  20. [20]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Inter- faces Enable Automated Software Engineering. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, Dece...

  21. [21]

    shortcut

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, D...

  22. [22]

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. M...

  23. [23]

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. 2025. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents.CoRR abs/2506.15841 (2025). arXiv:2506.15841 doi:10.48550/ARXIV.2506.15841