Reducing Token Usage of State-in-Context Agents using Minification

Jürgen Cito; Nicolas Hrubec

arxiv: 2606.01326 · v1 · pith:MAD2ZZSOnew · submitted 2026-05-31 · 💻 cs.SE

Pith reviewed 2026-06-28 16:39 UTC · model grok-4.3

keywords: state-in-context agents, code minification, token reduction, SWE-bench Verified, software engineering agents, input transformations, efficiency optimization

The pith

Minifying source code in state-in-context agents cuts average input token use by 42 percent while dropping resolution rate by 12 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates the DirectSolve state-in-context agent and tests it on the full SWE-bench Verified benchmark. It identifies source code as the main driver of token consumption and applies a series of minification transformations that remove or shorten non-essential lexical elements while attempting to keep program semantics intact. Experiments show the transformations deliver a 42 percent reduction in tokens at the cost of a 12-point fall in the fraction of tasks solved. A reader would care because token limits currently constrain how much real code an agent can examine in one pass.

Core claim

By integrating code minification transformations into the state-in-context agent, average input token usage falls by 42 percent and resolution rate on SWE-bench Verified declines by 12 percentage points from the unminified baseline, while still retaining a substantial fraction of the original performance.
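
To make the trade-off concrete, the two headline numbers can be combined into input tokens spent per resolved task. The sketch below is an illustration, not a calculation from the paper: the baseline resolution rate is not stated in this summary, so `baseline_rate` is a hypothetical input, and the model assumes average input tokens per task do not depend on whether the task is resolved. Output tokens and pricing are ignored.

```python
TOKEN_REDUCTION = 0.42  # from the paper: average input tokens fall by 42%
RESOLUTION_DROP = 0.12  # from the paper: resolution rate falls by 12 points


def relative_tokens_per_resolved(baseline_rate: float) -> float:
    """Input tokens per resolved task with minification, divided by the
    same quantity without it. Values below 1 mean minification is cheaper
    per resolved task. `baseline_rate` is a hypothetical fraction in (0, 1]."""
    minified_rate = baseline_rate - RESOLUTION_DROP
    if minified_rate <= 0:
        raise ValueError("baseline rate must exceed the 12-point drop")
    return (1 - TOKEN_REDUCTION) * baseline_rate / minified_rate


def break_even_baseline() -> float:
    """Baseline rate at which both settings cost the same per resolved task.
    Solving (1 - t) * r / (r - d) = 1 for r gives r = d / t."""
    return RESOLUTION_DROP / TOKEN_REDUCTION


if __name__ == "__main__":
    print(f"break-even baseline rate: {break_even_baseline():.3f}")
    for rate in (0.20, 0.40, 0.60):  # hypothetical baselines, not reported values
        print(rate, round(relative_tokens_per_resolved(rate), 3))
```

Under these assumptions, minification lowers input tokens per resolved task only when the unminified agent resolves more than about 29 percent of tasks; below that, the lost resolutions outweigh the token savings.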

What carries the argument

Code minification transformations that shorten or remove non-essential lexical elements from source code while preserving program semantics.
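
As a rough illustration of what such a lexical transformation can look like for Python sources, the sketch below round-trips code through the standard-library `ast` module, which discards `#` comments, blank lines, and redundant whitespace. This is an assumed stand-in, not the paper's implementation; `minify_python` and `char_reduction` are hypothetical names, and character count is only a crude proxy for token count.

```python
import ast


def minify_python(source: str) -> str:
    """Round-trip through the AST (Python 3.9+). This drops '#' comments,
    blank lines, and redundant whitespace; docstrings and identifiers
    are left untouched."""
    return ast.unparse(ast.parse(source)) + "\n"


def char_reduction(original: str, minified: str) -> float:
    """Fraction of characters removed (a crude stand-in for token savings)."""
    return 1.0 - len(minified) / len(original)


EXAMPLE = '''\
# Compute a running total.
def total(values):

    result = 0   # accumulator
    for v in values:
        result += v      # add each item

    return result
'''

if __name__ == "__main__":
    small = minify_python(EXAMPLE)
    print(small)
    print(f"characters removed: {char_reduction(EXAMPLE, small):.0%}")
```

Because the output is regenerated from the parsed tree, this particular step cannot change program behavior, but it also cannot shorten identifiers; any renaming step would need separate validation.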

If this is right

Agents can handle larger code contexts inside fixed token budgets.
The same transformations can be added to other agent pipelines to lower inference cost.
The retention of most baseline performance is consistent with the minification steps being largely semantics-preserving for the tasks tested, though the paper does not verify this directly.
Full-benchmark numbers supply a concrete reference point for measuring future token-reduction methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Selective application of only the least disruptive minification steps could narrow the performance gap.
The same token-saving pattern may appear in other LLM-based coding systems that ingest full source files.
The results point to a broader design choice between context richness and cost that future agent work will need to navigate.

Load-bearing premise

The chosen minification steps keep enough of the original program meaning that the agent's reasoning and repair steps stay effective enough to produce the reported performance level.
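
One inexpensive way to probe this premise for comment and whitespace removal is to compare parsed syntax trees before and after minification. The sketch below is an assumption about how such a check could look, not something the paper reports doing; `same_ast` is a hypothetical helper and applies only to Python sources.

```python
import ast


def same_ast(original: str, minified: str) -> bool:
    """True when both sources parse to structurally identical ASTs.
    ast.dump omits line and column attributes by default, so layout
    differences do not affect the comparison."""
    try:
        return ast.dump(ast.parse(original)) == ast.dump(ast.parse(minified))
    except SyntaxError:
        # A transformation that breaks parsing is not semantics-preserving.
        return False


if __name__ == "__main__":
    original = "x = 1  # one\n\n\ny = x + 1\n"
    stripped = "x = 1\ny = x + 1\n"   # comments and blank lines removed
    renamed = "a = 1\ny = a + 1\n"    # identifier shortened
    print(same_ast(original, stripped))
    print(same_ast(original, renamed))
```

The check is deliberately strict: identifier shortening changes the tree and is reported as a mismatch even when behavior is unchanged, so renaming steps would still need test-suite runs to validate.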

What would settle it

Re-running the identical agent and benchmark with and without each minification step while logging whether the performance drop correlates with specific semantic changes introduced by the transformations.

Figures

Figures reproduced from arXiv: 2606.01326 by Jürgen Cito, Nicolas Hrubec.

**Figure 2.** Performance vs. per-issue cost by transformation with GPT-4.1 on a 100-sample SWE-bench Verified subset.

**Figure 3.** Resolved instances vs. average input tokens with stacked code minification transformations for GPT-4.1 and GPT-5-...

**Figure 4.** Comparing instance-level resolutions on SWE-bench Verified when no transformations (top) and all transformations...

Original abstract

This paper presents a replication and extension of the recently introduced state-in-context agent framework. We independently re-implement the DirectSolve variant and evaluate it on the SWE-bench Verified benchmark. We report end-to-end full-benchmark results using GPT-5-mini and run selected ablations with GPT-4.1. In addition, we investigate a complementary research question: What is the impact of token-reducing input transformation strategies on the performance of software engineering agents? Based on a preliminary prompt analysis, we identify source code as the dominant contributor to token consumption. We therefore apply a series of code minification techniques that remove or shorten non-essential lexical elements while preserving program semantics. The proposed transformations are integrated into the agent and systematically evaluated. Experiments show that minification reduces average input token usage by 42% with a 12 percentage-point drop in resolution rate. These findings demonstrate that lightweight source code transformations can yield substantial efficiency gains while retaining a substantial fraction of the baseline performance, indicating a promising path toward more cost-effective agents. The full implementation is publicly available on GitHub: https://github.com/ipa-lab/minified-state-in-context-agent

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This replication measures a 42% token cut from code minification on a re-implemented state-in-context agent at the cost of 12pp lower resolution on SWE-bench Verified, but offers no direct check that the minified code still behaves the same.

The useful part is the concrete end-to-end measurement. They re-implement the DirectSolve variant, run the full SWE-bench Verified benchmark with GPT-5-mini, and show that applying standard code minification to the dominant source-code sections in the prompt drops average input tokens by 42% while resolution falls 12 points. A few ablations with GPT-4.1 are included and the code is released. That gives practitioners a real data point on the efficiency trade-off for this agent style.

The replication itself looks straightforward and the numbers are reported at benchmark scale rather than toy subsets, which is better than many agent papers. The prompt analysis that led them to target source code is also a reasonable step.

The main gap is on the minification claim. The paper states the transformations remove or shorten non-essential lexical elements while preserving program semantics, yet nothing in the description shows they verified this—no test-suite runs on the minified files, no behavioral equivalence checks, and no discussion of how often the changes might alter observable behavior. Without that, the 12-point drop could partly reflect corrupted task inputs rather than the agent operating on shorter but still correct code. That leaves the headline conclusion about retaining substantial performance open to reinterpretation.

No variance, confidence intervals, or statistical tests are mentioned either, which is common in this area but still limits how firmly the numbers can be read.

This is for engineers who need to control token spend on SWE-bench-style agents and want a quantified example of one simple intervention. It is not advancing new agent architectures or theory. A serious editor should send it to review; the replication and the measured trade-off are worth referee scrutiny even if the semantic-preservation point needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper re-implements the DirectSolve variant of the state-in-context agent framework and evaluates it end-to-end on SWE-bench Verified using GPT-5-mini (with selected ablations on GPT-4.1). It additionally proposes a set of code minification transformations applied to source code in the agent inputs, claiming these reduce average input token usage by 42% while incurring only a 12 percentage-point drop in resolution rate.

Significance. If the central empirical result holds, the work shows that lightweight, semantics-preserving input transformations can deliver substantial efficiency gains for state-in-context software engineering agents. The public release of the full implementation is a clear strength that supports reproducibility and follow-on work.

major comments (1)

[Minification techniques (following the preliminary prompt analysis)] The description of the minification techniques asserts that they 'remove or shorten non-essential lexical elements while preserving program semantics,' yet supplies no supporting evidence (no test-suite equivalence checks, no behavioral comparison of agent outputs on original vs. minified code, and no formal argument). This is load-bearing for the headline claim, because the observed 12pp resolution drop could be explained by corrupted task inputs rather than by the intended efficiency mechanism.

minor comments (2)

The reported end-to-end numbers lack any information on variance, statistical significance, or confidence intervals.
No details are given on the fidelity of the re-implementation relative to the original DirectSolve work.
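
The first minor comment can be made concrete with a standard binomial interval on the resolution rate. The sketch below computes a Wilson score interval; the counts in the demo are hypothetical placeholders rather than results from the paper, and the only external fact used is that SWE-bench Verified contains 500 instances.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.
    z = 1.96 corresponds to a two-sided 95% interval."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    N = 500                # size of SWE-bench Verified
    resolved = 250         # hypothetical count, not a reported result
    low, high = wilson_interval(resolved, N)
    print(f"rate {resolved / N:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

For a rate near 50 percent on 500 instances the interval has a half-width of roughly four points, which gives a sense of scale for reading a 12-point difference. Because both settings are run on the same instances, a paired test on instance-level outcomes would be the sharper tool.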

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive review and recommendation. We address the major comment below.

Referee: [Minification techniques (following the preliminary prompt analysis)] The description of the minification techniques asserts that they 'remove or shorten non-essential lexical elements while preserving program semantics,' yet supplies no supporting evidence (no test-suite equivalence checks, no behavioral comparison of agent outputs on original vs. minified code, and no formal argument). This is load-bearing for the headline claim, because the observed 12pp resolution drop could be explained by corrupted task inputs rather than by the intended efficiency mechanism.

Authors: We agree that the manuscript would be strengthened by explicit evidence for semantic preservation. The original submission described the transformations (comment removal, whitespace normalization, and safe identifier shortening) as standard lexical minifications but did not include verification. In the revised version we will add: (1) test-suite equivalence results on a sample of 20 SWE-bench Verified repositories, checking whether minified code passes the same tests as the original, and (2) a behavioral comparison of agent outputs on a small subset of tasks using both original and minified inputs. These empirical checks are meant to establish whether the 12pp drop stems from the efficiency mechanism rather than input corruption. A full formal argument for arbitrary programs is outside the paper's scope.

Revision: yes

Circularity Check

0 steps flagged

Empirical replication study with no derivation chain or fitted predictions

The paper is a replication and measurement study: it re-implements an agent, applies a fixed set of code minification transformations, and reports direct experimental outcomes (token counts and resolution rates) on SWE-bench Verified. No equations, parameters, or predictions are defined; the results are raw benchmark measurements. No self-citation load-bearing steps, ansatzes, or renamings appear. The semantic-preservation assumption is an unvalidated modeling choice but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical replication and extension study. It relies on standard assumptions of benchmark validity and semantic preservation under minification but introduces no new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5729 in / 1069 out tokens · 33466 ms · 2026-06-28T16:39:21.541947+00:00 · methodology
