pith. sign in

arxiv: 2606.22417 · v1 · pith:QEDH52KOnew · submitted 2026-06-21 · 💻 cs.AI

Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent

Pith reviewed 2026-06-26 10:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords coding agentsstructural codebase indexcode retrievalSWE-benchablation studylocalizationtask resolutioncost per solve
0
0 comments X

The pith

Adding a structural codebase index improves localization and task resolution in a fixed coding-agent harness without raising cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether a structural codebase index alters cost or success rates when retrieval methods already vary widely across coding agents. It holds the model and harness fixed while running three arms on SWE-PolyBench Verified and SWE-bench Pro: the index version, the same harness without the index, and an agentic-grep baseline. The within-harness comparison shows large localization gains and statistically separated resolve gains, with no cost penalty per cell and lower cost per solved task. The cross-harness check confirms the index does not regress against the grep baseline on either metric. The practical question therefore shifts from expense to whether the workload contains multi-file edits where structural ranking helps.

Core claim

Within a fixed coding-agent harness on a fixed model, the structural codebase index produces a large localization gain and a statistically separated resolve gain, with no cost penalty per cell and lower cost per solve; it also matches or exceeds an agentic-grep comparator on resolve and localization at no cost penalty.

What carries the argument

structural codebase index, which supplies ranked retrieval over the repository using code structure rather than surface text or simple search.

If this is right

  • The index yields large localization gains inside the fixed harness.
  • Resolve gains reach statistical separation from the no-index arm.
  • Cost per cell stays flat while cost per solved task drops.
  • Performance does not regress against an agentic-grep baseline on resolve or localization.
  • The index becomes relevant precisely when workloads include multi-file changes that benefit from structural ranking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If gains concentrate on multi-file tasks, then agents handling large refactors or cross-module edits would capture most of the benefit.
  • Releasing the per-cell exclusion ledger and leak-audit script lets other groups test the same isolation on their own harnesses.
  • Repeating the three-arm comparison on different models or larger repositories would reveal how stable the cost and resolve advantages remain.

Load-bearing premise

The structural index is the only causal driver of the observed localization and resolve differences, and the chosen benchmarks plus sandbox controls isolate its effect from other harness variables.

What would settle it

An ablation that achieves the same localization and resolve rates as the index version by adjusting only non-structural retrieval parameters while keeping every other harness component identical would show the structural ranking is not required for the gains.

Figures

Figures reproduced from arXiv: 2606.22417 by Adithyan Krishnan, Ishaan Bhola, Mukunda NS, Sravanth Kurmala.

Figure 1
Figure 1. Figure 1: Context-engine pipeline. The upper block runs once [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Cost–resolve plane (mean of seed means; error bars [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: First-gold rank CDF under View B, per arm. Each [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: View B acc@5 by gold-file count, mean of seed means [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Coding agents now interleave LLMs with retrieval over the working repository, and retrieval implementations vary widely across deployed harnesses. Inside a fixed coding-agent harness on a fixed model, does adding a structural codebase index actually change cost or resolve? We ran three arms (the harness with the index, the same harness without it, and an agentic-grep comparator) on SWE-PolyBench Verified and SWE-bench Pro with Claude Opus 4.7 held fixed throughout, across three seeds, inside a leak-audited per-task sandbox. The within-harness ablation produces a large localization gain and a statistically separated resolve gain, with no cost penalty per cell and lower cost per solve. The cross-harness check shows that the index does not regress against an agentic-grep baseline on resolve or localization, again at no cost penalty. We release the per-cell exclusion ledger, the leak-audit script, the localization extractor, and the results database. The deployment question for a structural codebase index is thus not whether it is too expensive to run (across seeds, the index lands at a lower $/solved than agentic grep) but whether the workload includes multi-file changes where structural ranking pays off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates whether adding a structural codebase index to a fixed coding-agent harness changes cost or resolve rates. It runs three arms (harness with index, same harness without index, and agentic-grep comparator) on SWE-PolyBench Verified and SWE-bench Pro using Claude Opus 4.7 across three seeds in a leak-audited sandbox. The central claim is that the within-harness ablation yields large localization gains and statistically separated resolve gains with no per-cell cost penalty and lower cost per solve; the cross-harness check shows no regression versus agentic-grep. The paper releases the per-cell exclusion ledger, leak-audit script, localization extractor, and results database.

Significance. If the ablation results hold under the stated controls, the work would supply direct empirical evidence that structural indices can improve localization and resolution on multi-file tasks without raising per-cell cost, thereby reframing the deployment question around workload characteristics rather than raw expense. The open release of the exclusion ledger, audit script, and results database is a concrete strength that supports reproducibility and independent verification.

major comments (3)
  1. [Methods (ablation design)] Methods (ablation design): The description of the 'same harness without it' arm asserts that the harness and sandbox are fixed but does not explicitly confirm that retrieval logic, context assembly, prompting, and ranking remain bitwise identical when the index is disabled; any incidental difference would undermine the claim that the structural index is the sole causal driver of the reported localization and resolve gains.
  2. [Results (statistical claims)] Results (statistical claims): The abstract states a 'statistically separated resolve gain' without naming the test statistic, sample sizes per arm, exact p-value threshold, or correction for multiple comparisons across benchmarks and seeds; this detail is load-bearing for the central claim of separation.
  3. [Data handling] Data handling: Although the per-cell exclusion ledger is released, the methods section must specify whether exclusion criteria were pre-registered or determined after inspecting outcomes, because post-hoc choices could affect the measured gains on localization and resolve.
minor comments (2)
  1. [Abstract] Abstract: 'Claude Opus 4.7' should be replaced by the precise model identifier used in the experiments.
  2. [Figures and tables] Figures and tables: Captions should explicitly list the benchmark names, number of tasks, and seed count so that each display is self-contained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the referee's detailed and constructive comments. We address each major comment point by point below, with commitments to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: The description of the 'same harness without it' arm asserts that the harness and sandbox are fixed but does not explicitly confirm that retrieval logic, context assembly, prompting, and ranking remain bitwise identical when the index is disabled; any incidental difference would undermine the claim that the structural index is the sole causal driver of the reported localization and resolve gains.

    Authors: We agree that explicit confirmation is required to support the causal attribution. The harness code implements the structural index as a self-contained optional module; toggling it off leaves every other component (retrieval logic, context assembly, prompting templates, and ranking) unchanged at the source level. In the revised manuscript we will add a dedicated sentence in the Methods section stating this bitwise identity explicitly. revision: yes

  2. Referee: The abstract states a 'statistically separated resolve gain' without naming the test statistic, sample sizes per arm, exact p-value threshold, or correction for multiple comparisons across benchmarks and seeds; this detail is load-bearing for the central claim of separation.

    Authors: The statistical details are already present in the Results section and the released database, but the abstract should be self-contained. We will revise the abstract to name the test statistic, state the per-arm sample sizes (three seeds), report the p-value threshold, and describe the multiple-comparison approach. revision: yes

  3. Referee: Although the per-cell exclusion ledger is released, the methods section must specify whether exclusion criteria were pre-registered or determined after inspecting outcomes, because post-hoc choices could affect the measured gains on localization and resolve.

    Authors: We will add an explicit statement to the Methods section indicating whether the exclusion criteria were pre-registered or determined post-inspection, together with the rationale and a pointer to the released ledger. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical ablation on external benchmarks

full rationale

The paper reports measured outcomes from three fixed-harness arms (with-index, without-index, agentic-grep) run on SWE-PolyBench Verified and SWE-bench Pro using a held-fixed model and sandbox. All claims rest on observed localization, resolve, and cost differences across seeds; no equations, fitted parameters, self-citations, or derivations are invoked to produce the results. The within-harness comparison is presented as a controlled measurement rather than a reduction to prior inputs or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the benchmarks and sandbox setup validly measure the index's contribution; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption SWE-PolyBench Verified and SWE-bench Pro are representative of the multi-file coding tasks where structural ranking would matter.
    The paper uses these benchmarks as the evaluation substrate for the ablation.
  • domain assumption The structural index implementation and the within-harness comparison isolate the index effect from other variables.
    This premise underpins the claim that observed gains are attributable to the index.

pith-pipeline@v0.9.1-grok · 5752 in / 1321 out tokens · 29412 ms · 2026-06-26T10:57:07.973774+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 10 linked inside Pith

  1. [1]

    Introducing Claude Opus 4.7

    Anthropic. Introducing Claude Opus 4.7. Anthropic blog post, https://www.anthropic.com/news/claude-opus-4-7, 2026

  2. [2]

    Is grep all you need? how agent harnesses reshape agentic search.arXiv preprint arXiv:2605.15184, 2026

    Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, and Vamse Kumar Subbiah. Is grep all you need? how agent harnesses reshape agentic search.arXiv preprint arXiv:2605.15184, 2026. URL https://arxiv.org/abs/2605. 15184

  3. [3]

    SWE- Bench+: Enhanced coding benchmark for LLMs.arXiv preprint arXiv:2410.06992, 2024

    Reem Aleithan, Haoran Xue, Mohammad Mahdi Mohajer, Elijah Nnorom, Gias Uddin, and Song Wang. SWE- Bench+: Enhanced coding benchmark for LLMs.arXiv preprint arXiv:2410.06992, 2024. URL https://arxiv.org/ abs/2410.06992

  4. [4]

    The SWE-Bench illusion: When state-of- the-art LLMs remember instead of reason.arXiv preprint arXiv:2506.12286, 2025

    Shanchao Liang, Spandan Garg, and Roshanak Zilouch- ian Moghaddam. The SWE-Bench illusion: When state-of- the-art LLMs remember instead of reason.arXiv preprint arXiv:2506.12286, 2025. URL https://arxiv.org/abs/2506. 12286

  5. [5]

    Saving SWE-Bench: A benchmark mutation ap- proach for realistic agent evaluation.arXiv preprint arXiv:2510.08996, 2025

    Spandan Garg, Benjamin Steenhoek, and Yufan Huang. Saving SWE-Bench: A benchmark mutation ap- proach for realistic agent evaluation.arXiv preprint arXiv:2510.08996, 2025. URL https://arxiv.org/abs/2510. 08996

  6. [6]

    SuperCoder: An autonomous AI coding-agent harness

    SuperAGI Research and SuperCoder Con- tributors. SuperCoder: An autonomous AI coding-agent harness. GitHub repository, https://github.com/TransformerOptimus/SuperCoder, 2024

  7. [7]

    opencode: The AI coding agent built for the terminal

    SST and opencode Contributors. opencode: The AI coding agent built for the terminal. GitHub repository, https: //github.com/sst/opencode, 2025

  8. [8]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. URL https://arxiv. org/abs/2405.15793

  9. [9]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, et al

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, et al. OpenHands: An open platform for AI software developers as generalist agents. InInternational Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2407.16741

  10. [10]

    Aider: AI pair programming in your terminal

    Paul Gauthier and Aider Contributors. Aider: AI pair programming in your terminal. GitHub repository, https: //github.com/Aider-AI/aider, 2026

  11. [11]

    AutoCodeRover: Autonomous program improvement

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2024. URL https://arxiv.org/abs/2404.05427

  12. [12]

    RepoCoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. URL https://arxiv. org/abs/2303.12570

  13. [13]

    Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, D. C. Vageesh, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B. Ashok, and Shashank Shet. CodePlan: Repository-level coding using LLMs and planning.arXiv preprint arXiv:2309.12499, 2023. URL https://arxiv.org/ abs/2309.12499. Published in FSE 2024

  14. [14]

    LocAgent: Graph-guided LLM agents for code localization

    Zhaoling Chen, Robert Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, and Xingyao Wang. LocAgent: Graph-guided LLM agents for code localization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. URL https://arxiv.org/abs/2503.09089

  15. [15]

    RepoGraph: Enhancing AI software engineering with repository-level code graph.arXiv preprint arXiv:2410.14684, 2024

    Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. RepoGraph: Enhancing AI software engineering with repository-level code graph.arXiv preprint arXiv:2410.14684, 2024. URL https://arxiv.org/ abs/2410.14684

  16. [16]

    Code graph model (CGM): A graph-integrated large language model for repository-level software engineering tasks

    Hongyuan Tao, Ying Zhang, Zhenhao Tang, Hongen Peng, Xukun Zhu, Bingchang Liu, Yingguang Yang, Ziyin Zhang, Zhaogui Xu, Haipeng Zhang, Linchao Zhu, Rui Wang, Hang Yu, Jianguo Li, and Peng Di. Code graph model (CGM): A graph-integrated large language model for repository-level software engineering tasks. InAdvances in Neural Information Processing Systems ...

  17. [17]

    Agentless: Demystifying LLM- based software engineering agents.arXiv preprint arXiv:2407.01489, 2024

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM- based software engineering agents.arXiv preprint arXiv:2407.01489, 2024. URL https://arxiv.org/abs/2407. 01489

  18. [19]

    URL https://arxiv.org/abs/2603.24631

  19. [20]

    SWE-Explore: Benchmarking how coding agents explore repositories

    Shaoqiu Zhang, Yuhang Wang, Jialiang Liang, Yuling Shi, Wenhao Zeng, Maoquan Wang, Shilin He, Ningyuan Xu, Siyu Ye, Kai Cai, and Xiaodong Gu. SWE-Explore: Benchmarking how coding agents explore repositories. arXiv preprint arXiv:2606.07297, 2026. URL https://arxiv. org/abs/2606.07297

  20. [21]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/ 2310.06770

  21. [22]

    Introducing SWE-bench veri- fied

    Neil Chowdhury, James Aung, Jun Shern Chan, and Oliver Jaffe. Introducing SWE-bench veri- fied. OpenAI blog post, https://openai.com/index/ introducing-swe-bench-verified/, 2024

  22. [23]

    SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents.arXiv preprint arXiv:2504.08703, 2025

    Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot. SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents.arXiv preprint arXiv:2504.08703, 2025. URL...

  23. [25]

    URL https://arxiv.org/abs/2509.16941

  24. [26]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.03629

  25. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2302.04761

  26. [28]

    On randomness in agentic evals.arXiv preprint arXiv:2602.07150, 2026

    Bjarni Haukur Bjarnason, Andre Silva, and Martin Mon- perrus. On randomness in agentic evals.arXiv preprint arXiv:2602.07150, 2026. URL https://arxiv.org/abs/2602. 07150