Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent

Adithyan Krishnan; Ishaan Bhola; Mukunda NS; Sravanth Kurmala

arxiv: 2606.22417 · v1 · pith:QEDH52KOnew · submitted 2026-06-21 · 💻 cs.AI


Pith reviewed 2026-06-26 10:57 UTC · model grok-4.3

classification 💻 cs.AI

keywords: coding agents, structural codebase index, code retrieval, SWE-bench, ablation study, localization, task resolution, cost per solve


The pith

Adding a structural codebase index improves localization and task resolution in a fixed coding-agent harness without raising cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether a structural codebase index alters cost or success rates when retrieval methods already vary widely across coding agents. It holds the model and harness fixed while running three arms on SWE-PolyBench Verified and SWE-bench Pro: the index version, the same harness without the index, and an agentic-grep baseline. The within-harness comparison shows large localization gains and statistically separated resolve gains, with no cost penalty per cell and lower cost per solved task. The cross-harness check confirms the index does not regress against the grep baseline on either metric. The practical question therefore shifts from expense to whether the workload contains multi-file edits where structural ranking helps.

Core claim

Within a fixed coding-agent harness on a fixed model, the structural codebase index produces a large localization gain and a statistically separated resolve gain, with no cost penalty per cell and lower cost per solve; it also matches or exceeds an agentic-grep comparator on resolve and localization at no cost penalty.

What carries the argument

The structural codebase index, which supplies ranked retrieval over the repository based on code structure rather than surface text or simple search.

If this is right

The index yields large localization gains inside the fixed harness.
Resolve gains reach statistical separation from the no-index arm.
Cost per cell stays flat while cost per solved task drops.
Performance does not regress against an agentic-grep baseline on resolve or localization.
The index becomes relevant precisely when workloads include multi-file changes that benefit from structural ranking.
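
The two cost measures in the third point are related by simple arithmetic. The sketch below assumes that a "cell" is one task run under one seed and that cost per solve is total spend divided by the number of resolved cells; the page does not spell out either definition, and every number here is made up for illustration rather than taken from the paper.

```python
def cost_per_cell(costs):
    """Mean spend per run (assumed: one task under one seed)."""
    return sum(costs) / len(costs)


def cost_per_solve(costs, resolved):
    """Total spend divided by the number of resolved runs."""
    solved = sum(resolved)
    if solved == 0:
        raise ValueError("no resolved runs; cost per solve is undefined")
    return sum(costs) / solved


# Hypothetical arms: identical spend per run, different resolve counts.
no_index_costs = [1.0, 1.0, 1.0, 1.0]
no_index_resolved = [True, False, False, True]
index_costs = [1.0, 1.0, 1.0, 1.0]
index_resolved = [True, True, False, True]

print(cost_per_cell(no_index_costs), cost_per_solve(no_index_costs, no_index_resolved))
print(cost_per_cell(index_costs), cost_per_solve(index_costs, index_resolved))
```

With equal spend per run, the arm that resolves more tasks divides the same total over more solves, so cost per solve falls even though cost per cell is unchanged.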

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If gains concentrate on multi-file tasks, then agents handling large refactors or cross-module edits would capture most of the benefit.
Releasing the per-cell exclusion ledger and leak-audit script lets other groups test the same isolation on their own harnesses.
Repeating the three-arm comparison on different models or larger repositories would reveal how stable the cost and resolve advantages remain.

Load-bearing premise

The structural index is the only causal driver of the observed localization and resolve differences, and the chosen benchmarks plus sandbox controls isolate its effect from other harness variables.

What would settle it

An ablation that achieves the same localization and resolve rates as the index version by adjusting only non-structural retrieval parameters while keeping every other harness component identical would show the structural ranking is not required for the gains.

Figures

Figures reproduced from arXiv: 2606.22417 by Adithyan Krishnan, Ishaan Bhola, Mukunda NS, Sravanth Kurmala.

**Figure 2.** Cost–resolve plane (mean of seed means; error bars […]

**Figure 3.** First-gold rank CDF under View B, per arm. Each […]

**Figure 4.** View B acc@5 by gold-file count, mean of seed means […]
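
The captions refer to first-gold rank, acc@5, and a mean of seed means. The paper's exact definitions, including what "View B" covers, are not reproduced on this page, so the following is only a sketch under common conventions. It assumes first-gold rank is the 1-based position of the first gold file in an arm's ranked file list, acc@k is whether that rank is at most k, and the aggregate is the mean over seeds of each seed's per-task mean. All function names, file paths, and scores are hypothetical.

```python
from statistics import mean


def first_gold_rank(ranked_files, gold_files):
    """1-based rank of the first gold file in the ranking, or None if absent."""
    gold = set(gold_files)
    for position, path in enumerate(ranked_files, start=1):
        if path in gold:
            return position
    return None


def acc_at_k(ranked_files, gold_files, k=5):
    """1.0 if any gold file appears in the top k, else 0.0 (assumed definition)."""
    rank = first_gold_rank(ranked_files, gold_files)
    return 1.0 if rank is not None and rank <= k else 0.0


def mean_of_seed_means(per_seed_scores):
    """Average each seed's per-task scores, then average across seeds."""
    return mean(mean(scores) for scores in per_seed_scores.values())


# Hypothetical task with two gold files; the first one is ranked third.
ranking = ["a.py", "b.py", "core/index.py", "d.py", "e.py", "core/rank.py"]
gold = ["core/index.py", "core/rank.py"]
print(first_gold_rank(ranking, gold))
print(acc_at_k(ranking, gold, k=5))

# Hypothetical per-seed acc@5 values over the same four tasks.
scores = {
    "seed0": [1.0, 0.0, 1.0, 1.0],
    "seed1": [1.0, 1.0, 1.0, 1.0],
    "seed2": [0.0, 0.0, 1.0, 1.0],
}
print(mean_of_seed_means(scores))
```

Averaging within each seed first gives every seed equal weight, which matters if exclusions leave seeds with different numbers of usable cells.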

Original abstract

Coding agents now interleave LLMs with retrieval over the working repository, and retrieval implementations vary widely across deployed harnesses. Inside a fixed coding-agent harness on a fixed model, does adding a structural codebase index actually change cost or resolve? We ran three arms (the harness with the index, the same harness without it, and an agentic-grep comparator) on SWE-PolyBench Verified and SWE-bench Pro with Claude Opus 4.7 held fixed throughout, across three seeds, inside a leak-audited per-task sandbox. The within-harness ablation produces a large localization gain and a statistically separated resolve gain, with no cost penalty per cell and lower cost per solve. The cross-harness check shows that the index does not regress against an agentic-grep baseline on resolve or localization, again at no cost penalty. We release the per-cell exclusion ledger, the leak-audit script, the localization extractor, and the results database. The deployment question for a structural codebase index is thus not whether it is too expensive to run (across seeds, the index lands at a lower $/solved than agentic grep) but whether the workload includes multi-file changes where structural ranking pays off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The within-harness ablation shows localization and resolve gains from the structural index at no cost penalty, but the methods need to confirm nothing else changed when the index was turned off.


The paper's core finding is that, inside one fixed coding-agent harness on SWE-PolyBench Verified and SWE-bench Pro, adding the structural index lifts localization and produces a statistically separated resolve improvement versus the no-index version of the same harness. Cost per cell stays flat and cost per solve drops. The agentic-grep comparator shows the index version stays competitive on both metrics.

They ran the comparison with Claude Opus 4.7 fixed, three seeds, and a leak-audited sandbox. Releasing the per-cell exclusion ledger, the audit script, the localization extractor, and the results database is useful; it lets others check the numbers directly.

The main uncertainty is whether the no-index arm truly kept every other harness component identical. The abstract describes it as the same harness without the index, but if retrieval logic, context assembly, or ranking shifted even slightly when the index was disabled, that could drive the observed differences. The stress-test note correctly flags this, and the abstract alone does not give the implementation details needed to rule it out. The cross-harness grep arm does not address the isolation question.

This is for readers who build or tune retrieval inside coding agents and want controlled measurements on current SWE-style tasks. It does not introduce new theory or first-principles results.

I would send it to peer review. The released artifacts and the direct within-harness comparison give referees something concrete to evaluate, even if the isolation claim needs tighter documentation in revision.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates whether adding a structural codebase index to a fixed coding-agent harness changes cost or resolve rates. It runs three arms (harness with index, same harness without index, and agentic-grep comparator) on SWE-PolyBench Verified and SWE-bench Pro using Claude Opus 4.7 across three seeds in a leak-audited sandbox. The central claim is that the within-harness ablation yields large localization gains and statistically separated resolve gains with no per-cell cost penalty and lower cost per solve; the cross-harness check shows no regression versus agentic-grep. The paper releases the per-cell exclusion ledger, leak-audit script, localization extractor, and results database.

Significance. If the ablation results hold under the stated controls, the work would supply direct empirical evidence that structural indices can improve localization and resolution on multi-file tasks without raising per-cell cost, thereby reframing the deployment question around workload characteristics rather than raw expense. The open release of the exclusion ledger, audit script, and results database is a concrete strength that supports reproducibility and independent verification.

major comments (3)

[Methods (ablation design)] The description of the 'same harness without it' arm asserts that the harness and sandbox are fixed but does not explicitly confirm that retrieval logic, context assembly, prompting, and ranking remain bitwise identical when the index is disabled; any incidental difference would undermine the claim that the structural index is the sole causal driver of the reported localization and resolve gains.
[Results (statistical claims)] The abstract states a 'statistically separated resolve gain' without naming the test statistic, sample sizes per arm, exact p-value threshold, or correction for multiple comparisons across benchmarks and seeds; this detail is load-bearing for the central claim of separation.
[Data handling] Although the per-cell exclusion ledger is released, the methods section must specify whether exclusion criteria were pre-registered or determined after inspecting outcomes, because post-hoc choices could affect the measured gains on localization and resolve.
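
To make the statistical comment concrete: the page does not say which test the paper uses, so the sketch below shows only one common option for paired binary outcomes, an exact McNemar test on the tasks where the two arms disagree. It is not confirmed to be the authors' method, and the function names and outcome vectors are hypothetical.

```python
from math import comb


def discordant_counts(with_index, without_index):
    """Count tasks where exactly one arm resolved, given paired booleans."""
    index_only = sum(a and not b for a, b in zip(with_index, without_index))
    no_index_only = sum(b and not a for a, b in zip(with_index, without_index))
    return index_only, no_index_only


def exact_mcnemar_p(index_only, no_index_only):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = index_only + no_index_only
    if n == 0:
        return 1.0
    k = min(index_only, no_index_only)
    # Under the null, discordant outcomes split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired resolve outcomes for ten tasks under one seed.
with_index = [True, True, True, True, True, True, True, True, False, False]
without_index = [False, False, False, False, False, False, True, True, True, False]

index_only, no_index_only = discordant_counts(with_index, without_index)
print(index_only, no_index_only)
print(exact_mcnemar_p(index_only, no_index_only))
```

A report along these lines would also need to state how the three seeds and two benchmarks are combined, and whether the threshold is adjusted for the resulting number of comparisons.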

minor comments (2)

[Abstract] 'Claude Opus 4.7' should be replaced by the precise model identifier used in the experiments.
[Figures and tables] Captions should explicitly list the benchmark names, number of tasks, and seed count so that each display is self-contained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the referee's detailed and constructive comments. We address each major comment point by point below, with commitments to revisions that strengthen the manuscript without altering its core claims.


Referee: The description of the 'same harness without it' arm asserts that the harness and sandbox are fixed but does not explicitly confirm that retrieval logic, context assembly, prompting, and ranking remain bitwise identical when the index is disabled; any incidental difference would undermine the claim that the structural index is the sole causal driver of the reported localization and resolve gains.

Authors: We agree that explicit confirmation is required to support the causal attribution. The harness code implements the structural index as a self-contained optional module; toggling it off leaves every other component (retrieval logic, context assembly, prompting templates, and ranking) unchanged at the source level. In the revised manuscript we will add a dedicated sentence in the Methods section stating this bitwise identity explicitly.
revision: yes

Referee: The abstract states a 'statistically separated resolve gain' without naming the test statistic, sample sizes per arm, exact p-value threshold, or correction for multiple comparisons across benchmarks and seeds; this detail is load-bearing for the central claim of separation.

Authors: The statistical details are already present in the Results section and the released database, but the abstract should be self-contained. We will revise the abstract to name the test statistic, state the per-arm sample sizes (three seeds), report the p-value threshold, and describe the multiple-comparison approach.
revision: yes

Referee: Although the per-cell exclusion ledger is released, the methods section must specify whether exclusion criteria were pre-registered or determined after inspecting outcomes, because post-hoc choices could affect the measured gains on localization and resolve.

Authors: We will add an explicit statement to the Methods section indicating whether the exclusion criteria were pre-registered or determined post-inspection, together with the rationale and a pointer to the released ledger.
revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical ablation on external benchmarks


The paper reports measured outcomes from three fixed-harness arms (with-index, without-index, agentic-grep) run on SWE-PolyBench Verified and SWE-bench Pro using a held-fixed model and sandbox. All claims rest on observed localization, resolve, and cost differences across seeds; no equations, fitted parameters, self-citations, or derivations are invoked to produce the results. The within-harness comparison is presented as a controlled measurement rather than a reduction to prior inputs or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the benchmarks and sandbox setup validly measure the index's contribution; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: SWE-PolyBench Verified and SWE-bench Pro are representative of the multi-file coding tasks where structural ranking would matter.
The paper uses these benchmarks as the evaluation substrate for the ablation.
domain assumption: The structural index implementation and the within-harness comparison isolate the index effect from other variables.
This premise underpins the claim that observed gains are attributable to the index.

pith-pipeline@v0.9.1-grok · 5752 in / 1321 out tokens · 29412 ms · 2026-06-26T10:57:07.973774+00:00 · methodology


