SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Jialiang Liang; Kai Cai; Maoquan Wang; Ningyuan Xu; Shaoqiu Zhang; Shilin He; Siyu Ye; Wenhao Zeng; Xiaodong Gu; Yuhang Wang

arxiv: 2606.07297 · v1 · pith:27HZ6HPCnew · submitted 2026-06-05 · 💻 cs.SE · cs.CL

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Shaoqiu Zhang , Yuhang Wang , Jialiang Liang , Yuling Shi , Wenhao Zeng , Maoquan Wang , Shilin He , Ningyuan Xu

show 3 more authors

Siyu Ye Kai Cai Xiaodong Gu

This is my paper

Pith reviewed 2026-06-27 21:20 UTC · model grok-4.3

classification 💻 cs.SE cs.CL

keywords coding agentsrepository explorationbug localizationSWE-benchagentic explorationcode retrievalsoftware engineering benchmarksline-level evaluation

0 comments

The pith

Agentic explorers outperform classical retrieval when locating relevant code regions for repository issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SWE-Explore, a benchmark that isolates the repository exploration step in coding agents by asking them to return a ranked list of relevant code regions within a fixed line budget. Ground truth for each of the 848 issues is derived from the specific code regions consulted in independent successful repair trajectories. The work evaluates methods along coverage, ranking quality, and context efficiency, and reports that these scores correlate with whether the same agents later succeed at actual bug repair. Agentic approaches sit in a clear tier above classical retrieval techniques, while file-level localization is already robust and line-level plus efficiency metrics separate the strongest performers.

Core claim

SWE-Explore requires an explorer to produce a ranked list of relevant code regions for a given issue under a fixed line budget. Line-level ground truth is distilled from trajectories of agents that independently solved the same issue. Across retrieval baselines, general coding agents, and specialized localizers, agentic explorers achieve higher coverage and better ranking; the three evaluation dimensions strongly track downstream repair success. File-level localization is already strong for modern methods, but line-level coverage and context-efficient ranking remain the axes that differentiate state-of-the-art explorers.

What carries the argument

SWE-Explore benchmark that scores ranked retrieval of code regions under a fixed line budget using ground truth extracted from successful agent trajectories.

If this is right

Higher scores on coverage, ranking, and context-efficiency metrics predict higher rates of successful bug repair.
Agentic explorers form a distinct performance tier above classical retrieval methods.
File-level localization is already strong while line-level coverage and efficient ranking remain the differentiating factors.
The benchmark spans 10 languages and 203 repositories, supporting broad comparison of exploration strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Separating exploration evaluation from full repair allows targeted iteration on retrieval components before full pipeline testing.
Ground truth drawn only from successful trajectories may systematically favor exploration styles already known to work.
The fixed-budget format could be reused to study context-length trade-offs when scaling agents to larger repositories.

Load-bearing premise

That the code regions visited in successful independent trajectories accurately represent the minimal relevant set an explorer should retrieve for each issue.

What would settle it

A method that scores highest on SWE-Explore coverage, ranking, and efficiency but produces no measurable increase in end-to-end repair rates when plugged into a full agent pipeline.

Figures

Figures reproduced from arXiv: 2606.07297 by Jialiang Liang, Kai Cai, Maoquan Wang, Ningyuan Xu, Shaoqiu Zhang, Shilin He, Siyu Ye, Wenhao Zeng, Xiaodong Gu, Yuhang Wang, Yuling Shi.

**Figure 2.** Figure 2: Overview of SWE-Explore. From solution-verified trajectories, SWE-Explore extracts read [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Language distribution of the 848 retained SWE-Explore instances across 10 different coding languages [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Example of a SWE-Explore instance. Left: an issue plus a repo snapshot with the highlighted core span. Right: trajectory-derived core regions Ccore (scoring target) and optional regions Copt, an explorer’s ranked prediction scored against Ccore. Rcore refer to this refined and audited ground truth, and Rcore is the only scoring target in the main experiments [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Resolve rate as the visible context degrades from the Oracle’s full core set [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SWE-Explore gives a new benchmark for repository exploration with trajectory-derived line labels, but that ground truth choice creates a real risk of circularity that favors agentic methods.

read the letter

SWE-Explore isolates the exploration step that sits between an issue description and a patch. It asks an explorer to return a ranked list of relevant code regions under a line budget, then scores coverage, ranking, and context efficiency. The ground truth lines come from distilling what independent successful agent trajectories actually consulted for each of the 848 issues.

This is new. SWE-bench and similar work treat the whole task as binary success or failure. Standard retrieval benchmarks usually rely on human annotations or synthetic queries. Here the labels are automatic, tied to real fixes, and span 10 languages and 203 repositories. That scale and the focus on line-level rather than file-level is a clear step past existing setups.

The paper reports that these metrics correlate with downstream repair success and that agentic explorers sit clearly above classical retrieval methods. File-level localization is already strong across modern systems, while line-level coverage and efficient ranking are the differentiating factors.

The soft spot is the ground truth construction itself. Relevance is defined by the regions that successful agents visited. Any explorer whose behavior matches those trajectories will score higher by design, while classical methods that surface different but still necessary regions get no credit. The abstract claims the metrics track repair behavior, yet that correlation could be partly an artifact of how the labels were built rather than an independent signal of exploration quality. This is not a small methodological detail; it directly affects whether the reported tiering is measuring something objective.

The work is aimed at groups building or evaluating coding agents who need finer-grained diagnostics than end-to-end pass rates. It deserves peer review because the benchmark framing is practical and the data release could be useful, but referees will need to press on whether the evaluation protocol avoids rewarding similarity to the label-generating agents. The thinking is coherent on its own terms.

Referee Report

1 major / 2 minor

Summary. The paper introduces SWE-Explore, a benchmark isolating repository exploration for coding agents. Given an issue and repo, explorers return a ranked list of relevant code regions under a fixed line budget. Ground truth is line-level regions distilled from independent successful agent trajectories that solved the issue. The benchmark spans 848 issues across 10 languages and 203 repositories. Evaluation uses coverage, ranking, and context-efficiency metrics, which are shown to track downstream repair performance. Results indicate agentic explorers form a clear tier above classical retrieval methods, with file-level localization already strong but line-level coverage and efficient ranking as key differentiators.

Significance. If the evaluation protocol is free of bias, the work fills a gap by providing fine-grained metrics for exploration (a prerequisite for repair) rather than binary SWE-bench outcomes. The scale (multi-language, hundreds of repos) and the reported correlation between exploration metrics and repair success are useful contributions for guiding agent development. Credit is due for releasing a benchmark that separates exploration from end-to-end repair.

major comments (1)

[Abstract and ground-truth construction section] Abstract and ground-truth construction section: line-level ground truth is obtained by distilling code regions consulted in independent successful agent trajectories. Because relevance is defined exactly by the paths taken by those trajectories, any explorer whose retrieval behavior correlates with the ground-truth-generating agents receives inflated coverage and ranking scores by construction. Classical retrieval methods that surface different but functionally necessary regions receive no credit. This directly undermines the headline claim that agentic explorers form a clear tier above classical retrieval and that the metrics strongly track repair behavior. The paper should either (a) provide an independent (e.g., human-annotated or cross-agent-class) ground-truth validation or (b) quantify how much the observed tiering changes when ground truth is replaced by an alternative source.

minor comments (2)

[Methods] The abstract states that trajectories are 'independent,' but the manuscript should explicitly report the set of agents used to generate ground truth versus those evaluated, including any overlap in model families or prompting strategies.
[Results] Table or figure reporting per-language or per-repository variance in the agentic-vs-classical gap would strengthen the 'clear tier' claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting this important methodological consideration regarding ground-truth construction. We respond to the concern below and indicate planned revisions.

read point-by-point responses

Referee: Abstract and ground-truth construction section: line-level ground truth is obtained by distilling code regions consulted in independent successful agent trajectories. Because relevance is defined exactly by the paths taken by those trajectories, any explorer whose retrieval behavior correlates with the ground-truth-generating agents receives inflated coverage and ranking scores by construction. Classical retrieval methods that surface different but functionally necessary regions receive no credit. This directly undermines the headline claim that agentic explorers form a clear tier above classical retrieval and that the metrics strongly track repair behavior. The paper should either (a) provide an independent (e.g., human-annotated or cross-agent-class) ground-truth validation or (b) quantify how much the observed tiering changes when ground truth is replaced by an alternative source.

Authors: The ground truth is derived exclusively from trajectories of agents that independently succeeded in resolving the issues, capturing the precise regions consulted during effective repairs. This is intentional: the benchmark measures an explorer's ability to surface code that has been shown to enable task completion, rather than regions that might be relevant under other criteria. While we recognize that this construction can advantage explorers whose behavior aligns with the ground-truth agents, the manuscript already reports that the resulting coverage, ranking, and efficiency metrics correlate with downstream repair success across a wide range of methods, including classical retrieval baselines. This empirical correlation provides evidence that the metrics remain informative for practical agent development. We will revise the manuscript to add an explicit discussion paragraph in the ground-truth section addressing the design rationale, the potential for behavioral correlation, and the supporting repair-performance correlation as validation. We do not perform a full alternative ground-truth experiment, as that lies outside the current scope. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs line-level ground truth from independent successful agent trajectories (abstract) and evaluates a range of retrieval methods, general agents, and localizers against coverage/ranking/efficiency metrics that are reported to track repair. This construction uses an external success filter on separate trajectories rather than defining relevance via the evaluated explorers themselves or reducing any claimed result to a fitted parameter or self-citation. No equations, self-definitional loops, or load-bearing self-citations appear; the benchmark remains self-contained against its stated external anchor.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution is a new evaluation protocol rather than new parameters or entities. The main unstated premise is that successful prior trajectories constitute an unbiased and complete source of relevant code regions.

axioms (1)

domain assumption Successful agent trajectories on the same issue provide the authoritative set of relevant code regions for exploration evaluation.
Abstract states ground truth is derived from such trajectories; if this set is incomplete or biased toward particular exploration styles, all downstream metrics are affected.

pith-pipeline@v0.9.1-grok · 5787 in / 1484 out tokens · 15484 ms · 2026-06-27T21:20:11.202965+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM Agents Can See Code Repositories
cs.SE 2026-06 unverdicted novelty 7.0

Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.
Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent
cs.AI 2026-06 unverdicted novelty 4.0

Ablation study finds that a structural codebase index improves localization and resolve rates in coding agents on two SWE benchmarks without raising per-cell cost.

Reference graph

Works this paper leans on

37 extracted references · 10 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1]

Claude code: Ai-assisted coding in real-world codebases, 2025

Anthropic. Claude code: Ai-assisted coding in real-world codebases, 2025. URL https: //claude.ai/code. Accessed: 2026-05

2025
[2]

Swe- rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe- rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URLhttps://arxiv.org/abs/2505.20411

arXiv 2025
[3]

Beyondswe: Can current code agent survive beyond single-repo bug fixing?, 2026

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, and Ji-Rong 10 Wen. Beyondswe: Can current code agent survive beyond single-repo bug fixing?, 2026. URL https://arxiv.org/abs/2603.03194

Pith/arXiv arXiv 2026
[4]

Swe-exp: Experience-driven software issue resolution.arXiv preprint arXiv:2507.23361, 2025

Silin Chen, Shaoxin Lin, Yuling Shi, Heng Lian, Xiaodong Gu, Longfei Yun, Dong Chen, Lin Cao, Jiyang Liu, Nu Xia, et al. Swe-exp: Experience-driven software issue resolution.arXiv preprint arXiv:2507.23361, 2025

arXiv 2025
[5]

Prasanna, Arman Cohan, and Xingyao Wang

Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor K. Prasanna, Arman Cohan, and Xingyao Wang. LocAgent: Graph-guided LLM agents for code localization.CoRR, abs/2503.09089, 2025. doi: 10.48550/arXiv.2503.09089

work page doi:10.48550/arxiv.2503.09089 2025
[6]

Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified, August 2024. URLhttps://openai.com/index/introducing-swe-bench-verified/. OpenAI milestone, ...

2024
[7]

Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?, 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-ho...

Pith/arXiv arXiv 2025
[8]

CodeSearchNet challenge: Evaluating the state of semantic code search, 2019

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search, 2019. URL https: //arxiv.org/abs/1909.09436

Pith/arXiv arXiv 2019
[9]

Cumulated gain-based evaluation of IR techniques

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002. doi: 10.1145/582415.582418

work page doi:10.1145/582415.582418 2002
[10]

Issue localization via llm-driven iterative code graph searching, 2025

Zhonghao Jiang, Xiaoxue Ren, Meng Yan, Wei Jiang, Yong Li, and Zhongxin Liu. Issue localization via llm-driven iterative code graph searching, 2025. URL https://arxiv.org/ abs/2503.22424

arXiv 2025
[11]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=VTF8yNQM66

2024
[12]

Swe-debate: Competitive multi-agent debate for software issue resolution

Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. Swe-debate: Competitive multi-agent debate for software issue resolution. arXiv preprint arXiv:2507.23348, 2025

arXiv 2025
[13]

Barr, Federica Sarro, Zhaoyang Chu, and He Ye

Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, and He Ye. ContextBench: A benchmark for context retrieval in coding agents.CoRR, abs/2602.05892, 2026. doi: 10.48550/arXiv.2602.05892

work page doi:10.48550/arxiv.2602.05892 2026
[14]

Repobench: Benchmarking repository- level code auto-completion systems

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository- level code auto-completion systems. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=pPjZIOuQuF

2024
[15]

Training software engineering agents and verifiers with swe-gym, 2025

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025. URL https: //arxiv.org/abs/2412.21139

Pith/arXiv arXiv 2025
[16]

Swe-qa: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. Swe-qa: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

Pith/arXiv arXiv 2025
[17]

The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. doi: 10.1561/ 1500000019. 11

2009
[18]

Term-weighting approaches in automatic text retrieval

Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988. doi: 10.1016/0306-4573(88) 90021-0

work page doi:10.1016/0306-4573(88 1988
[19]

From code to correctness: Closing the last mile of code generation with hierarchical debugging.arXiv preprint arXiv:2410.01215, 2024

Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu. From code to correctness: Closing the last mile of code generation with hierarchical debugging.arXiv preprint arXiv:2410.01215, 2024

arXiv 2024
[20]

Longcodezip: Compress long context for code language models

Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Longcodezip: Compress long context for code language models. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 141–153. IEEE, 2025

2025
[21]

Between lines of code: Unraveling the distinct patterns of machine and human programmers

Yuling Shi, Hongyu Zhang, Chengcheng Wan, and Xiaodong Gu. Between lines of code: Unraveling the distinct patterns of machine and human programmers. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 1628–1639. IEEE, 2025

2025
[22]

CodeScout: Contextual Problem Statement Enhancement for Software Agents

Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, and Varun Kumar. CodeScout: Contextual problem statement en- hancement for software agents.CoRR, abs/2603.05744, 2026. doi: 10.48550/arXiv.2603.05744

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.05744 2026
[23]

SWE-dev: Building software engineering agents with training and inference scaling

Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, and Yuxiao Dong. SWE-dev: Building software engineering agents with training and inference scaling. InFindings of the Association for Computational Linguistics: ACL 2025, pages 3742–3761. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-acl.193. URL https://aclanthology. org/2025.f...

work page doi:10.18653/v1/2025.findings-acl.193 2025
[24]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

Pith/arXiv arXiv 2024
[25]

Context compression for llm agents: A survey of methods, failure modes, and evaluation

Yifei Wang, Ziteng Wang, Yuling Shi, Silin Chen, Xinrui Wang, Yueqi Wang, Beijun Shen, Linjing Li, Xiaodong Gu, Julian McAuley, et al. Context compression for llm agents: A survey of methods, failure modes, and evaluation. 2026

2026
[26]

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, and Xiaodong Gu. SWE-pruner: Self-adaptive context pruning for coding agents. CoRR, abs/2601.16746, 2026. doi: 10.48550/arXiv.2601.16746

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.16746 2026
[27]

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents.CoRR, abs/2407.01489, 2024. doi: 10.48550/arXiv. 2407.01489

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024
[28]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=mXpq6ut8J3

2024
[29]

SWE-bench multimodal: Do AI systems generalize to visual software domains? InThe Thirteenth International Conference on Learning Representations, 2025

John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. SWE-bench multimodal: Do AI systems generalize to visual software domains? InThe Thirteenth International Conference on Learning Representations, 2025. URL https: //ope...

2025
[30]

Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025. URLhttps://arxiv.org/abs/2504.21798

Pith/arXiv arXiv 2025
[31]

OrcaLoca: An LLM agent framework for software issue localization

Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, and Jishen Zhao. OrcaLoca: An LLM agent framework for software issue localization. InInterna- tional Conference on Machine Learning, 2025. URL https://openreview.net/forum? id=LyUfPOvM6I. 12

2025
[32]

Multi-swe-bench: A multilingual benchmark for issue resolving, 2025

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving, 2025. URLhttps://arxiv.org/abs/2504.02605

Pith/arXiv arXiv 2025
[33]

Swe-bench goes live!, 2025

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live!, 2025. URL https: //arxiv.org/abs/2505.23419

arXiv 2025
[34]

AutoCodeRover: Au- tonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Au- tonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024. doi: 10.1145/3650212.3680384

work page doi:10.1145/3650212.3680384 2024
[35]

MULocBench: A benchmark for localizing code and non-code issues in software projects, 2025

Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, and Guoan Zhang. MULocBench: A benchmark for localizing code and non-code issues in software projects, 2025. URLhttps://arxiv.org/abs/2509.25242

arXiv 2025
[36]

Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports

Jian Zhou, Hongyu Zhang, and David Lo. Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. InProceedings of the 34th International Conference on Software Engineering, pages 14–24. IEEE, 2012. doi: 10.1109/ ICSE.2012.6227210

arXiv 2012
[37]

SWE Context Bench: A Benchmark for Context Learning in Coding

Jared Zhu, Minhao Hu, and Junde Wu. SWE context bench: A benchmark for context learning in coding.CoRR, abs/2602.08316, 2026. doi: 10.48550/arXiv.2602.08316. A Dataset Details This appendix supplements §3.2. The main paper reports the aggregate benchmark statistics; here we specify the retained instance set, record schema, and repository-snapshot assumpti...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.08316 2026

[1] [1]

Claude code: Ai-assisted coding in real-world codebases, 2025

Anthropic. Claude code: Ai-assisted coding in real-world codebases, 2025. URL https: //claude.ai/code. Accessed: 2026-05

2025

[2] [2]

Swe- rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe- rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URLhttps://arxiv.org/abs/2505.20411

arXiv 2025

[3] [3]

Beyondswe: Can current code agent survive beyond single-repo bug fixing?, 2026

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, and Ji-Rong 10 Wen. Beyondswe: Can current code agent survive beyond single-repo bug fixing?, 2026. URL https://arxiv.org/abs/2603.03194

Pith/arXiv arXiv 2026

[4] [4]

Swe-exp: Experience-driven software issue resolution.arXiv preprint arXiv:2507.23361, 2025

Silin Chen, Shaoxin Lin, Yuling Shi, Heng Lian, Xiaodong Gu, Longfei Yun, Dong Chen, Lin Cao, Jiyang Liu, Nu Xia, et al. Swe-exp: Experience-driven software issue resolution.arXiv preprint arXiv:2507.23361, 2025

arXiv 2025

[5] [5]

Prasanna, Arman Cohan, and Xingyao Wang

Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor K. Prasanna, Arman Cohan, and Xingyao Wang. LocAgent: Graph-guided LLM agents for code localization.CoRR, abs/2503.09089, 2025. doi: 10.48550/arXiv.2503.09089

work page doi:10.48550/arxiv.2503.09089 2025

[6] [6]

Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified, August 2024. URLhttps://openai.com/index/introducing-swe-bench-verified/. OpenAI milestone, ...

2024

[7] [7]

Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?, 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-ho...

Pith/arXiv arXiv 2025

[8] [8]

CodeSearchNet challenge: Evaluating the state of semantic code search, 2019

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search, 2019. URL https: //arxiv.org/abs/1909.09436

Pith/arXiv arXiv 2019

[9] [9]

Cumulated gain-based evaluation of IR techniques

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002. doi: 10.1145/582415.582418

work page doi:10.1145/582415.582418 2002

[10] [10]

Issue localization via llm-driven iterative code graph searching, 2025

Zhonghao Jiang, Xiaoxue Ren, Meng Yan, Wei Jiang, Yong Li, and Zhongxin Liu. Issue localization via llm-driven iterative code graph searching, 2025. URL https://arxiv.org/ abs/2503.22424

arXiv 2025

[11] [11]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=VTF8yNQM66

2024

[12] [12]

Swe-debate: Competitive multi-agent debate for software issue resolution

Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. Swe-debate: Competitive multi-agent debate for software issue resolution. arXiv preprint arXiv:2507.23348, 2025

arXiv 2025

[13] [13]

Barr, Federica Sarro, Zhaoyang Chu, and He Ye

Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, and He Ye. ContextBench: A benchmark for context retrieval in coding agents.CoRR, abs/2602.05892, 2026. doi: 10.48550/arXiv.2602.05892

work page doi:10.48550/arxiv.2602.05892 2026

[14] [14]

Repobench: Benchmarking repository- level code auto-completion systems

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository- level code auto-completion systems. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=pPjZIOuQuF

2024

[15] [15]

Training software engineering agents and verifiers with swe-gym, 2025

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025. URL https: //arxiv.org/abs/2412.21139

Pith/arXiv arXiv 2025

[16] [16]

Swe-qa: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. Swe-qa: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

Pith/arXiv arXiv 2025

[17] [17]

The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. doi: 10.1561/ 1500000019. 11

2009

[18] [18]

Term-weighting approaches in automatic text retrieval

Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988. doi: 10.1016/0306-4573(88) 90021-0

work page doi:10.1016/0306-4573(88 1988

[19] [19]

From code to correctness: Closing the last mile of code generation with hierarchical debugging.arXiv preprint arXiv:2410.01215, 2024

Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu. From code to correctness: Closing the last mile of code generation with hierarchical debugging.arXiv preprint arXiv:2410.01215, 2024

arXiv 2024

[20] [20]

Longcodezip: Compress long context for code language models

Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Longcodezip: Compress long context for code language models. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 141–153. IEEE, 2025

2025

[21] [21]

Between lines of code: Unraveling the distinct patterns of machine and human programmers

Yuling Shi, Hongyu Zhang, Chengcheng Wan, and Xiaodong Gu. Between lines of code: Unraveling the distinct patterns of machine and human programmers. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 1628–1639. IEEE, 2025

2025

[22] [22]

CodeScout: Contextual Problem Statement Enhancement for Software Agents

Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, and Varun Kumar. CodeScout: Contextual problem statement en- hancement for software agents.CoRR, abs/2603.05744, 2026. doi: 10.48550/arXiv.2603.05744

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.05744 2026

[23] [23]

SWE-dev: Building software engineering agents with training and inference scaling

Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, and Yuxiao Dong. SWE-dev: Building software engineering agents with training and inference scaling. InFindings of the Association for Computational Linguistics: ACL 2025, pages 3742–3761. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-acl.193. URL https://aclanthology. org/2025.f...

work page doi:10.18653/v1/2025.findings-acl.193 2025

[24] [24]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

Pith/arXiv arXiv 2024

[25] [25]

Context compression for llm agents: A survey of methods, failure modes, and evaluation

Yifei Wang, Ziteng Wang, Yuling Shi, Silin Chen, Xinrui Wang, Yueqi Wang, Beijun Shen, Linjing Li, Xiaodong Gu, Julian McAuley, et al. Context compression for llm agents: A survey of methods, failure modes, and evaluation. 2026

2026

[26] [26]

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, and Xiaodong Gu. SWE-pruner: Self-adaptive context pruning for coding agents. CoRR, abs/2601.16746, 2026. doi: 10.48550/arXiv.2601.16746

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.16746 2026

[27] [27]

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents.CoRR, abs/2407.01489, 2024. doi: 10.48550/arXiv. 2407.01489

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024

[28] [28]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=mXpq6ut8J3

2024

[29] [29]

SWE-bench multimodal: Do AI systems generalize to visual software domains? InThe Thirteenth International Conference on Learning Representations, 2025

John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. SWE-bench multimodal: Do AI systems generalize to visual software domains? InThe Thirteenth International Conference on Learning Representations, 2025. URL https: //ope...

2025

[30] [30]

Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025. URLhttps://arxiv.org/abs/2504.21798

Pith/arXiv arXiv 2025

[31] [31]

OrcaLoca: An LLM agent framework for software issue localization

Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, and Jishen Zhao. OrcaLoca: An LLM agent framework for software issue localization. InInterna- tional Conference on Machine Learning, 2025. URL https://openreview.net/forum? id=LyUfPOvM6I. 12

2025

[32] [32]

Multi-swe-bench: A multilingual benchmark for issue resolving, 2025

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving, 2025. URLhttps://arxiv.org/abs/2504.02605

Pith/arXiv arXiv 2025

[33] [33]

Swe-bench goes live!, 2025

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live!, 2025. URL https: //arxiv.org/abs/2505.23419

arXiv 2025

[34] [34]

AutoCodeRover: Au- tonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Au- tonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024. doi: 10.1145/3650212.3680384

work page doi:10.1145/3650212.3680384 2024

[35] [35]

MULocBench: A benchmark for localizing code and non-code issues in software projects, 2025

Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, and Guoan Zhang. MULocBench: A benchmark for localizing code and non-code issues in software projects, 2025. URLhttps://arxiv.org/abs/2509.25242

arXiv 2025

[36] [36]

Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports

Jian Zhou, Hongyu Zhang, and David Lo. Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. InProceedings of the 34th International Conference on Software Engineering, pages 14–24. IEEE, 2012. doi: 10.1109/ ICSE.2012.6227210

arXiv 2012

[37] [37]

SWE Context Bench: A Benchmark for Context Learning in Coding

Jared Zhu, Minhao Hu, and Junde Wu. SWE context bench: A benchmark for context learning in coding.CoRR, abs/2602.08316, 2026. doi: 10.48550/arXiv.2602.08316. A Dataset Details This appendix supplements §3.2. The main paper reports the aggregate benchmark statistics; here we specify the retained instance set, record schema, and repository-snapshot assumpti...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.08316 2026