pith. sign in

arxiv: 2606.07297 · v1 · pith:27HZ6HPCnew · submitted 2026-06-05 · 💻 cs.SE · cs.CL

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Pith reviewed 2026-06-27 21:20 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords coding agentsrepository explorationbug localizationSWE-benchagentic explorationcode retrievalsoftware engineering benchmarksline-level evaluation
0
0 comments X

The pith

Agentic explorers outperform classical retrieval when locating relevant code regions for repository issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SWE-Explore, a benchmark that isolates the repository exploration step in coding agents by asking them to return a ranked list of relevant code regions within a fixed line budget. Ground truth for each of the 848 issues is derived from the specific code regions consulted in independent successful repair trajectories. The work evaluates methods along coverage, ranking quality, and context efficiency, and reports that these scores correlate with whether the same agents later succeed at actual bug repair. Agentic approaches sit in a clear tier above classical retrieval techniques, while file-level localization is already robust and line-level plus efficiency metrics separate the strongest performers.

Core claim

SWE-Explore requires an explorer to produce a ranked list of relevant code regions for a given issue under a fixed line budget. Line-level ground truth is distilled from trajectories of agents that independently solved the same issue. Across retrieval baselines, general coding agents, and specialized localizers, agentic explorers achieve higher coverage and better ranking; the three evaluation dimensions strongly track downstream repair success. File-level localization is already strong for modern methods, but line-level coverage and context-efficient ranking remain the axes that differentiate state-of-the-art explorers.

What carries the argument

SWE-Explore benchmark that scores ranked retrieval of code regions under a fixed line budget using ground truth extracted from successful agent trajectories.

If this is right

  • Higher scores on coverage, ranking, and context-efficiency metrics predict higher rates of successful bug repair.
  • Agentic explorers form a distinct performance tier above classical retrieval methods.
  • File-level localization is already strong while line-level coverage and efficient ranking remain the differentiating factors.
  • The benchmark spans 10 languages and 203 repositories, supporting broad comparison of exploration strategies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separating exploration evaluation from full repair allows targeted iteration on retrieval components before full pipeline testing.
  • Ground truth drawn only from successful trajectories may systematically favor exploration styles already known to work.
  • The fixed-budget format could be reused to study context-length trade-offs when scaling agents to larger repositories.

Load-bearing premise

That the code regions visited in successful independent trajectories accurately represent the minimal relevant set an explorer should retrieve for each issue.

What would settle it

A method that scores highest on SWE-Explore coverage, ranking, and efficiency but produces no measurable increase in end-to-end repair rates when plugged into a full agent pipeline.

Figures

Figures reproduced from arXiv: 2606.07297 by Jialiang Liang, Kai Cai, Maoquan Wang, Ningyuan Xu, Shaoqiu Zhang, Shilin He, Siyu Ye, Wenhao Zeng, Xiaodong Gu, Yuhang Wang, Yuling Shi.

Figure 1
Figure 1. Figure 1: Motivation of SWE-Explore. A holistic metric of resolution rate conflates exploration, [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of SWE-Explore. From solution-verified trajectories, SWE-Explore extracts read [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Language distribution of the 848 retained SWE-Explore instances across 10 different coding languages [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Example of a SWE-Explore instance. Left: an issue plus a repo snapshot with the highlighted core span. Right: trajectory-derived core regions Ccore (scoring target) and optional regions Copt, an explorer’s ranked prediction scored against Ccore. Rcore refer to this refined and audited ground truth, and Rcore is the only scoring target in the main experiments [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Resolve rate as the visible context degrades from the Oracle’s full core set [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SWE-Explore, a benchmark isolating repository exploration for coding agents. Given an issue and repo, explorers return a ranked list of relevant code regions under a fixed line budget. Ground truth is line-level regions distilled from independent successful agent trajectories that solved the issue. The benchmark spans 848 issues across 10 languages and 203 repositories. Evaluation uses coverage, ranking, and context-efficiency metrics, which are shown to track downstream repair performance. Results indicate agentic explorers form a clear tier above classical retrieval methods, with file-level localization already strong but line-level coverage and efficient ranking as key differentiators.

Significance. If the evaluation protocol is free of bias, the work fills a gap by providing fine-grained metrics for exploration (a prerequisite for repair) rather than binary SWE-bench outcomes. The scale (multi-language, hundreds of repos) and the reported correlation between exploration metrics and repair success are useful contributions for guiding agent development. Credit is due for releasing a benchmark that separates exploration from end-to-end repair.

major comments (1)
  1. [Abstract and ground-truth construction section] Abstract and ground-truth construction section: line-level ground truth is obtained by distilling code regions consulted in independent successful agent trajectories. Because relevance is defined exactly by the paths taken by those trajectories, any explorer whose retrieval behavior correlates with the ground-truth-generating agents receives inflated coverage and ranking scores by construction. Classical retrieval methods that surface different but functionally necessary regions receive no credit. This directly undermines the headline claim that agentic explorers form a clear tier above classical retrieval and that the metrics strongly track repair behavior. The paper should either (a) provide an independent (e.g., human-annotated or cross-agent-class) ground-truth validation or (b) quantify how much the observed tiering changes when ground truth is replaced by an alternative source.
minor comments (2)
  1. [Methods] The abstract states that trajectories are 'independent,' but the manuscript should explicitly report the set of agents used to generate ground truth versus those evaluated, including any overlap in model families or prompting strategies.
  2. [Results] Table or figure reporting per-language or per-repository variance in the agentic-vs-classical gap would strengthen the 'clear tier' claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting this important methodological consideration regarding ground-truth construction. We respond to the concern below and indicate planned revisions.

read point-by-point responses
  1. Referee: Abstract and ground-truth construction section: line-level ground truth is obtained by distilling code regions consulted in independent successful agent trajectories. Because relevance is defined exactly by the paths taken by those trajectories, any explorer whose retrieval behavior correlates with the ground-truth-generating agents receives inflated coverage and ranking scores by construction. Classical retrieval methods that surface different but functionally necessary regions receive no credit. This directly undermines the headline claim that agentic explorers form a clear tier above classical retrieval and that the metrics strongly track repair behavior. The paper should either (a) provide an independent (e.g., human-annotated or cross-agent-class) ground-truth validation or (b) quantify how much the observed tiering changes when ground truth is replaced by an alternative source.

    Authors: The ground truth is derived exclusively from trajectories of agents that independently succeeded in resolving the issues, capturing the precise regions consulted during effective repairs. This is intentional: the benchmark measures an explorer's ability to surface code that has been shown to enable task completion, rather than regions that might be relevant under other criteria. While we recognize that this construction can advantage explorers whose behavior aligns with the ground-truth agents, the manuscript already reports that the resulting coverage, ranking, and efficiency metrics correlate with downstream repair success across a wide range of methods, including classical retrieval baselines. This empirical correlation provides evidence that the metrics remain informative for practical agent development. We will revise the manuscript to add an explicit discussion paragraph in the ground-truth section addressing the design rationale, the potential for behavioral correlation, and the supporting repair-performance correlation as validation. We do not perform a full alternative ground-truth experiment, as that lies outside the current scope. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs line-level ground truth from independent successful agent trajectories (abstract) and evaluates a range of retrieval methods, general agents, and localizers against coverage/ranking/efficiency metrics that are reported to track repair. This construction uses an external success filter on separate trajectories rather than defining relevance via the evaluated explorers themselves or reducing any claimed result to a fitted parameter or self-citation. No equations, self-definitional loops, or load-bearing self-citations appear; the benchmark remains self-contained against its stated external anchor.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution is a new evaluation protocol rather than new parameters or entities. The main unstated premise is that successful prior trajectories constitute an unbiased and complete source of relevant code regions.

axioms (1)
  • domain assumption Successful agent trajectories on the same issue provide the authoritative set of relevant code regions for exploration evaluation.
    Abstract states ground truth is derived from such trajectories; if this set is incomplete or biased toward particular exploration styles, all downstream metrics are affected.

pith-pipeline@v0.9.1-grok · 5787 in / 1484 out tokens · 15484 ms · 2026-06-27T21:20:11.202965+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM Agents Can See Code Repositories

    cs.SE 2026-06 unverdicted novelty 7.0

    Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.

  2. Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent

    cs.AI 2026-06 unverdicted novelty 4.0

    Ablation study finds that a structural codebase index improves localization and resolve rates in coding agents on two SWE benchmarks without raising per-cell cost.

Reference graph

Works this paper leans on

37 extracted references · 10 canonical work pages · cited by 2 Pith papers · 4 internal anchors

  1. [1]

    Claude code: Ai-assisted coding in real-world codebases, 2025

    Anthropic. Claude code: Ai-assisted coding in real-world codebases, 2025. URL https: //claude.ai/code. Accessed: 2026-05

  2. [2]

    Swe- rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025

    Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe- rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URLhttps://arxiv.org/abs/2505.20411

  3. [3]

    Beyondswe: Can current code agent survive beyond single-repo bug fixing?, 2026

    Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, and Ji-Rong 10 Wen. Beyondswe: Can current code agent survive beyond single-repo bug fixing?, 2026. URL https://arxiv.org/abs/2603.03194

  4. [4]

    Swe-exp: Experience-driven software issue resolution.arXiv preprint arXiv:2507.23361, 2025

    Silin Chen, Shaoxin Lin, Yuling Shi, Heng Lian, Xiaodong Gu, Longfei Yun, Dong Chen, Lin Cao, Jiyang Liu, Nu Xia, et al. Swe-exp: Experience-driven software issue resolution.arXiv preprint arXiv:2507.23361, 2025

  5. [5]

    Prasanna, Arman Cohan, and Xingyao Wang

    Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor K. Prasanna, Arman Cohan, and Xingyao Wang. LocAgent: Graph-guided LLM agents for code localization.CoRR, abs/2503.09089, 2025. doi: 10.48550/arXiv.2503.09089

  6. [6]

    Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified, August 2024. URLhttps://openai.com/index/introducing-swe-bench-verified/. OpenAI milestone, ...

  7. [7]

    Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?, 2025

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-ho...

  8. [8]

    CodeSearchNet challenge: Evaluating the state of semantic code search, 2019

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search, 2019. URL https: //arxiv.org/abs/1909.09436

  9. [9]

    Cumulated gain-based evaluation of IR techniques

    Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002. doi: 10.1145/582415.582418

  10. [10]

    Issue localization via llm-driven iterative code graph searching, 2025

    Zhonghao Jiang, Xiaoxue Ren, Meng Yan, Wei Jiang, Yong Li, and Zhongxin Liu. Issue localization via llm-driven iterative code graph searching, 2025. URL https://arxiv.org/ abs/2503.22424

  11. [11]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=VTF8yNQM66

  12. [12]

    Swe-debate: Competitive multi-agent debate for software issue resolution

    Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. Swe-debate: Competitive multi-agent debate for software issue resolution. arXiv preprint arXiv:2507.23348, 2025

  13. [13]

    Barr, Federica Sarro, Zhaoyang Chu, and He Ye

    Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, and He Ye. ContextBench: A benchmark for context retrieval in coding agents.CoRR, abs/2602.05892, 2026. doi: 10.48550/arXiv.2602.05892

  14. [14]

    Repobench: Benchmarking repository- level code auto-completion systems

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository- level code auto-completion systems. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=pPjZIOuQuF

  15. [15]

    Training software engineering agents and verifiers with swe-gym, 2025

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025. URL https: //arxiv.org/abs/2412.21139

  16. [16]

    Swe-qa: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

    Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. Swe-qa: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025

  17. [17]

    The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009. doi: 10.1561/ 1500000019. 11

  18. [18]

    Term-weighting approaches in automatic text retrieval

    Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988. doi: 10.1016/0306-4573(88) 90021-0

  19. [19]

    From code to correctness: Closing the last mile of code generation with hierarchical debugging.arXiv preprint arXiv:2410.01215, 2024

    Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu. From code to correctness: Closing the last mile of code generation with hierarchical debugging.arXiv preprint arXiv:2410.01215, 2024

  20. [20]

    Longcodezip: Compress long context for code language models

    Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Longcodezip: Compress long context for code language models. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 141–153. IEEE, 2025

  21. [21]

    Between lines of code: Unraveling the distinct patterns of machine and human programmers

    Yuling Shi, Hongyu Zhang, Chengcheng Wan, and Xiaodong Gu. Between lines of code: Unraveling the distinct patterns of machine and human programmers. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 1628–1639. IEEE, 2025

  22. [22]

    CodeScout: Contextual Problem Statement Enhancement for Software Agents

    Manan Suri, Xiangci Li, Mehdi Shojaie, Songyang Han, Chao-Chun Hsu, Shweta Garg, Aniket Anand Deshmukh, and Varun Kumar. CodeScout: Contextual problem statement en- hancement for software agents.CoRR, abs/2603.05744, 2026. doi: 10.48550/arXiv.2603.05744

  23. [23]

    SWE-dev: Building software engineering agents with training and inference scaling

    Haoran Wang, Zhenyu Hou, Yao Wei, Jie Tang, and Yuxiao Dong. SWE-dev: Building software engineering agents with training and inference scaling. InFindings of the Association for Computational Linguistics: ACL 2025, pages 3742–3761. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-acl.193. URL https://aclanthology. org/2025.f...

  24. [24]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

  25. [25]

    Context compression for llm agents: A survey of methods, failure modes, and evaluation

    Yifei Wang, Ziteng Wang, Yuling Shi, Silin Chen, Xinrui Wang, Yueqi Wang, Beijun Shen, Linjing Li, Xiaodong Gu, Julian McAuley, et al. Context compression for llm agents: A survey of methods, failure modes, and evaluation. 2026

  26. [26]

    SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

    Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, and Xiaodong Gu. SWE-pruner: Self-adaptive context pruning for coding agents. CoRR, abs/2601.16746, 2026. doi: 10.48550/arXiv.2601.16746

  27. [27]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents.CoRR, abs/2407.01489, 2024. doi: 10.48550/arXiv. 2407.01489

  28. [28]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems, 2024. URL https: //openreview.net/forum?id=mXpq6ut8J3

  29. [29]

    SWE-bench multimodal: Do AI systems generalize to visual software domains? InThe Thirteenth International Conference on Learning Representations, 2025

    John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. SWE-bench multimodal: Do AI systems generalize to visual software domains? InThe Thirteenth International Conference on Learning Representations, 2025. URL https: //ope...

  30. [30]

    Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025. URLhttps://arxiv.org/abs/2504.21798

  31. [31]

    OrcaLoca: An LLM agent framework for software issue localization

    Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, and Jishen Zhao. OrcaLoca: An LLM agent framework for software issue localization. InInterna- tional Conference on Machine Learning, 2025. URL https://openreview.net/forum? id=LyUfPOvM6I. 12

  32. [32]

    Multi-swe-bench: A multilingual benchmark for issue resolving, 2025

    Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving, 2025. URLhttps://arxiv.org/abs/2504.02605

  33. [33]

    Swe-bench goes live!, 2025

    Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live!, 2025. URL https: //arxiv.org/abs/2505.23419

  34. [34]

    AutoCodeRover: Au- tonomous program improvement

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Au- tonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024. doi: 10.1145/3650212.3680384

  35. [35]

    MULocBench: A benchmark for localizing code and non-code issues in software projects, 2025

    Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, and Guoan Zhang. MULocBench: A benchmark for localizing code and non-code issues in software projects, 2025. URLhttps://arxiv.org/abs/2509.25242

  36. [36]

    Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports

    Jian Zhou, Hongyu Zhang, and David Lo. Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. InProceedings of the 34th International Conference on Software Engineering, pages 14–24. IEEE, 2012. doi: 10.1109/ ICSE.2012.6227210

  37. [37]

    SWE Context Bench: A Benchmark for Context Learning in Coding

    Jared Zhu, Minhao Hu, and Junde Wu. SWE context bench: A benchmark for context learning in coding.CoRR, abs/2602.08316, 2026. doi: 10.48550/arXiv.2602.08316. A Dataset Details This appendix supplements §3.2. The main paper reports the aggregate benchmark statistics; here we specify the retained instance set, record schema, and repository-snapshot assumpti...