SweRank: Software Issue Localization with Code Ranking

Caiming Xiong; Heng Ji; JaeHyeok Doo; Revanth Gangi Reddy; Semih Yavuz; Shafiq Joty; Tarun Suresh; Xuan Phi Nguyen; Ye Liu; Yingbo Zhou

arxiv: 2505.07849 · v2 · submitted 2025-05-07 · cs.SE · cs.AI · cs.IR


Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty


Pith reviewed 2026-05-22 15:45 UTC · model grok-4.3

classification cs.SE · cs.AI · cs.IR

keywords software issue localization · code ranking · retrieve and rerank · SWE-Bench · GitHub issues · bug localization · LLM agents


The pith

SweRank uses a retrieve-and-rerank framework trained on real GitHub issues to localize relevant code more accurately than prior models or costly LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SweRank as a practical way to identify the files or functions tied to a natural-language bug report or feature request. It builds SweLoc, a large collection of real GitHub issue descriptions matched to the code changes that fixed them, then trains a ranking model on this data. The resulting system handles the detailed, failure-oriented language of issue reports better than standard code search tools. On SWE-Bench-Lite and LocBench it reaches higher accuracy than both earlier ranking approaches and multi-step agent systems that rely on closed-source models such as Claude-3.5. Readers would care because the method cuts the time, money, and complexity currently required to turn an issue description into the exact code locations that need attention.

Core claim

SweRank is a retrieve-and-rerank framework that first narrows candidate code locations and then refines their order using a model trained on the SweLoc dataset of paired issue descriptions and code modifications; when evaluated on SWE-Bench-Lite and LocBench this pipeline delivers state-of-the-art accuracy while avoiding the latency and expense of agent-based systems that use closed-source LLMs.

What carries the argument

The retrieve-and-rerank pipeline that trains an initial retriever and a subsequent reranker on SweLoc pairs of real-world issue text and the code edits that resolved them.
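The two-stage shape of such a pipeline can be sketched in a few lines of Python. This is only an illustration of the retrieve-then-rerank pattern, not the authors' models: token overlap stands in for a learned retriever, cosine similarity over term counts stands in for a learned reranker, and the `corpus` and `issue` values are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (this also splits snake_case)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def retrieve(issue, corpus, k):
    """Stage 1: score every candidate cheaply and keep the top k.

    Token overlap is an assumed stand-in for a trained embedding retriever.
    """
    query = set(tokenize(issue))
    scored = [(len(query & set(tokenize(code))), name) for name, code in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def rerank(issue, corpus, candidates):
    """Stage 2: reorder only the shortlist with a costlier score.

    Cosine similarity over term counts is an assumed stand-in for a trained reranker.
    """
    query = Counter(tokenize(issue))

    def cosine(name):
        doc = Counter(tokenize(corpus[name]))
        dot = sum(query[t] * doc[t] for t in query)
        norm = math.sqrt(sum(v * v for v in query.values())) * math.sqrt(
            sum(v * v for v in doc.values())
        )
        return dot / norm if norm else 0.0

    return sorted(candidates, key=lambda name: (-cosine(name), name))


# Hypothetical function-level corpus and issue text.
corpus = {
    "parse_date": "def parse_date(value): return datetime.strptime(value, '%Y-%m-%d')",
    "format_date": "def format_date(date): return date.isoformat()",
    "load_config": "def load_config(path): return json.load(open(path))",
}
issue = "parse_date crashes with ValueError when the value has a timezone suffix"

shortlist = retrieve(issue, corpus, k=2)
print(rerank(issue, corpus, shortlist))  # ['parse_date', 'format_date']
```

The point of the split is cost: the first stage touches every candidate and so must be cheap, while the second stage can afford a heavier model because it only sees the shortlist.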

If this is right

SweRank reduces latency and monetary cost relative to multi-step LLM agent workflows for the same localization task.
The SweLoc dataset can be reused to improve accuracy of other existing retrievers and rerankers on issue localization.
Traditional code-ranking models become competitive on verbose, failure-descriptive queries once fine-tuned on paired issue-and-fix data.
The performance advantage appears on both SWE-Bench-Lite and LocBench, indicating the approach is not tied to a single benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could embed the ranking step inside development environments to surface likely files as soon as an issue report arrives.
The same curation pattern might support related tasks such as patch generation by supplying better candidate locations to repair models.
If similar datasets can be built for other programming languages or domains, the efficiency benefit could extend beyond the current Java and Python focus.
The results hint that carefully filtered real-world data can substitute for complex reasoning chains in narrow software-engineering subtasks.

Load-bearing premise

The SweLoc dataset of GitHub issue descriptions paired with code modifications supplies training examples that generalize to the evaluation benchmarks used for testing.

What would settle it

Evaluate SweRank on a fresh collection of issues drawn from repositories absent from the SweLoc training data and check whether accuracy still exceeds both prior ranking baselines and agent-based systems.
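A minimal sketch of that check in Python, under stated assumptions: each example is a dict with hypothetical `repo`, `ranked`, and `gold` fields, and accuracy at k is taken to mean that every gold location appears in the top k. The text above does not name the metric, so that definition is an assumption.

```python
def held_out(eval_examples, train_repos):
    """Keep only evaluation examples whose repository never appears in training."""
    return [ex for ex in eval_examples if ex["repo"] not in train_repos]


def accuracy_at_k(ranked, gold, k):
    """Return 1.0 if every gold location is in the top k of the ranking, else 0.0."""
    return 1.0 if set(gold) <= set(ranked[:k]) else 0.0


def mean_accuracy_at_k(examples, k):
    """Average accuracy at k over a list of examples (0.0 for an empty list)."""
    if not examples:
        return 0.0
    return sum(accuracy_at_k(ex["ranked"], ex["gold"], k) for ex in examples) / len(examples)


# Hypothetical data: one training repository and three evaluation issues.
train_repos = {"org/alpha"}
eval_examples = [
    {"repo": "org/alpha", "ranked": ["a.py", "b.py"], "gold": ["a.py"]},
    {"repo": "org/beta", "ranked": ["x.py", "y.py", "z.py"], "gold": ["y.py"]},
    {"repo": "org/gamma", "ranked": ["p.py", "q.py"], "gold": ["q.py", "r.py"]},
]

fresh = held_out(eval_examples, train_repos)
print(len(fresh), mean_accuracy_at_k(fresh, k=2))  # 2 0.5
```

Running the same function over the ranked outputs of each compared system, restricted to `fresh`, would give the comparison the paragraph above asks for.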

Original abstract

Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and relying on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SweRank adapts ranking to issue localization with a useful new dataset but risks overstated results from possible benchmark overlap.


SweRank is basically a retrieve-and-rerank pipeline for finding which code to change based on a bug report or feature request, trained on a new dataset they call SweLoc.

The new part is SweLoc itself. They pulled real GitHub issues and paired the descriptions with the actual code diffs that fixed them. That gives a large training set for this specific task. The framework then uses that to train a ranker that handles the long, descriptive queries better than standard code search models. They show it improves several existing retrievers and rerankers, and on the benchmarks it beats both older ranking approaches and the slower, more expensive agent systems that use models like Claude.

What works here is the focus on efficiency. Agentic pipelines do a lot of back and forth with LLMs, which adds latency and cost. SweRank keeps it to retrieval plus reranking, which should be faster and cheaper while still getting good localization results.

The soft spot is the data split. The stress test points out that if SweLoc shares repositories or issue patterns with SWE-Bench-Lite or LocBench, the reported gains could come from memorizing repo-specific things rather than learning general issue-to-location mapping. The abstract does not mention any explicit hold-out or decontamination step, so that needs to be checked in the full paper. Without that, the comparison to agent systems that don't use this training data is not fully convincing. The experimental details are also light in the summary, so the SOTA claim is hard to evaluate fully right now.

This paper is for people working on automated software engineering tools, especially those interested in code search and bug localization. A practitioner or researcher looking for a lighter-weight alternative to full LLM agents would get something out of the dataset and the efficiency numbers.

It is worth a serious referee because the task matters and the approach is a reasonable application of ranking methods, even if the results need more scrutiny on the data side. I would send it to peer review, mainly to get the data overlap and experimental details sorted out.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SweRank, a retrieve-and-rerank framework for software issue localization. To support this, the authors construct SweLoc, a large-scale dataset of real-world GitHub issue descriptions paired with code modifications. They report that SweRank achieves state-of-the-art performance on the SWE-Bench-Lite and LocBench benchmarks, outperforming both prior ranking models and costly agent-based systems that rely on closed-source LLMs such as Claude-3.5.

Significance. If the reported performance gains are shown to result from genuine generalization rather than potential data overlap, the work would provide a valuable, efficient alternative to expensive LLM-based agents for localizing issues in software. Additionally, the SweLoc dataset could be a useful resource for the community to improve retrievers and rerankers in this domain.

major comments (1)

[SweLoc Dataset Construction] The paper does not report any explicit steps for checking or preventing overlap between the repositories and issues in SweLoc and those in the SWE-Bench-Lite or LocBench evaluation sets. Given that both are sourced from public GitHub, this omission leaves open the possibility of train-test leakage, which would undermine the central claim of outperforming zero-shot agent systems.

minor comments (1)

[Abstract] The abstract states that SweRank outperforms prior models but does not specify the exact metrics or the magnitude of improvements; consider adding quantitative highlights for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the SweLoc dataset construction below.


Referee: [SweLoc Dataset Construction] The paper does not report any explicit steps for checking or preventing overlap between the repositories and issues in SweLoc and those in the SWE-Bench-Lite or LocBench evaluation sets. Given that both are sourced from public GitHub, this omission leaves open the possibility of train-test leakage, which would undermine the central claim of outperforming zero-shot agent systems.

Authors: We agree that this is an important point and that the current manuscript does not report explicit overlap checks. SweLoc was curated from a wide range of public GitHub repositories to create a large-scale training resource, while the evaluation benchmarks focus on a smaller set of well-known repositories. To address the concern rigorously, we will revise the manuscript to add a new subsection under dataset construction that details the steps taken to prevent train-test leakage and to verify its absence. This will include repository-name matching, issue-description similarity checks (e.g., via normalized string comparison), and code-change diff comparison against SWE-Bench-Lite and LocBench. Any identified overlaps will be removed from SweLoc, and we will re-run the main experiments to confirm that performance remains consistent. We believe these additions will strengthen the evidence for genuine generalization.

revision: yes
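Two of the checks named in the response, repository-name matching and normalized string comparison of issue descriptions, could look roughly like the Python sketch below. The field names, the example records, and the 0.9 similarity threshold are assumptions made for illustration; the diff-comparison step is left out, and nothing here is taken from the authors' actual procedure.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse runs of whitespace so trivial edits do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_contaminated(train_ex, eval_examples, threshold=0.9):
    """Flag a training example that shares a repository with, or reads nearly the same as,
    any evaluation example. The threshold is an assumed value, not one from the paper."""
    issue = normalize(train_ex["issue"])
    for ev in eval_examples:
        if train_ex["repo"].lower() == ev["repo"].lower():
            return True
        if SequenceMatcher(None, issue, normalize(ev["issue"])).ratio() >= threshold:
            return True
    return False


def decontaminate(train, eval_examples, threshold=0.9):
    """Return the training examples that pass both checks."""
    return [ex for ex in train if not is_contaminated(ex, eval_examples, threshold)]


# Hypothetical records.
benchmark = [{"repo": "org/widgets", "issue": "Crash when parsing empty config file"}]
training = [
    {"repo": "Org/Widgets", "issue": "Tooltip is misaligned on hover"},
    {"repo": "fork/widgets-mirror", "issue": "crash when  parsing empty config file"},
    {"repo": "org/gadgets", "issue": "Add dark mode to the settings page"},
]

print([ex["repo"] for ex in decontaminate(training, benchmark)])  # ['org/gadgets']
```

The first training record is dropped on repository name alone and the second on text similarity, which is why both checks are needed: forks and mirrors carry the same issues under different names.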

Circularity Check

0 steps flagged

No significant circularity; standard empirical ML pipeline on external benchmarks


The paper constructs SweLoc from public GitHub issues to train a retrieve-and-rerank model and reports results on the independent SWE-Bench-Lite and LocBench benchmarks. No equations, self-definitional reductions, or load-bearing self-citations appear in the provided text; the central claim is an empirical performance comparison rather than a derivation that collapses to its own fitted inputs by construction. The setup follows conventional train-on-curated-data / evaluate-on-held-out-benchmark practice and remains self-contained against external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach rests on standard assumptions of supervised learning for retrieval and reranking models.

pith-pipeline@v0.9.0 · 5784 in / 1051 out tokens · 48112 ms · 2026-05-22T15:45:15.873826+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Neurosymbolic Repo-level Code Localization
cs.SE · 2026-04 · unverdicted · novelty 7.0

LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair
cs.SE · 2026-04 · unverdicted · novelty 6.0

GALA uses hierarchical graph alignment between UI screenshots and code structures to achieve state-of-the-art bug localization in multimodal automated program repair on SWE-bench.

Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
cs.AI · 2026-05 · unverdicted · novelty 5.0

RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.