A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates

Bach Le; Boyang Yang; Haoye Tian; Luyao Ren; Tegawend\'e F. Bissyand\'e; Xiangliang Hu; Yanjun Chen

arxiv: 2607.01597 · v1 · pith:NBOAOZEEnew · submitted 2026-07-02 · 💻 cs.SE

A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates

Boyang Yang , Xiangliang Hu , Luyao Ren , Yanjun Chen , Bach Le , Tegawend\'e F. Bissyand\'e , Haoye Tian This is my paper

Pith reviewed 2026-07-03 09:17 UTC · model grok-4.3

classification 💻 cs.SE

keywords automated program repairpatch fusionLLM coding agentscandidate selectionSWE-benchDefects4Jdeterministic repair

0 comments

The pith

Fusing repeated edit atoms across multiple candidate patches constructs correct repairs that no individual candidate achieves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that modern LLM coding agents produce pools of candidate patches where correct repair components are scattered, and that a deterministic process can fuse the shared edit atoms to select and build a final patch without consulting test outcomes or LLM judges. This closes the pass@k to pass@1 gap that arises when developers must apply only one patch in practice. The method first identifies a repair neighborhood from whole-diff agreement, picks an auditable representative, then applies evidence-constrained fusion to keep only the repeated atoms and drop unsupported changes. On fixed candidate pools from SWE-bench Verified, SWE-bench Multilingual, and Defects4J, the approach solves more bugs than any compared selector and recovers dozens of bugs that no single source solves.

Core claim

PatchFusion fuses whole-diff agreement into a repair neighborhood, selects an auditable representative, and applies evidence-constrained fusion (ECF) to retain repeated edit atoms while pruning unsupported parts, solving 426 out of 500 bugs on SWE-bench Verified, 236 out of 300 on SWE-bench Multilingual, and 87 out of 371 plausible patches on Defects4J while recovering 41 and 27 bugs that no single source solves.

What carries the argument

Evidence-constrained fusion (ECF), which retains only the edit atoms that appear repeatedly across candidates and removes any parts lacking such cross-candidate support.

Load-bearing premise

Repeated edit atoms across candidates reliably mark the correct repair components that can be safely fused without test outcomes or other external validation.

What would settle it

A controlled run on the same fixed candidate pools where fusing the repeated atoms produces a failing patch on a bug that at least one original candidate would have passed.

Figures

Figures reproduced from arXiv: 2607.01597 by Bach Le, Boyang Yang, Haoye Tian, Luyao Ren, Tegawend\'e F. Bissyand\'e, Xiangliang Hu, Yanjun Chen.

**Figure 1.** Figure 1: Motivating example (django__django-14315), where PatchFusion fuses repeated edit atoms into a resolved patch no candidate submitted. candidate is correct, the repair already present in fragments is lost. The correct local edits often recur across several otherwise-failing candidates, evidence that is free because it comes from the diffs already available, yet whole-diff scoring discards it. ② Effective met… view at source ↗

**Figure 2.** Figure 2: PatchFusion turns a fixed candidate pool into one auditable final patch in four deterministic steps. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Decision-stage cost versus repairs on SWE-bench Verified. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Fraction of candidate-reachable bugs solved versus pool size. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Modern LLM coding agents are commonly evaluated using pass@k, but developers typically apply a single final patch in real-world settings. This pass@k-to-pass@1 gap is a post-generation problem: a candidate patch pool may contain a correct patch, but the system must decide which one to suggest to developers. Existing post-generation approaches mainly rank whole candidates, filter them with tests, or query an LLM judge, but none deterministically reuse shared edit-atom evidence to both select and construct the final patch. Thus, we propose PatchFusion, a deterministic atomic evidence fusion approach for candidate patches that consults no test outcome at decision time. PatchFusion first fuses whole-diff agreement into a repair neighborhood, selects an auditable representative, and then applies evidence-constrained fusion (ECF) to retain repeated edit atoms and prune unsupported parts. To evaluate this setting, we build PatchFuseBench, a fixed-pool benchmark covering SWE-bench Verified, SWE-bench Multilingual, and Defects4J candidate patches. On PatchFuseBench, PatchFusion solves 426/500 bugs on SWE-bench Verified and 236/300 on SWE-bench Multilingual, and reaches 87/371 plausible patches on Defects4J, outperforming every matched candidate-pool selector on all three. PatchFusion recovers 41 and 27 bugs that no single source solves (30 and 18 more over the best single source). Ablation studies show that ECF adds +5/+6/+9 solved bugs by recovering in-pool repairs that selection misses, with no observed regression, and that PatchFusion's gains remain stable as candidate pools are resampled. On these complementary multi-source pools, cross-candidate evidence recovers more correct patches than the test-based and LLM-based selectors we evaluate, at orders-of-magnitude lower cost, reaching within 96.2% and 89.7% of the candidate-reachable ceiling on the two SWE-bench benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PatchFusion's deterministic edit-atom fusion beats selectors on SWE-bench and Defects4J pools and recovers some unique bugs, but the repetition-equals-correctness assumption lacks direct checks.

read the letter

PatchFusion's core idea is a deterministic process that first merges whole-diff agreement into a neighborhood, picks an auditable representative, then runs evidence-constrained fusion to keep only repeated edit atoms and drop unsupported ones. This runs without tests or LLM judges at decision time and is positioned as distinct from ranking, filtering, or judge baselines.

The paper builds a fixed-pool benchmark across three sources and reports clear numbers: 426/500 solved on SWE-bench Verified, 236/300 on Multilingual, and 87 plausible patches on Defects4J. It claims 41 and 27 bugs recovered that no single source solved, plus 30 and 18 more than the best single source. Ablations show the fusion step adding 5/6/9 bugs with no regressions, and results stay stable under resampling. Those are concrete, useful data points on the pass@k-to-pass@1 gap.

The soft spot is the load-bearing premise that repeated edit atoms reliably mark correct repair components. The method keeps them because they repeat, yet the paper gives no breakdown of how often those repeated atoms are actually wrong (for example, common symptom patches) or any independent check on the 41 recovered cases. Gains are measured post-test, so it remains unclear how much the fusion itself drives the edge versus pool quality.

This is aimed at researchers working on LLM repair agents and post-generation selection. The empirical setup and ablations are solid enough to merit a serious referee, even if the assumption would benefit from more direct evidence in revision.

I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes PatchFusion, a deterministic post-generation method for fusing candidate patches from LLM coding agents. It constructs a repair neighborhood from whole-diff agreement, selects an auditable representative, and applies evidence-constrained fusion (ECF) to retain repeated edit atoms while pruning unsupported parts, without consulting test outcomes at decision time. On the introduced PatchFuseBench, it reports solving 426/500 bugs on SWE-bench Verified, 236/300 on SWE-bench Multilingual, and 87/371 plausible patches on Defects4J, outperforming matched selectors and recovering 41 and 27 bugs unreachable by any single source, with ablations attributing +5/+6/+9 gains to ECF.

Significance. If the performance claims and the underlying assumption hold, the work offers a low-cost, deterministic alternative to ranking, test-filtering, or LLM judging for closing the pass@k-to-pass@1 gap in automated repair. The deterministic, evidence-based fusion and the recovery of bugs missed by all individual sources would be a notable contribution to the field, especially given the reported stability under pool resampling and the orders-of-magnitude cost advantage.

major comments (2)

[Evaluation / Ablation studies] The central performance advantage (recovery of 41/27 bugs unreachable by any single source, plus the +5/+6/+9 ECF gains) rests on the untested premise that repeated edit atoms across candidates reliably signal correctness rather than shared incorrect repairs. No section quantifies the rate at which repeated atoms are incorrect or provides provenance analysis for the 41 recovered cases; this assumption is load-bearing for attributing gains to ECF rather than pool quality.
[§4 (PatchFuseBench results)] The reported numbers (426/500, 236/300, 87/371) are measured after fusion using test outcomes, yet the method itself makes no use of tests or oracles at decision time. Without a direct analysis of cases where ECF produces a fused patch that passes tests coincidentally or fails silently due to repeated incorrect atoms, the claim that cross-candidate evidence outperforms test-based and LLM-based selectors remains incompletely supported.

minor comments (2)

[Abstract / Methods] The abstract and evaluation sections would benefit from explicit definitions or pseudocode for 'edit atom' and 'repair neighborhood' to make the ECF procedure fully reproducible from the text alone.
[Evaluation setup] Clarify how 'plausible patches' are defined on Defects4J versus 'solved bugs' on SWE-bench, and whether the same test-based oracle is used for all three benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below.

read point-by-point responses

Referee: [Evaluation / Ablation studies] The central performance advantage (recovery of 41/27 bugs unreachable by any single source, plus the +5/+6/+9 ECF gains) rests on the untested premise that repeated edit atoms across candidates reliably signal correctness rather than shared incorrect repairs. No section quantifies the rate at which repeated atoms are incorrect or provides provenance analysis for the 41 recovered cases; this assumption is load-bearing for attributing gains to ECF rather than pool quality.

Authors: We agree that quantifying the rate of incorrect repeated atoms and providing provenance for the recovered cases would strengthen attribution of gains specifically to ECF. The existing ablations demonstrate that ECF contributes additional solved bugs with no observed regressions, and the recovery of bugs missed by all individual sources offers supporting evidence. In revision we will add a sample-based manual provenance analysis of the recovered cases and report the observed rate of incorrect repeated atoms where identifiable from the available patches. revision: yes
Referee: [§4 (PatchFuseBench results)] The reported numbers (426/500, 236/300, 87/371) are measured after fusion using test outcomes, yet the method itself makes no use of tests or oracles at decision time. Without a direct analysis of cases where ECF produces a fused patch that passes tests coincidentally or fails silently due to repeated incorrect atoms, the claim that cross-candidate evidence outperforms test-based and LLM-based selectors remains incompletely supported.

Authors: The reported success counts are necessarily determined by test outcomes because that is the established ground truth for correctness on these benchmarks. The fusion decision itself remains strictly test-agnostic, which is the central methodological distinction. We will revise the presentation in §4 to separate the test-free decision procedure more explicitly from the post-hoc evaluation and will add discussion of possible coincidental correctness. A comprehensive analysis of silent failures due to repeated incorrect atoms would require oracles beyond those provided by the benchmarks and is therefore outside the scope of the current evaluation. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents PatchFusion as a deterministic, test-independent fusion of repeated edit atoms across candidate patches, with success measured post-hoc on fixed benchmarks (PatchFuseBench). No equations, parameters, or claims reduce by construction to fitted inputs, self-citations, or renamed empirical patterns. The central premise (repetition signals correctness) is an explicit modeling assumption evaluated empirically rather than derived tautologically. Ablations and comparisons to selectors are reported as independent measurements. This is the common honest case of a self-contained empirical method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5909 in / 1081 out tokens · 29893 ms · 2026-07-03T09:17:45.511340+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 8 canonical work pages · 1 internal anchor

[1]

Automated repair of programs from large language models,

Z. Fan, X. Gaoet al., “Automated repair of programs from large language models,” inProceedings of the 45th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2023, pp. 1469–1481

2023
[2]

Automated program repair in the era of large pre-trained language models,

C. S. Xia, Y . Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” inProceedings of the 45th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2023, pp. 1482–1494

2023
[3]

RepairAgent: An autonomous, LLM-based agent for program repair,

I. Bouzenia, P. T. Devanbu, and M. Pradel, “RepairAgent: An autonomous, LLM-based agent for program repair,” inProceedings of the 47th International Conference on Software Engineering, 2025, pp. 2188–2200

2025
[4]

MORepair: Teaching LLMs to repair code via multi-objective fine-tuning,

B. Yang, H. Tianet al., “MORepair: Teaching LLMs to repair code via multi-objective fine-tuning,”ACM Trans. Softw. Eng. Methodol., vol. 35, no. 2, pp. 38:1–38:38, 2026

2026
[5]

Modern code reviews in open-source projects: which problems do they fix?

M. Beller, A. Bacchelliet al., “Modern code reviews in open-source projects: which problems do they fix?” inProceedings of the 11th Working Conference on Mining Software Repositories. New York, NY , USA: ACM, 2014, pp. 202–211

2014
[6]

An empirical study of the impact of modern code review practices on software quality,

S. McIntosh, Y . Kameiet al., “An empirical study of the impact of modern code review practices on software quality,”Empir. Softw. Eng., vol. 21, no. 5, pp. 2146–2189, 2016

2016
[7]

Review4repair: Code review aided automatic program repairing,

F. Huq, M. Hasanet al., “Review4repair: Code review aided automatic program repairing,”Inf. Softw. Technol., vol. 143, p. 106765, 2022

2022
[8]

On the introduction of automatic program repair in bloomberg,

S. Kirbas, E. Windelset al., “On the introduction of automatic program repair in bloomberg,”IEEE Softw., vol. 38, no. 4, pp. 43–51, 2021

2021
[9]

CREF: An LLM-based conversational software repair framework for programming tutors,

B. Yang, H. Tianet al., “CREF: An LLM-based conversational software repair framework for programming tutors,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2024, pp. 882–894

2024
[10]

SWE-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yanget al., “SWE-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024

2024
[11]

Competition-level code generation with AlphaCode,

Y . Li, D. Choiet al., “Competition-level code generation with AlphaCode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022

2022
[12]

Preprint , year =

R. Ehrlich, B. Brownet al., “CodeMonkeys: Scaling test-time compute for software engineering,”arXiv preprint arXiv:2501.14723, 2025

work page arXiv 2025
[13]

Orwall, K. Zhang, Y. Xie, A. Goyal, and W. Wang, “Swe-search: Enhancing software agents with monte carlo tree search and iterative refinement,

A. Antoniades, A. ¨Orwallet al., “SWE-Search: Enhancing software agents with monte carlo tree search and iterative refinement,”arXiv preprint arXiv:2410.20285, 2024

work page arXiv 2024
[14]

An analysis of patch plausibility and correctness for generate-and-validate patch generation systems,

Z. Qi, F. Longet al., “An analysis of patch plausibility and correctness for generate-and-validate patch generation systems,” inProceedings of the 2015 International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2015, pp. 24–36

2015
[15]

The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic bench- marking,

J. Petke, M. Martinezet al., “The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic bench- marking,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. New York, NY , USA: ACM, 2024, pp. 452–456

2024
[16]

A survey of LLM-based automated program repair: Taxonomies, design paradigms, and applications,

B. Yang, Z. Caiet al., “A survey of LLM-based automated program repair: Taxonomies, design paradigms, and applications,”arXiv preprint arXiv:2506.23749, 2025

work page arXiv 2025
[17]

Demystifying LLM-based software engineering agents,

C. S. Xia, Y . Denget al., “Demystifying LLM-based software engineering agents,”Proc. ACM Softw. Eng., vol. 2, no. FSE, pp. 801–824, 2025

2025
[18]

Revisiting object similarity-based patch ranking in automated program repair: An extensive study,

A. Ghanbari, “Revisiting object similarity-based patch ranking in automated program repair: An extensive study,” inProceedings of the 3rd IEEE/ACM International Workshop on Automated Program Repair. Piscataway, NJ, USA: IEEE, 2022, pp. 16–23

2022
[19]

Rank aggregation methods for the web,

C. Dwork, R. Kumaret al., “Rank aggregation methods for the web,” in Proceedings of the 10th International Conference on World Wide Web. Hong Kong, China: ACM, 2001, pp. 613–622

2001
[20]

Patch correctness assessment in automated program repair based on the impact of patches on production and test code,

A. Ghanbari and A. Marcus, “Patch correctness assessment in automated program repair based on the impact of patches on production and test code,” inProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2022, pp. 654–665

2022
[21]

Predicting patch correctness based on the similarity of failing test cases,

H. Tian, Y . Liet al., “Predicting patch correctness based on the similarity of failing test cases,”ACM Trans. Softw. Eng. Methodol., vol. 31, no. 4, pp. 77:1–77:30, 2022

2022
[22]

Identifying patch correctness in test-based program repair,

Y . Xiong, X. Liuet al., “Identifying patch correctness in test-based program repair,” inProceedings of the 40th International Conference on Software Engineering. New York, NY , USA: ACM, 2018, pp. 789–799

2018
[23]

Automated patch assessment for program repair at scale,

H. Ye, M. Martinez, and M. Monperrus, “Automated patch assessment for program repair at scale,”Empir. Softw. Eng., vol. 26, no. 2, p. 20, 2021

2021
[24]

Leveraging large language model for automatic patch correctness assessment,

X. Zhou, B. Xuet al., “Leveraging large language model for automatic patch correctness assessment,”IEEE Trans. Software Eng., vol. 50, no. 11, pp. 2865–2883, 2024

2024
[25]

Self-consistency improves chain of thought reasoning in language models,

X. Wang, J. Weiet al., “Self-consistency improves chain of thought reasoning in language models,” inThe Eleventh International Conference on Learning Representations. OpenReview.net, 2023

2023
[26]

CodeT: Code generation with generated tests,

B. Chen, F. Zhanget al., “CodeT: Code generation with generated tests,” inThe Eleventh International Conference on Learning Representations. OpenReview.net, 2023

2023
[27]

Universal self-consistency for large language model generation.arXiv preprint arXiv:2311.17311, 2023

X. Chen, R. Aksitovet al., “Universal self-consistency for large language model generation,”arXiv preprint arXiv:2311.17311, 2023

work page arXiv 2023
[28]

SWE-bench official leaderboards,

SWE-bench Team, “SWE-bench official leaderboards,” https://www. swebench.com/, 2026

2026
[29]

Live-SWE-agent: Can software engineering agents self-evolve on the fly?

C. S. Xia, Z. Wanget al., “Live-SWE-agent: Can software engineering agents self-evolve on the fly?”arXiv preprint arXiv:2511.13646, 2025

work page arXiv 2025
[30]

Trae Agent: An LLM-based agent for general purpose software engineering tasks,

ByteDance, “Trae Agent: An LLM-based agent for general purpose software engineering tasks,” https://github.com/bytedance/trae-agent, 2025

2025
[31]

Rovo Dev: Agentic AI for software teams,

Atlassian, “Rovo Dev: Agentic AI for software teams,” https://www. atlassian.com/software/rovo-dev, 2025

2025
[32]

EPAM AI/Run Developer Agent: SWE-bench verified submission,

EPAM Systems, “EPAM AI/Run Developer Agent: SWE-bench verified submission,” https://github.com/SWE-bench/experiments, 2025

2025
[33]

ACoder: Achieving state-of-the-art performance on SWE- bench Verified,

ACoder Team, “ACoder: Achieving state-of-the-art performance on SWE- bench Verified,” https://github.com/SWE-bench/experiments, 2025

2025
[34]

Defects4J: A database of existing faults to enable controlled testing studies for Java programs,

R. Just, D. Jalali, and M. D. Ernst, “Defects4J: A database of existing faults to enable controlled testing studies for Java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2014, pp. 437–440

2014
[35]

Genprog: A generic method for automatic software repair,

C. Le Goues, T. Nguyenet al., “Genprog: A generic method for automatic software repair,”IEEE Trans. Softw. Eng., vol. 38, no. 1, pp. 54–72, 2012

2012
[36]

Language models can prioritize patches for practical program patching,

S. Kang and S. Yoo, “Language models can prioritize patches for practical program patching,” inProceedings of the 3rd IEEE/ACM International Workshop on Automated Program Repair. New York, NY , USA: ACM, 2022, pp. 8–15

2022
[37]

On the naturalness of software,

A. Hindle, E. T. Barret al., “On the naturalness of software,” inPro- ceedings of the 34th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2012, pp. 837–847

2012
[38]

Learning to represent patches,

X. Tang, H. Tianet al., “Learning to represent patches,” inCompanion Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. New York, NY , USA: ACM, 2024, pp. 396–397

2024
[39]

Automatic patch generation by learning correct code,

F. Long and M. C. Rinard, “Automatic patch generation by learning correct code,” inProceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. New York, NY , USA: ACM, 2016, pp. 298–312

2016
[40]

Context-aware patch generation for better automated program repair,

M. Wen, J. Chenet al., “Context-aware patch generation for better automated program repair,” inProceedings of the 40th International Conference on Software Engineering. New York, NY , USA: ACM, 2018, pp. 1–11

2018
[41]

TBar: Revisiting template-based automated program repair,

K. Liu, A. Koyuncuet al., “TBar: Revisiting template-based automated program repair,” inProceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2019, pp. 31–42

2019
[42]

FixMiner: Mining relevant fix patterns for automated program repair,

A. Koyuncu, K. Liuet al., “FixMiner: Mining relevant fix patterns for automated program repair,”Empirical Software Engineering, vol. 25, no. 3, pp. 1980–2024, 2020

1980
[43]

Semfix: program repair via semantic analysis,

H. D. T. Nguyen, D. Qiet al., “Semfix: program repair via semantic analysis,” inProceedings of the 2013 International Conference on Software Engineering. Piscataway, NJ, USA: IEEE Press, 2013, pp. 772–781

2013
[44]

Angelix: scalable multiline program patch synthesis via symbolic analysis,

S. Mechtaev, J. Yi, and A. Roychoudhury, “Angelix: scalable multiline program patch synthesis via symbolic analysis,” inProceedings of the 38th International Conference on Software Engineering. New York, NY , USA: ACM, 2016, pp. 691–701

2016
[45]

DLFix: Context-based code transformation learning for automated program repair,

Y . Li, S. Wang, and T. N. Nguyen, “DLFix: Context-based code transformation learning for automated program repair,” inProceedings of the 42nd International Conference on Software Engineering. New York, NY , USA: ACM, 2020, pp. 602–614

2020
[46]

Coconut: combining context-aware neural translation models using ensemble for program repair,

T. Lutellier, H. V . Phamet al., “Coconut: combining context-aware neural translation models using ensemble for program repair,” inProceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2020, pp. 101–114

2020
[47]

Sequencer: Sequence-to-sequence learning for end-to-end program repair,

Z. Chen, S. Kommruschet al., “Sequencer: Sequence-to-sequence learning for end-to-end program repair,”IEEE Trans. Softw. Eng., vol. 47, no. 9, pp. 1943–1959, 2021

1943
[48]

Less training, more repairing please: revisiting automated program repair via zero-shot learning,

C. S. Xia and L. Zhang, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” inProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY , USA: ACM, 2022, pp. 959–971

2022
[49]

CCTEST: Testing and repairing code completion systems,

Z. Li, C. Wanget al., “CCTEST: Testing and repairing code completion systems,” inProceedings of the 45th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2023, pp. 1238–1250

2023
[50]

Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,

C. S. Xia and L. Zhang, “Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2024, pp. 819–831

2024
[51]

RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair,

W. Wang, Y . Wanget al., “RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY , USA: ACM, 2023, pp. 146–158

2023
[52]

Enhancing repository-level software repair via repository-aware knowledge graphs,

B. Yang, H. Tianet al., “Enhancing repository-level software repair via repository-aware knowledge graphs,”arXiv preprint arXiv:2503.21710, 2025

work page arXiv 2025
[53]

SWE-agent: Agent-computer interfaces en- able automated software engineering,

J. Yang, C. E. Jimenezet al., “SWE-agent: Agent-computer interfaces en- able automated software engineering,” inAdvances in Neural Information Processing Systems, vol. 37, 2024

2024
[54]

AutoCodeRover: Autonomous program improvement,

Y . Zhang, J. Ruanet al., “AutoCodeRover: Autonomous program improvement,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2024, pp. 1592–1604

2024
[55]

Harnessing evolution for multi- hunk program repair,

S. Saha, R. K. Saha, and M. R. Prasad, “Harnessing evolution for multi- hunk program repair,” inProceedings of the 41st International Conference on Software Engineering. Piscataway, NJ, USA: IEEE / ACM, 2019, pp. 13–24

2019
[56]

Varfix: balancing edit expressiveness and search effectiveness in automated program repair,

C.-P. Wong, P. Santiestebanet al., “Varfix: balancing edit expressiveness and search effectiveness in automated program repair,” inProceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY , USA: ACM, 2021, pp. 354–366

2021
[57]

ITER: Iterative neural repair for multi-location patches,

H. Ye and M. Monperrus, “ITER: Iterative neural repair for multi-location patches,” inProceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, pp. 10:1–10:13

2024
[58]

Hybrid automated program repair by com- bining large language models and program analysis,

F. Li, J. Jianget al., “Hybrid automated program repair by com- bining large language models and program analysis,”arXiv preprint arXiv:2406.00992, 2024

work page arXiv 2024
[59]

SiblingRepair: Sibling-Based Multi-Hunk Repair with Large Language Models

X. Liu, J. Renet al., “Siblingrepair: Sibling-based multi-hunk repair with large language models,”arXiv preprint arXiv:2605.06209, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[60]

When large language models confront repository- level automatic program repair: How well they done?

Y . Chen, J. Wuet al., “When large language models confront repository- level automatic program repair: How well they done?” inProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. New York, NY , USA: ACM, 2024, pp. 459–471

2024

[1] [1]

Automated repair of programs from large language models,

Z. Fan, X. Gaoet al., “Automated repair of programs from large language models,” inProceedings of the 45th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2023, pp. 1469–1481

2023

[2] [2]

Automated program repair in the era of large pre-trained language models,

C. S. Xia, Y . Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” inProceedings of the 45th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2023, pp. 1482–1494

2023

[3] [3]

RepairAgent: An autonomous, LLM-based agent for program repair,

I. Bouzenia, P. T. Devanbu, and M. Pradel, “RepairAgent: An autonomous, LLM-based agent for program repair,” inProceedings of the 47th International Conference on Software Engineering, 2025, pp. 2188–2200

2025

[4] [4]

MORepair: Teaching LLMs to repair code via multi-objective fine-tuning,

B. Yang, H. Tianet al., “MORepair: Teaching LLMs to repair code via multi-objective fine-tuning,”ACM Trans. Softw. Eng. Methodol., vol. 35, no. 2, pp. 38:1–38:38, 2026

2026

[5] [5]

Modern code reviews in open-source projects: which problems do they fix?

M. Beller, A. Bacchelliet al., “Modern code reviews in open-source projects: which problems do they fix?” inProceedings of the 11th Working Conference on Mining Software Repositories. New York, NY , USA: ACM, 2014, pp. 202–211

2014

[6] [6]

An empirical study of the impact of modern code review practices on software quality,

S. McIntosh, Y . Kameiet al., “An empirical study of the impact of modern code review practices on software quality,”Empir. Softw. Eng., vol. 21, no. 5, pp. 2146–2189, 2016

2016

[7] [7]

Review4repair: Code review aided automatic program repairing,

F. Huq, M. Hasanet al., “Review4repair: Code review aided automatic program repairing,”Inf. Softw. Technol., vol. 143, p. 106765, 2022

2022

[8] [8]

On the introduction of automatic program repair in bloomberg,

S. Kirbas, E. Windelset al., “On the introduction of automatic program repair in bloomberg,”IEEE Softw., vol. 38, no. 4, pp. 43–51, 2021

2021

[9] [9]

CREF: An LLM-based conversational software repair framework for programming tutors,

B. Yang, H. Tianet al., “CREF: An LLM-based conversational software repair framework for programming tutors,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2024, pp. 882–894

2024

[10] [10]

SWE-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yanget al., “SWE-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations. Vienna, Austria: OpenReview.net, 2024

2024

[11] [11]

Competition-level code generation with AlphaCode,

Y . Li, D. Choiet al., “Competition-level code generation with AlphaCode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022

2022

[12] [12]

Preprint , year =

R. Ehrlich, B. Brownet al., “CodeMonkeys: Scaling test-time compute for software engineering,”arXiv preprint arXiv:2501.14723, 2025

work page arXiv 2025

[13] [13]

Orwall, K. Zhang, Y. Xie, A. Goyal, and W. Wang, “Swe-search: Enhancing software agents with monte carlo tree search and iterative refinement,

A. Antoniades, A. ¨Orwallet al., “SWE-Search: Enhancing software agents with monte carlo tree search and iterative refinement,”arXiv preprint arXiv:2410.20285, 2024

work page arXiv 2024

[14] [14]

An analysis of patch plausibility and correctness for generate-and-validate patch generation systems,

Z. Qi, F. Longet al., “An analysis of patch plausibility and correctness for generate-and-validate patch generation systems,” inProceedings of the 2015 International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2015, pp. 24–36

2015

[15] [15]

The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic bench- marking,

J. Petke, M. Martinezet al., “The patch overfitting problem in automated program repair: Practical magnitude and a baseline for realistic bench- marking,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. New York, NY , USA: ACM, 2024, pp. 452–456

2024

[16] [16]

A survey of LLM-based automated program repair: Taxonomies, design paradigms, and applications,

B. Yang, Z. Caiet al., “A survey of LLM-based automated program repair: Taxonomies, design paradigms, and applications,”arXiv preprint arXiv:2506.23749, 2025

work page arXiv 2025

[17] [17]

Demystifying LLM-based software engineering agents,

C. S. Xia, Y . Denget al., “Demystifying LLM-based software engineering agents,”Proc. ACM Softw. Eng., vol. 2, no. FSE, pp. 801–824, 2025

2025

[18] [18]

Revisiting object similarity-based patch ranking in automated program repair: An extensive study,

A. Ghanbari, “Revisiting object similarity-based patch ranking in automated program repair: An extensive study,” inProceedings of the 3rd IEEE/ACM International Workshop on Automated Program Repair. Piscataway, NJ, USA: IEEE, 2022, pp. 16–23

2022

[19] [19]

Rank aggregation methods for the web,

C. Dwork, R. Kumaret al., “Rank aggregation methods for the web,” in Proceedings of the 10th International Conference on World Wide Web. Hong Kong, China: ACM, 2001, pp. 613–622

2001

[20] [20]

Patch correctness assessment in automated program repair based on the impact of patches on production and test code,

A. Ghanbari and A. Marcus, “Patch correctness assessment in automated program repair based on the impact of patches on production and test code,” inProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2022, pp. 654–665

2022

[21] [21]

Predicting patch correctness based on the similarity of failing test cases,

H. Tian, Y . Liet al., “Predicting patch correctness based on the similarity of failing test cases,”ACM Trans. Softw. Eng. Methodol., vol. 31, no. 4, pp. 77:1–77:30, 2022

2022

[22] [22]

Identifying patch correctness in test-based program repair,

Y . Xiong, X. Liuet al., “Identifying patch correctness in test-based program repair,” inProceedings of the 40th International Conference on Software Engineering. New York, NY , USA: ACM, 2018, pp. 789–799

2018

[23] [23]

Automated patch assessment for program repair at scale,

H. Ye, M. Martinez, and M. Monperrus, “Automated patch assessment for program repair at scale,”Empir. Softw. Eng., vol. 26, no. 2, p. 20, 2021

2021

[24] [24]

Leveraging large language model for automatic patch correctness assessment,

X. Zhou, B. Xuet al., “Leveraging large language model for automatic patch correctness assessment,”IEEE Trans. Software Eng., vol. 50, no. 11, pp. 2865–2883, 2024

2024

[25] [25]

Self-consistency improves chain of thought reasoning in language models,

X. Wang, J. Weiet al., “Self-consistency improves chain of thought reasoning in language models,” inThe Eleventh International Conference on Learning Representations. OpenReview.net, 2023

2023

[26] [26]

CodeT: Code generation with generated tests,

B. Chen, F. Zhanget al., “CodeT: Code generation with generated tests,” inThe Eleventh International Conference on Learning Representations. OpenReview.net, 2023

2023

[27] [27]

Universal self-consistency for large language model generation.arXiv preprint arXiv:2311.17311, 2023

X. Chen, R. Aksitovet al., “Universal self-consistency for large language model generation,”arXiv preprint arXiv:2311.17311, 2023

work page arXiv 2023

[28] [28]

SWE-bench official leaderboards,

SWE-bench Team, “SWE-bench official leaderboards,” https://www. swebench.com/, 2026

2026

[29] [29]

Live-SWE-agent: Can software engineering agents self-evolve on the fly?

C. S. Xia, Z. Wanget al., “Live-SWE-agent: Can software engineering agents self-evolve on the fly?”arXiv preprint arXiv:2511.13646, 2025

work page arXiv 2025

[30] [30]

Trae Agent: An LLM-based agent for general purpose software engineering tasks,

ByteDance, “Trae Agent: An LLM-based agent for general purpose software engineering tasks,” https://github.com/bytedance/trae-agent, 2025

2025

[31] [31]

Rovo Dev: Agentic AI for software teams,

Atlassian, “Rovo Dev: Agentic AI for software teams,” https://www. atlassian.com/software/rovo-dev, 2025

2025

[32] [32]

EPAM AI/Run Developer Agent: SWE-bench verified submission,

EPAM Systems, “EPAM AI/Run Developer Agent: SWE-bench verified submission,” https://github.com/SWE-bench/experiments, 2025

2025

[33] [33]

ACoder: Achieving state-of-the-art performance on SWE- bench Verified,

ACoder Team, “ACoder: Achieving state-of-the-art performance on SWE- bench Verified,” https://github.com/SWE-bench/experiments, 2025

2025

[34] [34]

Defects4J: A database of existing faults to enable controlled testing studies for Java programs,

R. Just, D. Jalali, and M. D. Ernst, “Defects4J: A database of existing faults to enable controlled testing studies for Java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2014, pp. 437–440

2014

[35] [35]

Genprog: A generic method for automatic software repair,

C. Le Goues, T. Nguyenet al., “Genprog: A generic method for automatic software repair,”IEEE Trans. Softw. Eng., vol. 38, no. 1, pp. 54–72, 2012

2012

[36] [36]

Language models can prioritize patches for practical program patching,

S. Kang and S. Yoo, “Language models can prioritize patches for practical program patching,” inProceedings of the 3rd IEEE/ACM International Workshop on Automated Program Repair. New York, NY , USA: ACM, 2022, pp. 8–15

2022

[37] [37]

On the naturalness of software,

A. Hindle, E. T. Barret al., “On the naturalness of software,” inPro- ceedings of the 34th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2012, pp. 837–847

2012

[38] [38]

Learning to represent patches,

X. Tang, H. Tianet al., “Learning to represent patches,” inCompanion Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. New York, NY , USA: ACM, 2024, pp. 396–397

2024

[39] [39]

Automatic patch generation by learning correct code,

F. Long and M. C. Rinard, “Automatic patch generation by learning correct code,” inProceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. New York, NY , USA: ACM, 2016, pp. 298–312

2016

[40] [40]

Context-aware patch generation for better automated program repair,

M. Wen, J. Chenet al., “Context-aware patch generation for better automated program repair,” inProceedings of the 40th International Conference on Software Engineering. New York, NY , USA: ACM, 2018, pp. 1–11

2018

[41] [41]

TBar: Revisiting template-based automated program repair,

K. Liu, A. Koyuncuet al., “TBar: Revisiting template-based automated program repair,” inProceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2019, pp. 31–42

2019

[42] [42]

FixMiner: Mining relevant fix patterns for automated program repair,

A. Koyuncu, K. Liuet al., “FixMiner: Mining relevant fix patterns for automated program repair,”Empirical Software Engineering, vol. 25, no. 3, pp. 1980–2024, 2020

1980

[43] [43]

Semfix: program repair via semantic analysis,

H. D. T. Nguyen, D. Qiet al., “Semfix: program repair via semantic analysis,” inProceedings of the 2013 International Conference on Software Engineering. Piscataway, NJ, USA: IEEE Press, 2013, pp. 772–781

2013

[44] [44]

Angelix: scalable multiline program patch synthesis via symbolic analysis,

S. Mechtaev, J. Yi, and A. Roychoudhury, “Angelix: scalable multiline program patch synthesis via symbolic analysis,” inProceedings of the 38th International Conference on Software Engineering. New York, NY , USA: ACM, 2016, pp. 691–701

2016

[45] [45]

DLFix: Context-based code transformation learning for automated program repair,

Y . Li, S. Wang, and T. N. Nguyen, “DLFix: Context-based code transformation learning for automated program repair,” inProceedings of the 42nd International Conference on Software Engineering. New York, NY , USA: ACM, 2020, pp. 602–614

2020

[46] [46]

Coconut: combining context-aware neural translation models using ensemble for program repair,

T. Lutellier, H. V . Phamet al., “Coconut: combining context-aware neural translation models using ensemble for program repair,” inProceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2020, pp. 101–114

2020

[47] [47]

Sequencer: Sequence-to-sequence learning for end-to-end program repair,

Z. Chen, S. Kommruschet al., “Sequencer: Sequence-to-sequence learning for end-to-end program repair,”IEEE Trans. Softw. Eng., vol. 47, no. 9, pp. 1943–1959, 2021

1943

[48] [48]

Less training, more repairing please: revisiting automated program repair via zero-shot learning,

C. S. Xia and L. Zhang, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” inProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY , USA: ACM, 2022, pp. 959–971

2022

[49] [49]

CCTEST: Testing and repairing code completion systems,

Z. Li, C. Wanget al., “CCTEST: Testing and repairing code completion systems,” inProceedings of the 45th International Conference on Software Engineering. Piscataway, NJ, USA: IEEE, 2023, pp. 1238–1250

2023

[50] [50]

Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,

C. S. Xia and L. Zhang, “Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2024, pp. 819–831

2024

[51] [51]

RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair,

W. Wang, Y . Wanget al., “RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY , USA: ACM, 2023, pp. 146–158

2023

[52] [52]

Enhancing repository-level software repair via repository-aware knowledge graphs,

B. Yang, H. Tianet al., “Enhancing repository-level software repair via repository-aware knowledge graphs,”arXiv preprint arXiv:2503.21710, 2025

work page arXiv 2025

[53] [53]

SWE-agent: Agent-computer interfaces en- able automated software engineering,

J. Yang, C. E. Jimenezet al., “SWE-agent: Agent-computer interfaces en- able automated software engineering,” inAdvances in Neural Information Processing Systems, vol. 37, 2024

2024

[54] [54]

AutoCodeRover: Autonomous program improvement,

Y . Zhang, J. Ruanet al., “AutoCodeRover: Autonomous program improvement,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY , USA: ACM, 2024, pp. 1592–1604

2024

[55] [55]

Harnessing evolution for multi- hunk program repair,

S. Saha, R. K. Saha, and M. R. Prasad, “Harnessing evolution for multi- hunk program repair,” inProceedings of the 41st International Conference on Software Engineering. Piscataway, NJ, USA: IEEE / ACM, 2019, pp. 13–24

2019

[56] [56]

Varfix: balancing edit expressiveness and search effectiveness in automated program repair,

C.-P. Wong, P. Santiestebanet al., “Varfix: balancing edit expressiveness and search effectiveness in automated program repair,” inProceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY , USA: ACM, 2021, pp. 354–366

2021

[57] [57]

ITER: Iterative neural repair for multi-location patches,

H. Ye and M. Monperrus, “ITER: Iterative neural repair for multi-location patches,” inProceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, pp. 10:1–10:13

2024

[58] [58]

Hybrid automated program repair by com- bining large language models and program analysis,

F. Li, J. Jianget al., “Hybrid automated program repair by com- bining large language models and program analysis,”arXiv preprint arXiv:2406.00992, 2024

work page arXiv 2024

[59] [59]

SiblingRepair: Sibling-Based Multi-Hunk Repair with Large Language Models

X. Liu, J. Renet al., “Siblingrepair: Sibling-based multi-hunk repair with large language models,”arXiv preprint arXiv:2605.06209, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[60] [60]

When large language models confront repository- level automatic program repair: How well they done?

Y . Chen, J. Wuet al., “When large language models confront repository- level automatic program repair: How well they done?” inProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. New York, NY , USA: ACM, 2024, pp. 459–471

2024