pith. machine review for the scientific record.

arxiv: 2605.06209 · v1 · submitted 2026-05-07 · 💻 cs.SE

Recognition: unknown

SiblingRepair: Sibling-Based Multi-Hunk Repair with Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:46 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated program repair · multi-hunk repair · sibling locations · large language models · defects4j · ghrb · patch generation · fault localization

The pith

SiblingRepair uses large language models to detect and fix similar bugs in related code locations more effectively than previous methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SiblingRepair, an LLM-powered method for repairing bugs that span multiple related code locations (multi-hunk bugs) by identifying sibling locations that share similar issues. It improves on earlier techniques by searching for siblings using code-token and embedding similarity instead of requiring commit history or strict structural matches, then uses the LLM to select relevant siblings and generate consistent fixes through simultaneous or iterative strategies. This yields better performance on standard benchmarks (Defects4J, GHRB) than state-of-the-art tools such as Hercules. A sympathetic reader would care because many real-world bugs occur in duplicated or similar code, and better automated repair could save developers time otherwise spent completing partial patches. The approach also preserves promising patches from earlier attempts to build more general multi-hunk fixes.

Core claim

SiblingRepair advances multi-hunk automated program repair by starting from a fault-localized location, using token- and embedding-based matching to find sibling candidates without commit-history restrictions, applying an LLM to identify failure-relevant siblings, and generating consistent patches via simultaneous joint repair or iterative analysis, while combining preserved patches from earlier locations into generalized multi-hunk solutions, resulting in substantially more repairs than prior SOTA methods on Defects4J and GHRB.

What carries the argument

The central mechanism is the LLM-guided sibling identification and repair using two strategies—simultaneous repair for joint fixing and iterative repair for progressive patch building—built on semantic code matching rather than AST or history-based methods.
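The token- and embedding-based sibling search described above can be sketched roughly as follows. This is a minimal editorial illustration, not the paper's implementation: the tokenizer, the `embed` function, and the parameter names k, theta (embedding similarity threshold), and alpha (Jaccard similarity threshold) are stand-ins borrowed from Algorithm 1 in Figure 2.

```python
import re


def tokens(code: str) -> set:
    """Split a code snippet into a set of identifier and operator tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def sibling_candidates(buggy, locations, embed, k=5, theta=0.8, alpha=0.3):
    """Return up to k candidate siblings for the buggy snippet.

    A location must clear a token-level Jaccard threshold (alpha) and an
    embedding-cosine threshold (theta); survivors are ranked by cosine.
    `embed` is any snippet-to-vector function (a stand-in here)."""
    bug_toks, bug_vec = tokens(buggy), embed(buggy)
    scored = []
    for loc in locations:
        if jaccard(bug_toks, tokens(loc)) < alpha:
            continue
        sim = cosine(bug_vec, embed(loc))
        if sim >= theta:
            scored.append((sim, loc))
    return [loc for _, loc in sorted(scored, reverse=True)[:k]]
```

In the paper's pipeline the surviving candidates are then handed to the LLM, which filters for failure relevance before any patch is generated; this sketch covers only the pre-LLM matching step.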

If this is right

  • Automated repair tools can address more bugs that span multiple similar code locations without needing version history.
  • Patches become more consistent across related functions, reducing risks of partial fixes.
  • The method shows efficiency gains and robustness to potential data leakage in LLMs.
  • It validates the use of semantic embeddings alongside LLMs for candidate selection in repair tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating SiblingRepair with other fault localization techniques could further improve overall repair rates.
  • The dual repair strategies might be applicable to single-hunk repairs for better patch quality.
  • Developers could use similar sibling detection to manually review related code sections for similar issues.
  • This approach suggests potential for LLM-based repair in non-Java languages or different bug types.

Load-bearing premise

The large language model must be able to correctly identify which sibling code locations are relevant to the specific failure and generate patches that maintain semantic consistency when applied across different but related code sites.
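One editorial way to operationalize the consistency half of this premise (not a mechanism the paper describes): normalize identifiers out of each sibling patch and compare the resulting edit shapes, so fixes that apply the same structural change at different sites count as consistent.

```python
import difflib
import re


def edit_shape(before: str, after: str) -> list:
    """Summarize a patch as its removed/added lines with identifiers
    replaced by a placeholder, so structurally identical fixes applied
    at different sibling sites compare equal."""
    norm = lambda s: re.sub(r"[A-Za-z_]\w*", "ID", s).strip()
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    # Keep only +/- hunk lines, dropping the ---/+++ file headers.
    return [(l[0], norm(l[1:])) for l in diff
            if l[:1] in "+-" and l[:3] not in ("+++", "---")]


def consistent(patches) -> bool:
    """True if every (before, after) sibling patch has the same edit shape."""
    shapes = [edit_shape(b, a) for b, a in patches]
    return all(s == shapes[0] for s in shapes)
```

Under this check, an off-by-one fix applied to two renamed copies of the same loop is consistent, while a sibling patched with a different bound is flagged; the paper's own consistency notion is enforced by the LLM's joint generation rather than by any such post-hoc filter.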

What would settle it

If evaluations on Defects4J show that SiblingRepair does not repair more multi-hunk bugs than Hercules or other SOTA techniques, or if manual inspection reveals inconsistent patches from the LLM, the superiority claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.06209 by Jiayu Ren, Jifeng Xuan, Qi Xin, Xiaoyuan Xie, Xinyu Liu, Yusen Wang.

Figure 1
Figure 1: The Math 100 bug and its correct patch (left), along with Hercules's AST-based code matching and edit generation (right). The surrounding text uses the Math 100 bug from Defects4J [6] as a motivating example: the failure, an ArrayIndexOutOfBoundsException, is caused by using the full parameter vector in the computation.
Figure 2
Figure 2: The overview of SiblingRepair, with Algorithm 1. Inputs: initial buggy program b, test suite T, fault localization F, max number of candidate siblings k, embedding similarity threshold θ, Jaccard similarity threshold α, repair attempts t, max fix ingredients per line n. Output: plausible patches Ppl.
Figure 3
Figure 3: Illustration of the prompt construction.
Figure 4
Figure 4: Box plots of SiblingRepair's repair time for the Defects4J bugs under SBFL (left) and SPFL (right) for each setting; time for fault localization is excluded. From left to right, each plot shows statistics for bugs correctly repaired, incorrectly repaired (plausible patches only), and not repaired (no plausible patches).
read the original abstract

Developers often make similar mistakes across code locations implementing related functionalities. These locations, called siblings, share similar issues and require similar fixes. Accurately identifying siblings and consistently repairing them are crucial for automated program repair. Hercules is a SOTA technique designed for sibling repair. However, it is limited by strong assumptions about sibling locations and commit-history availability, rigid AST-based sibling matching, and inflexible template-based patch generation. To address these limitations, we present SiblingRepair, a new LLM-based multi-hunk APR technique specialized for sibling repair. Starting from a suspicious location identified by spectrum-based fault localization, SiblingRepair searches for semantically related sibling candidates using token- and embedding-based code matching, without restricting discovery to failing-test coverage or commit history. It then uses an LLM to identify failure-relevant siblings and generate consistent patches through two complementary strategies: simultaneous repair, which jointly repairs siblings, and iterative repair, which progressively analyzes candidates for patch construction. SiblingRepair further preserves promising patches generated from earlier suspicious locations and combines them into generalized multi-hunk patches. We evaluate SiblingRepair on the Defects4J and GHRB benchmarks. The results show that SiblingRepair substantially outperforms SOTA multi-hunk repair techniques including Hercules. Our evaluation further demonstrates its repair efficiency, the effectiveness of its sibling detection and repair components, and limited impact of the LLM data leakage on the results. Overall, SiblingRepair advances automated sibling and general multi-hunk repair.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SiblingRepair, an LLM-based multi-hunk automated program repair technique that identifies semantically related sibling locations via token- and embedding-based matching (without relying on commit history or coverage restrictions), then applies an LLM for failure-relevant filtering and consistent patch generation using simultaneous or iterative strategies. It preserves and generalizes patches across suspicious locations and evaluates on Defects4J and GHRB, claiming substantial outperformance over SOTA multi-hunk techniques including Hercules, plus effective ablations on sibling components and limited LLM data leakage impact.

Significance. If the central performance claims hold after addressing evaluation gaps, the work meaningfully advances multi-hunk APR by relaxing rigid assumptions in prior template/AST-based methods like Hercules and demonstrating practical use of LLMs for semantic sibling detection and cross-location consistency. Strengths include the dual repair strategies and explicit handling of data leakage concerns, which are load-bearing for empirical claims in this domain.

major comments (3)
  1. [§5] Evaluation: The headline claim of substantial outperformance over Hercules and other SOTA techniques on Defects4J/GHRB is reported without quantitative details on patch correctness verification (beyond test-suite passage), statistical significance (e.g., p-values or effect sizes for repair count differences), or exact numbers of correctly fixed bugs per benchmark; this prevents verification that gains are not artifacts of evaluation setup.
  2. [§5.2] Ablation study on sibling components: The ablation demonstrates effectiveness of sibling detection/repair but does not include a control baseline of the same LLM prompted for independent single-location repair (without sibling search or consistency enforcement); this leaves open whether reported gains derive from the sibling machinery or simply stronger LLM patch generation, undermining attribution of the central contribution.
  3. [§5.4] Data leakage analysis: While the paper asserts limited impact of LLM data leakage, the concrete method for detecting leakage (e.g., overlap checks between Defects4J/GHRB and LLM training corpora, or prompt-specific contamination tests) is not specified with sufficient detail or results to allow independent verification of the claim.
minor comments (2)
  1. [Abstract/Introduction] The abstract and introduction could more explicitly define 'sibling' locations with an example to ground the token/embedding matching approach.
  2. [§5] Figure captions and axis labels in the evaluation plots should include error bars or variance measures to improve clarity of the reported repair efficiencies.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity, verifiability, and attribution of contributions.

read point-by-point responses
  1. Referee: [§5] Evaluation: The headline claim of substantial outperformance over Hercules and other SOTA techniques on Defects4J/GHRB is reported without quantitative details on patch correctness verification (beyond test-suite passage), statistical significance (e.g., p-values or effect sizes for repair count differences), or exact numbers of correctly fixed bugs per benchmark; this prevents verification that gains are not artifacts of evaluation setup.

    Authors: We appreciate this observation. The manuscript reports the number of bugs fixed by SiblingRepair versus baselines (including Hercules) on both Defects4J and GHRB, but we agree that additional quantitative details would strengthen verifiability. In the revision, we will expand §5 with: exact counts of correctly fixed bugs per benchmark (including any manual verification of patch correctness beyond test-suite passage), statistical significance tests (e.g., McNemar's test with p-values for paired repair count differences), and effect sizes. These will be added to updated tables and accompanying text. revision: yes

  2. Referee: [§5.2] Ablation study on sibling components: The ablation demonstrates effectiveness of sibling detection/repair but does not include a control baseline of the same LLM prompted for independent single-location repair (without sibling search or consistency enforcement); this leaves open whether reported gains derive from the sibling machinery or simply stronger LLM patch generation, undermining attribution of the central contribution.

    Authors: This is a valid critique of the ablation design. To better isolate the contribution of sibling detection and consistency enforcement, we will add a new control baseline in §5.2: the same LLM prompted for independent single-location repair on each suspicious location, without sibling search or cross-location consistency. Results from this baseline will be directly compared to full SiblingRepair to quantify the incremental gains from the sibling components. revision: yes

  3. Referee: [§5.4] Data leakage analysis: While the paper asserts limited impact of LLM data leakage, the concrete method for detecting leakage (e.g., overlap checks between Defects4J/GHRB and LLM training corpora, or prompt-specific contamination tests) is not specified with sufficient detail or results to allow independent verification of the claim.

    Authors: We acknowledge the need for greater methodological transparency here. In the revised §5.4, we will explicitly detail the leakage detection approach, including overlap checks against publicly documented LLM training data for Defects4J/GHRB code and bug reports, any prompt-specific contamination tests performed, and the quantitative results (e.g., overlap percentages) supporting the limited-impact claim. revision: yes
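The paired significance test promised in response 1 can be sketched as an exact McNemar test over discordant repair outcomes. This is a hedged illustration of the standard test, not an analysis from the paper; any counts fed to it would come from the revised tables.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs.

    b = bugs fixed by tool A but not tool B; c = the reverse.
    Under H0 each discordant bug is equally likely to fall either way,
    so this is a two-sided binomial test with p = 0.5 on n = b + c trials.
    Returns the p-value (capped at 1.0)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing to test
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

For example, if hypothetically one tool uniquely fixed 8 bugs and the other uniquely fixed 1, the exact p-value is about 0.039; concordant bugs (fixed by both or neither) do not enter the statistic.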

Circularity Check

0 steps flagged

No circularity: empirical technique paper with independent benchmark evaluation

full rationale

The paper proposes an LLM-based method for identifying and repairing sibling code locations in multi-hunk APR, then reports experimental results on Defects4J and GHRB against baselines like Hercules. No equations, first-principles derivations, or predictions appear in the provided text. The evaluation measures repair success via test-suite passage on external benchmarks, without any fitted parameters, self-definitions, or self-citation chains that reduce the central claims to the inputs by construction. Prompt engineering and component ablations are described as design choices, not as statistically forced outputs. This is a standard empirical SE paper whose claims rest on observable performance differences rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven domain assumption that current LLMs possess sufficient code semantics understanding to identify relevant siblings and generate consistent multi-location patches. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption LLMs can accurately identify failure-relevant siblings from token- and embedding-based candidates and generate consistent patches
    The entire repair pipeline depends on this capability; the abstract presents it as given.

pith-pipeline@v0.9.0 · 5577 in / 1302 out tokens · 38967 ms · 2026-05-08T08:46:11.746358+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 44 canonical work pages · 2 internal anchors

  1. [1]

    Automated program repair

    C. L. Goues, M. Pradel, and A. Roychoudhury, “Automated program repair,” Communications of the ACM, vol. 62, no. 12, pp. 56–65, 2019, https://doi.org/10.1145/3318162

  2. [2]

    A systematic literature review on large language models for automated program repair,

    Q. Zhang, C. Fang, Y. Xie, Y. Ma, W. Sun, Y. Yang, and Z. Chen, “A systematic literature review on large language models for automated program repair,” ACM Transactions on Software Engineering and Methodology, 2024, https://doi.org/10.1145/3799693

  3. [3]

    Bug replication in code clones: An empirical study,

    J. F. Islam, M. Mondal, and C. K. Roy, “Bug replication in code clones: An empirical study,” in Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 2016, pp. 68–78, https://doi.org/10.1109/SANER.2016.78

  4. [4]

    Harnessing evolution for multi-hunk program repair,

    S. Saha, R. K. Saha, and M. R. Prasad, “Harnessing evolution for multi-hunk program repair,” in Proceedings of the 41st IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 13–24, https://doi.org/10.1109/ICSE.2019.00020

  5. [5]

    Detecting, creating, repairing, and understanding indivisible multi-hunk bugs,

    Q. Xin, H. Wu, J. Tang, X. Liu, S. P. Reiss, and J. Xuan, “Detecting, creating, repairing, and understanding indivisible multi-hunk bugs,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 2747–2770, 2024, https://doi.org/10.1145/3660828

  6. [6]

    Defects4J: A database of existing faults to enable controlled testing studies for Java programs,

    R. Just, D. Jalali, and M. D. Ernst, “Defects4J: A database of existing faults to enable controlled testing studies for Java programs,” in Proceedings of the 23rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM, 2014, pp. 437–440, https://doi.org/10.1145/2610384.2628055

  7. [7]

    Fine-grained and accurate source code differencing,

    J. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Fine-grained and accurate source code differencing,” in Proceedings of the 29th IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM, 2014, pp. 313–324, https://doi.org/10.1145/2642937.2642982

  8. [8]

    Fine-grained, accurate and scalable source differencing,

    J. Falleri and M. Martinez, “Fine-grained, accurate and scalable source differencing,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE). ACM, 2024, pp. 231:1–231:12, https://doi.org/10.1145/3597503.3639148

  9. [9]

    Template-based neural program repair,

    X. Meng, X. Wang, H. Zhang, H. Sun, X. Liu, and C. Hu, “Template-based neural program repair,” in Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1456–1468, https://doi.org/10.1109/ICSE48619.2023.00127

  10. [11]

    Mining of massive datasets,

    J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2020, https://doi.org/10.1017/9781108684163

  11. [12]

    The GitHub recent bugs dataset for evaluating LLM-based debugging applications,

    J. Y. Lee, S. Kang, J. Yoon, and S. Yoo, “The GitHub recent bugs dataset for evaluating LLM-based debugging applications,” in Proceedings of the 17th IEEE International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2024, pp. 442–444, https://doi.org/10.1109/ICST60714.2024.00049

  12. [13]

    Iter: Iterative neural repair for multi-location patches,

    H. Ye and M. Monperrus, “Iter: Iterative neural repair for multi-location patches,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE). ACM, 2024, pp. 1–13, https://doi.org/10.1145/3597503.3623337

  13. [14]

    Simple fast algorithms for the editing distance between trees and related problems,

    K. Zhang and D. Shasha, “Simple fast algorithms for the editing distance between trees and related problems,” SIAM Journal on Computing, vol. 18, no. 6, pp. 1245–1262, 1989, https://doi.org/10.1137/0218082

  14. [15]

    Automatic patch generation learned from human-written patches,

    D. Kim, J. Nam, J. Song, and S. Kim, “Automatic patch generation learned from human-written patches,” in Proceedings of the 35th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 802–811, https://doi.org/10.1109/ICSE.2013.6606626

  15. [16]

    Elixir: Effective object-oriented program repair,

    R. K. Saha, Y. Lyu, H. Yoshida, and M. R. Prasad, “Elixir: Effective object-oriented program repair,” in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2017, pp. 648–659, https://doi.org/10.1109/ASE.2017.8115675

  16. [17]

    Precise condition synthesis for program repair,

    Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and L. Zhang, “Precise condition synthesis for program repair,” in Proceedings of the 39th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2017, pp. 416–426, https://doi.org/10.1109/ICSE.2017.45

  17. [18]

    On the accuracy of spectrum-based fault localization,

    R. Abreu, P. Zoeteweij, and A. J. Van Gemund, “On the accuracy of spectrum-based fault localization,” in Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques-MUTATION (TAICPART-MUTATION 2007). IEEE, 2007, pp. 89–98, https://doi.org/10.1109/TAIC.PART.2007.13

  18. [19]

    New and improved embedding model,

    R. Greene, T. Sanders, L. Weng, and A. Neelakantan, “New and improved embedding model,” https://openai.com/blog/new-and-improved-embedding-model, 2022, Accessed: 2026-04-10

  19. [20]

    Code2vec: Learning distributed representations of code,

    U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of code,” Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29, 2019, https://doi.org/10.1145/3290353

  20. [21]

    Assessing the generalizability of code2vec token embeddings,

    H. J. Kang, T. F. Bissyandé, and D. Lo, “Assessing the generalizability of code2vec token embeddings,” in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2019, pp. 1–12, https://doi.org/10.1109/ASE.2019.00011

  21. [22]

    Studying and understanding the effectiveness and failures of conversational llm-based repair,

    A. Chen, H. Wu, Q. Xin, S. P. Reiss, and J. Xuan, “Studying and understanding the effectiveness and failures of conversational llm-based repair,” in Proceedings of the 2025 IEEE/ACM International Workshop on Automated Program Repair (APR). IEEE, 2025, pp. 56–59, https://doi.org/10.1109/APR66717.2025.00014

  22. [23]

    Defects4J repository,

    R. Just, D. Jalali, and M. D. Ernst, “Defects4J repository,” https://github.com/rjust/defects4j, 2023, Accessed: 2026-04-24

  23. [24]

    Leveraging search-based and pre-trained code language models for automated program repair,

    O. Lijzenga, I. Hemati Moghadam, and V. Zaytsev, “Leveraging search-based and pre-trained code language models for automated program repair,” in Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing (SAC). ACM, 2025, pp. 1627–1636, https://doi.org/10.1145/3672608.3707774

  24. [25]

    The Hercules tool,

    Q. Xin, H. Wu, J. Tang, X. Liu, S. P. Reiss, and J. Xuan, “The Hercules tool,” https://github.com/give-to/Hercules, 2026, Accessed: 2026-04-25

  25. [26]

    The ITER tool,

    H. Ye and M. Monperrus, “The ITER tool,” https://github.com/ASSERT-KTH/ITER, 2024, Accessed: 2026-04-24

  26. [27]

    The ARJACLM tool,

    O. Lijzenga, I. Hemati Moghadam, and V. Zaytsev, “The ARJACLM tool,” https://doi.org/10.5281/zenodo.14222432, 2024, Accessed: 2025-10-20

  27. [28]

    GenProg: A generic method for automatic software repair,

    C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “GenProg: A generic method for automatic software repair,” IEEE Transactions on Software Engineering, vol. 38, no. 1, pp. 54–72, 2011, https://doi.org/10.1109/TSE.2011.104

  28. [29]

    ARJA: Automated repair of java programs via multi-objective genetic programming,

    Y. Yuan and W. Banzhaf, “ARJA: Automated repair of Java programs via multi-objective genetic programming,” IEEE Transactions on Software Engineering, vol. 46, no. 10, pp. 1040–1067, 2018, https://doi.org/10.1109/TSE.2018.2874648

  29. [30]

    A hybrid evolutionary system for automatic software repair,

    Y. Yuan and W. Banzhaf, “A hybrid evolutionary system for automatic software repair,” in Proceedings of the 2019 Genetic and Evolutionary Computation Conference (GECCO). ACM, 2019, pp. 1417–1425, https://doi.org/10.1145/3321707.3321830

  30. [31]

    Multimend: Multilingual program repair with context augmentation and multi-hunk patch generation,

    R. Gharibi, M. H. Sadreddini, and S. M. Fakhrahmad, “Multimend: Multilingual program repair with context augmentation and multi-hunk patch generation,”Automated Software Engineering, vol. 33, no. 2, p. 69, 2025, https://doi.org/10.1007/s10515-026-00611-2

  31. [32]

    Repairagent: An autonomous, llm-based agent for program repair,

    I. Bouzenia, P. Devanbu, and M. Pradel, “Repairagent: An autonomous, llm-based agent for program repair,” in Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2025, pp. 2188–2200, https://doi.org/10.1109/ICSE55347.2025.00157

  32. [33]

    Repairllama: Efficient representations and fine-tuned adapters for program repair,

    A. Silva, S. Fang, and M. Monperrus, “Repairllama: Efficient representations and fine-tuned adapters for program repair,” IEEE Transactions on Software Engineering, vol. 51, no. 8, pp. 2366–2380, 2025, https://doi.org/10.1109/TSE.2025.3581062

  33. [34]

    PReMM: LLM-based program repair for multi-method bugs via divide and conquer,

    L. Xie, Z. Li, Y. Pei, Z. Wen, K. Liu, T. Zhang, and X. Li, “PReMM: LLM-based program repair for multi-method bugs via divide and conquer,” Proceedings of the ACM on Programming Languages, vol. 9, no. OOPSLA2, pp. 1316–1344, 2025, https://doi.org/10.1145/3763097

  34. [35]

    DynaFix: Iterative Automated Program Repair Driven by Execution-Level Dynamic Information

    Z. Huang, L. Xu, C. Liu, W. Sun, X. Zhang, Y. Lei, M. Yan, and H. Zhang, “Dynafix: Iterative automated program repair driven by execution-level dynamic information,” Computing Research Repository, vol. abs/2512.24635, 2025, https://doi.org/10.48550/arXiv.2512.24635v1

  35. [36]

    DynaFix: Iterative Automated Program Repair Driven by Execution-Level Dynamic Information

    Z. Huang, L. Xu, C. Liu, W. Sun, X. Zhang, Y. Lei, M. Yan, and H. Zhang, “Dynafix: Iterative automated program repair driven by execution-level dynamic information,” Computing Research Repository, vol. abs/2512.24635, 2025, https://doi.org/10.48550/arXiv.2512.24635v2

  36. [37]

    How far can we go with practical function-level program repair?

    J. Xiang, X. Xu, F. Kong, M. Wu, Z. Zhang, H. Zhang, and Y. Zhang, “How far can we go with practical function-level program repair?” Computing Research Repository, vol. abs/2404.12833, 2024, https://doi.org/10.48550/arXiv.2404.12833

  37. [39]

    Autocoderover: Autonomous program improvement,

    Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury, “Autocoderover: Autonomous program improvement,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM, 2024, pp. 1592–1604, https://doi.org/10.1145/3650212.3680384

  38. [40]

    Specrover: Code intent extraction via LLMs,

    H. Ruan, Y. Zhang, and A. Roychoudhury, “Specrover: Code intent extraction via LLMs,” in Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2025, pp. 963–974, https://doi.org/10.1109/ICSE55347.2025.00080

  39. [41]

    Demystifying llm-based software engineering agents,

    C. S. Xia, Y. Deng, S. Dunn, and L. Zhang, “Demystifying llm-based software engineering agents,” Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 801–824, 2025, https://doi.org/10.1145/3715754

  40. [42]

    The plastic surgery hypothesis in the era of large language models,

    C. S. Xia, Y. Ding, and L. Zhang, “The plastic surgery hypothesis in the era of large language models,” in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2023, pp. 522–534, https://doi.org/10.1109/ASE56229.2023.00047

  41. [43]

    Gamma: Revisiting template-based automated program repair via mask prediction,

    Q. Zhang, C. Fang, T. Zhang, B. Yu, W. Sun, and Z. Chen, “Gamma: Revisiting template-based automated program repair via mask prediction,” in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2023, pp. 535–547, https://doi.org/10.1109/ASE56229.2023.00063

  42. [44]

    Less training, more repairing please: revisiting automated program repair via zero-shot learning,

    C. S. Xia and L. Zhang, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 2022, pp. 959–971, https://doi.org/10.1145/3540250.3549101

  43. [45]

    Gzoltar: an eclipse plug-in for testing and debugging,

    J. Campos, A. Riboira, A. Perez, and R. Abreu, “Gzoltar: an eclipse plug-in for testing and debugging,” in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM, 2012, pp. 378–381, https://doi.org/10.1145/2351676.2351752

  44. [46]

    ARJACLM,

    X. Liu, J. Ren, Y. Wang, Q. Xin, X. Xie, and J. Xuan, “ARJACLM,” https://github.com/chengkangda2/ARJACLM, 2026, Accessed: 2026-05-05

  45. [47]

    Evolving paradigms in automated program repair: Taxonomy, challenges, and opportunities,

    K. Huang, Z. Xu, S. Yang, H. Sun, X. Li, Z. Yan, and Y. Zhang, “Evolving paradigms in automated program repair: Taxonomy, challenges, and opportunities,” ACM Computing Surveys, vol. 57, no. 2, pp. 1–43, 2024, https://doi.org/10.1145/3696450

  46. [48]

    Semfix: Program repair via semantic analysis,

    H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra, “Semfix: Program repair via semantic analysis,” in Proceedings of the 35th IEEE/ACM International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 772–781, https://doi.org/10.1109/ICSE.2013.6606623

  47. [49]

    Leveraging syntax-related code for automated program repair,

    Q. Xin and S. P. Reiss, “Leveraging syntax-related code for automated program repair,” in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2017, pp. 660–670, https://doi.org/10.1109/ASE.2017.8115676

  48. [50]

    Shaping program repair space with existing patches and similar code,

    J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. Chen, “Shaping program repair space with existing patches and similar code,” in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM, 2018, pp. 298–309, https://doi.org/10.1145/3213846.3213871

  49. [51]

    Automatic patch generation with context-based change application,

    J. Kim and S. Kim, “Automatic patch generation with context-based change application,”Empirical Software Engineering, vol. 24, no. 6, pp. 4071–4106, 2019, https://doi.org/10.1007/s10664-019-09742-5

  50. [52]

    Fixminer: Mining relevant fix patterns for automated program repair,

    A. Koyuncu, K. Liu, T. F. Bissyandé, D. Kim, J. Klein, M. Monperrus, and Y. Le Traon, “Fixminer: Mining relevant fix patterns for automated program repair,” Empirical Software Engineering, vol. 25, no. 3, pp. 1980–2024, 2020, https://doi.org/10.1007/s10664-019-09780-z

  51. [53]

    Towards understanding the capability of large language models on code clone detection: A survey,

    S. Dou, J. Shan, H. Jia, W. Deng, Z. Xi, W. He, Y. Wu, T. Gui, Y. Liu, and X. Huang, “Towards understanding the capability of large language models on code clone detection: A survey,” Computing Research Repository, vol. abs/2308.01191, 2023, https://doi.org/10.48550/arXiv.2308.01191