Recognition: no theorem link
PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair
Pith reviewed 2026-05-13 19:18 UTC · model grok-4.3
The pith
Preservation-aware fine-tuning produces more accurate yet shorter patches in LLM-based program repair.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAFT derives token-level preservation signals by aligning buggy and fixed code, combines them with full-sequence masking, and applies an edit-difficulty curriculum. This produces models that generate plausible patches with higher pass@1 rates and lower average edit distances on Defects4J and HumanEval-Java, outperforming both standard supervised fine-tuning and the AdaPatcher baseline while requiring no extra search or reranking at inference time.
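pass@1 is the headline accuracy metric throughout. A reasonable reading is the standard unbiased pass@k estimator of Chen et al. (cited in the reference graph below); the paper's exact sampling protocol is an assumption here, so this is a sketch of the quantity, not the authors' code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k patches drawn from n generated samples
    (of which c pass the test suite) is plausible."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 sampled patches per bug, 2 of them plausible.
print(f"{pass_at_k(10, 2, 1):.2f}")  # 0.20
```

For pass@1 this reduces to the fraction of sampled patches that pass, averaged over bugs.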
What carries the argument
Token-level preservation signals obtained from aligning buggy and fixed code versions, combined with full-sequence masking and an edit-difficulty curriculum.
If this is right
- Patches concentrate changes on faulty regions while preserving stable surrounding code.
- Pass@1 rates rise by up to 65.6 percent over standard fine-tuning on the evaluated benchmarks.
- Average edit distance falls by up to 32.6 percent, lowering review effort.
- The method beats preference-based baselines such as AdaPatcher on Defects4J without inference-time search or post-processing.
- Smaller, localized patches emerge directly from the fine-tuned model.
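Average edit distance (AED) is the locality metric behind these claims. The granularity the paper uses (character vs. token level) is not restated on this page, so the following character-level Levenshtein sketch only illustrates the quantity being minimized:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance counting
    insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

buggy = "if (x > 0) return x;"
patch = "if (x >= 0) return x;"
print(levenshtein(buggy, patch))  # 1: a single inserted character
```

A minimal patch like the one above scores 1; a rewrite of the whole statement would score far higher, which is the over-editing AED penalizes.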
Where Pith is reading between the lines
- The preservation signals could be reused for other code-editing tasks such as refactoring or test generation.
- Testing the same signals on models larger than 6.7B parameters might show whether the locality benefit scales.
- Pairing PAFT with automated test-case generation could help when minimal edits still fail incomplete test suites.
- Lower edit distances imply reduced long-term maintenance costs once the patches are deployed.
Load-bearing premise
Aligning buggy and fixed code produces reliable token-level preservation signals that generalize without introducing alignment artifacts or dataset-specific biases.
What would settle it
Apply the same training procedure to a new program-repair dataset whose buggy-fixed pairs contain noisy or inconsistent alignments and check whether the reported gains in pass@1 and reductions in edit distance disappear.
Original abstract
Large language models (LLMs) are effective for automated program repair, but plausible patches that pass the full test suite often rewrite more code than necessary, increasing review and maintenance costs. This over-editing is common because most bugs are localized, while standard supervised fine-tuning provides no explicit signal about which tokens should be preserved and which should be changed. We propose PAFT, a preservation-aware fine-tuning method for minimal-edit program repair. PAFT derives token-level preservation signals by aligning buggy and fixed code, combines them with full-sequence masking, and applies an edit-difficulty curriculum. Across Defects4J and HumanEval-Java, PAFT improves pass@1 by up to 65.6% over standard supervised fine-tuning (StdFT) while reducing average edit distance (AED) by up to 32.6%. On Defects4J with DeepSeek-Coder-6.7B, PAFT also outperforms AdaPatcher, a strong preference-based repair baseline, improving pass@1 from 5.9% to 10.1% while reducing median AED from 61.0 to 42.0. Overall, PAFT preserves stable context and concentrates edits on faulty regions, yielding smaller, more localized, plausible patches without inference-time search, reranking, or post-processing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PAFT, a preservation-aware fine-tuning method for LLMs in automated program repair. It derives token-level preservation signals by aligning buggy and fixed code, combines them with full-sequence masking, and applies an edit-difficulty curriculum during training. Experiments on Defects4J and HumanEval-Java report up to 65.6% pass@1 gains over standard supervised fine-tuning, up to 32.6% reduction in average edit distance, and outperformance of AdaPatcher (pass@1 from 5.9% to 10.1%, median AED from 61.0 to 42.0) on Defects4J with DeepSeek-Coder-6.7B, yielding smaller localized patches without inference-time changes.
Significance. If the gains are shown to stem specifically from the alignment-derived preservation signals rather than masking or curriculum alone, PAFT would offer a lightweight training-time mechanism to reduce over-editing in repair patches, lowering review and maintenance costs while remaining compatible with existing inference pipelines.
major comments (2)
- [§3] §3 (Method): The token-level preservation labels are extracted via string alignment of buggy and fixed code; the manuscript does not specify the alignment algorithm (e.g., diff, edit script) or how it handles identifier renames, reordered equivalent statements, or whitespace-only changes. Because the loss directly weights these labels, any systematic misalignment would make the reported pass@1 and AED improvements benchmark-specific rather than a general advance in minimal-edit repair.
- [§4] §4 (Experiments): The headline results (65.6% pass@1 lift, 32.6% AED reduction, AdaPatcher comparison) are presented without error bars, ablation studies isolating the preservation signal from masking and curriculum, or statistical significance tests. Without these, it is impossible to confirm that the preservation component is load-bearing for the central claim.
minor comments (2)
- [Abstract] Abstract: The maxima 'up to 65.6%' (pass@1) and 'up to 32.6%' (AED) are reported without attribution; clarify whether they occur on the same benchmark/model pair.
- [§4.2] §4.2: Provide the exact number of training examples and the distribution of edit difficulties used in the curriculum.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving reproducibility and experimental rigor. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [§3] §3 (Method): The token-level preservation labels are extracted via string alignment of buggy and fixed code; the manuscript does not specify the alignment algorithm (e.g., diff, edit script) or how it handles identifier renames, reordered equivalent statements, or whitespace-only changes. Because the loss directly weights these labels, any systematic misalignment would make the reported pass@1 and AED improvements benchmark-specific rather than a general advance in minimal-edit repair.
Authors: We agree that the alignment procedure must be specified in detail to ensure reproducibility and support the generality of the claims. In the revised manuscript, we will describe the exact method: token sequences are extracted after normalizing whitespace and comments, then aligned using Python's difflib.SequenceMatcher, whose Ratcliff-Obershelp heuristic identifies maximal contiguous matching blocks. Tokens inside matching blocks are labeled as preserved. Identifier renames are treated as edits (as they differ in string content), reordered statements receive partial matches where possible, and whitespace-only differences are eliminated by normalization prior to alignment. We will add pseudocode, an example, and a brief discussion of limitations regarding purely semantic equivalences (e.g., commutative reordering) to §3. revision: yes
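The labeling the rebuttal describes can be sketched directly. The tokenizer and normalization are assumptions here (the rebuttal specifies only the difflib.SequenceMatcher step), so this is an illustration, not the paper's implementation:

```python
import difflib

def preservation_labels(buggy_tokens, fixed_tokens):
    """Label each token of the fixed sequence: 1 if it lies in a
    matching block shared with the buggy sequence (preserved),
    0 otherwise (edited). Follows the rebuttal's SequenceMatcher
    setup; tokenization is assumed to happen upstream."""
    sm = difflib.SequenceMatcher(a=buggy_tokens, b=fixed_tokens,
                                 autojunk=False)
    labels = [0] * len(fixed_tokens)
    for block in sm.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            labels[j] = 1
    return labels

buggy = ["return", "x", "+", "1", ";"]
fixed = ["return", "x", "-", "1", ";"]
print(preservation_labels(buggy, fixed))  # [1, 1, 0, 1, 1]
```

Only the changed operator is marked as an edit; the surrounding tokens carry the preservation signal that the loss then weights.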
Referee: [§4] §4 (Experiments): The headline results (65.6% pass@1 lift, 32.6% AED reduction, AdaPatcher comparison) are presented without error bars, ablation studies isolating the preservation signal from masking and curriculum, or statistical significance tests. Without these, it is impossible to confirm that the preservation component is load-bearing for the central claim.
Authors: We acknowledge that the current experimental section lacks the requested statistical controls and ablations. In the revision, we will add: (i) error bars computed from five independent training runs with distinct random seeds; (ii) ablation experiments comparing standard fine-tuning, masking-only, curriculum-only, and full PAFT; and (iii) statistical significance tests (paired t-tests on pass@1 and Wilcoxon signed-rank tests on AED) across the reported benchmarks. Updated tables and figures will be included in §4 to demonstrate that the preservation signal contributes measurably beyond the other components. revision: yes
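The proposed tests are straightforward to run with SciPy. The per-seed numbers below are synthetic placeholders (the rebuttal does not report the five-seed results), so only the shape of the analysis is shown:

```python
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical pass@1 (%) over five seeds, StdFT vs. PAFT.
stdft_pass1 = [5.4, 5.9, 6.1, 5.7, 6.0]
paft_pass1  = [9.8, 10.1, 10.4, 9.9, 10.3]

# Hypothetical median AED per seed for the same two systems.
stdft_aed = [62.0, 61.0, 60.5, 63.0, 61.5]
paft_aed  = [43.0, 42.0, 41.5, 44.0, 42.5]

# Paired t-test on pass@1 (same seeds, paired by run).
t_stat, t_p = ttest_rel(paft_pass1, stdft_pass1)
# Wilcoxon signed-rank on AED (robust to non-normal distances).
w_stat, w_p = wilcoxon(paft_aed, stdft_aed)

print(f"paired t-test on pass@1: t={t_stat:.2f}, p={t_p:.4f}")
print(f"Wilcoxon on AED: W={w_stat}, p={w_p:.4f}")
```

With only five seeds the Wilcoxon exact p-value bottoms out at 0.0625, which is why more seeds or per-bug pairing may be needed for a decisive result.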
Circularity Check
No circularity: preservation labels derived externally from code alignment
Full rationale
The paper defines PAFT by aligning buggy and fixed code to extract token-level preservation signals, then combining them with masking and a curriculum. This alignment step is external to the model and its outputs; no equations, fitted parameters, or self-citations are shown that reduce the claimed improvements (pass@1 lift, AED reduction) back to the inputs by construction. The derivation chain is therefore self-contained against the benchmarks.
Reference graph
Works this paper leans on
-
[1]
Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In 35th International Conference on Software Engineering, ICSE '13, San Francisco, CA, USA, May 18-26, 2013, David Notkin, Betty H. C. Cheng, and Klaus Pohl (Eds.). IEEE Computer Society, 712–721. doi:10.1109/ICSE.2013.6606617
-
[2]
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009 (ACM International Conference Proceeding Series, Vol. 382), Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman (Eds.)...
-
[3]
Worawalan Chatlatanagulchai, Kundjanasith Thonglek, Brittany Reid, Yutaro Kashiwa, Pattara Leelaprute, Arnon Rungsawang, Bundit Manaskasemsak, and Hajimu Iida. 2025. On the Use of Agentic Coding Manifests: An Empirical Study of Claude Code. In Product-Focused Software Process Improvement - 26th International Conference, PROFES 2025, Salerno, Italy, Decembe...
-
[4]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
-
[5]
Zhenlong Dai, Bingrui Chen, Zhuoluo Zhao, Xiu Tang, Sai Wu, Chang Yao, Zhipeng Gao, and Jingyuan Chen. 2025. Less Is More: Adaptive Program Repair with Bug Localization and Preference Learning. In AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, Toby Walsh, Julie Shah,...
-
[6]
DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. CoRR abs/2412.19437 (2024). arXiv:2412.19437 doi:10.48550/ARXIV.2412.19437
-
[7]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. (2023). http://papers.nips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html
-
[8]
Marco Di Biase, Magiel Bruntink, Arie Van Deursen, and Alberto Bacchelli. 2019. The effects of change decomposition on code review - a controlled experiment. PeerJ Computer Science 5 (2019), e193.
-
[9]
Ramtin Ehsani, Sakshi Pathak, Shriya Rawal, Abdullah Al Mujahid, Mia Mohammad Imran, and Preetha Chatterjee. 2026. Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub. CoRR abs/2601.15195 (2026). arXiv:2601.15195 doi:10.48550/ARXIV.2601.15195
-
[10]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. CoRR abs/2401.14196 (2024). arXiv:2401.14196 doi:10.48550/ARXIV.2401.14196
-
[12]
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9
-
[13]
Kai Huang, Xiangxin Meng, Jian Zhang, Yang Liu, Wenjie Wang, Shuhao Li, and Yuqing Zhang. 2024. An Empirical Study on Fine-Tuning Large Language Models of Code for Automated Program Repair. In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (Echternach, Luxembourg) (ASE '23). IEEE Press, 1162–1174. doi:10.1109/AS...
-
[14]
Siming Huang, Tianhao Cheng, Jason Klein Liu, Weidi Xu, Jiaran Hao, Liuyihan Song, Yang Xu, Jian Yang, Jiaheng Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Xianzhen Luo, Qiufeng Wang, YuanTao Fan, Qingfu Zhu, Zhaoxiang Zhang, Yang Gao, Jie Fu, Qian Liu, Houyi Li, Ge Zhang, Yuan Qi, Yinghui Xu, Wei Chu, and Zili Wang. 2025. OpenCoder: The Open Cookboo...
-
[15]
Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program Repair. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1430–1442. doi:10.1109/ICSE48619.2023.00125
-
[17]
René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In International Symposium on Software Testing and Analysis, ISSTA '14, San Jose, CA, USA - July 21 - 26, 2014, Corina S. Pasareanu and Darko Marinov (Eds.). ACM, 437–440. doi:10.1145/2610384.2628055
-
[18]
Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. IEEE Trans. Software Eng. 38, 1 (2012), 54–72. doi:10.1109/TSE.2011.104
-
[19]
Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 56–65. doi:10.1145/3318162
-
[20]
Kui Liu, Li Li, Anil Koyuncu, Dongsun Kim, Zhe Liu, Jacques Klein, and Tegawendé F. Bissyandé. 2021. A critical review on the evaluation of automated program repair systems. J. Syst. Softw. 171 (2021), 110817. doi:10.1016/J.JSS.2020.110817
-
[21]
Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=Bkg6RiCqY7
-
[22]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with hu-...
-
[23]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. (2023). http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html
-
[24]
Achyudh Ram, Anand Ashok Sawant, Marco Castelluccio, and Alberto Bacchelli. 2018. What makes a code change easier to review: an empirical investigation on code change reviewability. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 201–212...
-
[26]
John W. Ratcliff and David E. Metzener. 1988. Pattern Matching: The Gestalt Approach. Dr. Dobb's Journal (July 1988), 46. https://jacobfilipp.com/DrDobbs/articles/DDJ/1988/8807/8807c/8807c.htm (HTML reprint of the original magazine article)
-
[27]
André Silva, Sen Fang, and Martin Monperrus. 2025. RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair. IEEE Trans. Software Eng. 51, 8 (2025), 2366–2380. doi:10.1109/TSE.2025.3581062
-
[28]
Xiaolong Tian. 2024. Evaluating the Repair Ability of LLM Under Different Prompt Settings. In IEEE International Conference on Software Services Engineering, SSE 2024, Shenzhen, China, July 7-13, 2024. IEEE, 313–322. doi:10.1109/SSE62657.2024.00053
-
[29]
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-trained Language Models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1482–1494. doi:10.1109/ICSE48619.2023.00129
-
[30]
Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 819–831.
-
[31]
Junjielong Xu, Ying Fu, Shin Hwei Tan, and Pinjia He. 2025. Aligning the Objective of LLM-Based Program Repair. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 2548–2560.
-
[32]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha... 2025. arXiv:2505.09388. doi:10.48550/ARXIV.2505.09388
-
[33]
Boyang Yang, Zijian Cai, Fengling Liu, Bach Le, Lingming Zhang, Tegawendé F. Bissyandé, Yang Liu, and Haoye Tian. 2025. A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications. CoRR abs/2506.23749 (2025). arXiv:2506.23749 doi:10.48550/ARXIV.2506.23749
-
[34]
Boyang Yang, Haoye Tian, Weiguo Pian, Haoran Yu, Haitao Wang, Jacques Klein, Tegawendé F. Bissyandé, and Shunfu Jin. 2024. CREF: An LLM-Based Conversational Software Repair Framework for Programming Tutors. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, September 16-20, 2024, ...
-
[35]
Boyang Yang, Haoye Tian, Jiadong Ren, Hongyu Zhang, Jacques Klein, Tegawende Bissyande, Claire Le Goues, and Shunfu Jin. 2025. MORepair: Teaching LLMs to Repair Code via Multi-Objective Fine-Tuning. ACM Transactions on Software Engineering and Methodology (2025).
-
[36]
Jooyong Yi, Shin Hwei Tan, Sergey Mechtaev, Marcel Böhme, and Abhik Roychoudhury. 2018. A correlation study between automated program repair and test-suite metrics. In Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark...
-
[37]
Xin Yin, Chao Ni, Shaohua Wang, Zhenhao Li, Limin Zeng, and Xiaohu Yang. 2024. ThinkRepair: Self-Directed Automated Program Repair. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, September 16-20, 2024, Maria Christakis and Michael Pradel (Eds.). ACM, 1274–1286. doi:10.1145/3650212.3680359
-
[39]
Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure-Inducing Input. IEEE Trans. Software Eng. 28, 2 (2002), 183–200. doi:10.1109/32.988498
-
[40]
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. LIMA: Less Is More for Alignment. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS...