arxiv: 2606.28998 · v1 · pith:ZAK6A7DInew · submitted 2026-06-27 · 💻 cs.SE · cs.AI

Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation

Pith reviewed 2026-06-30 08:27 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code alignment · pretrained versus finetuned · DPO · BoNBoN · SelfCodeAlign · code generation · functional and non-functional requirements

The pith

Aligning code LLMs from pretrained bases yields larger gains than from finetuned versions, though absolute accuracy stays lower.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether reward-free alignment for code generation works better when applied to base pretrained LLMs or to their already instruction-tuned versions. It applies DPO and BoNBoN to five models after creating preference pairs with SelfCodeAlign, then measures results on four functional benchmarks and one non-functional code-quality benchmark. Pretrained starting points show bigger relative lifts after alignment. Finetuned starting points show smaller lifts or outright drops.

Core claim

Pretrained-to-aligned pathways yield larger improvements of the aligned variant over its pretrained starting point, although the pretrained variant is generally less accurate than its finetuned counterpart. Finetuned-to-aligned pathways yield smaller improvements and, in some cases, an aligned variant that performs worse than its finetuned starting point.

What carries the argument

The pretrained-to-aligned versus finetuned-to-aligned comparison using reward-free DPO and BoNBoN on SelfCodeAlign preference pairs.
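For readers unfamiliar with the objective, the standard per-pair DPO loss (Rafailov et al. [40]) is -log σ(β[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]), where y_w is the accepted response and y_l the rejected one. The sketch below computes it from summed sequence log-probabilities. The function names, argument names, and the default β = 0.1 are illustrative assumptions, not settings reported by the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    beta=0.1 is an assumed default for illustration only.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

In a typical DPO setup the reference model is the frozen starting checkpoint, so in the two pathways compared here it would be either the pretrained or the finetuned model. When policy and reference agree, the margin is zero and the loss equals log 2.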

If this is right

  • Alignment produces bigger relative gains when started from pretrained models rather than finetuned ones.
  • Even after alignment, models begun from pretrained checkpoints remain less accurate than those begun from finetuned checkpoints.
  • Further alignment on already finetuned code models can yield little benefit or can reduce performance.
  • Reward-free methods can be applied in either pathway, but the size and direction of the effect differ by starting point.
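The distinction between gain size and final score can be made concrete with a small calculation. The sketch below is illustrative only: the helper name is hypothetical and the scores are invented for the example, not taken from the paper.

```python
def alignment_delta(before, after):
    """Return (absolute gain in points, relative gain as a fraction of the start)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Hypothetical benchmark scores in percent. These are NOT results from the paper.
pretrained = alignment_delta(before=40.0, after=50.0)
finetuned = alignment_delta(before=60.0, after=61.5)
```

With these invented numbers the pretrained pathway gains 10 points (25% relative) and the finetuned pathway gains 1.5 points (2.5% relative), yet the finetuned pathway still ends with the higher score. A larger uplift and a higher final score are separate questions.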

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • When building a new code-specialized LLM, starting alignment from the base pretrained checkpoint may be preferable if the priority is maximizing relative improvement.
  • The trade-off suggests practitioners may need to choose between maximizing final accuracy (favoring finetuned start) and maximizing alignment uplift (favoring pretrained start).
  • The observed pattern could be tested on non-code tasks to see whether the same starting-point dependence appears outside software engineering.

Load-bearing premise

The SelfCodeAlign pipeline produces high-quality preference pairs that correctly reflect functional and non-functional code quality, and the five chosen benchmarks provide unbiased, comprehensive measurement of both correctness and software-engineering quality dimensions.

What would settle it

Replace the SelfCodeAlign preference pairs with human-annotated ones and re-run the pretrained versus finetuned alignment experiments to check whether the pattern of larger gains from pretrained still appears.

Figures

Figures reproduced from arXiv: 2606.28998 by Gias Uddin, Sanjeepan Sivapiran.

Figure 1: Examples of aligned vs. misaligned LLM responses across different task domains.
Figure 2: RLHF vs DPO.
    LLM alignment employs several techniques to train models using preference data to produce outputs that better reflect human values and preferences. These approaches can be categorized into two main types: reward-based (RLHF [38], RLAIF [27]) and reward-free approaches (DPO [40], KTO [12], ORPO [19], RPO [59]) [57]. Reward-based techniques include Reinforcement Learning from Human Feedback (RLHF)…
Figure 3: Best-of-N and BoNBoN.
    Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026.
Figure 4: Data Generation Workflow.
    3.1.1 Functional Requirements in Coding Tasks. Functional requirements specify what the code must accomplish to be executable and produce correct outputs. This includes generating syntactically correct, compilable code in target programming languages that follows established standards, incorporates proper error handling (…
Figure 5: Error handling.
    Our methodology integrates prior work on LLM alignment and code generation, notably SelfCodeAlign [53] and BoNBoN [16]. We utilize SelfCodeAlign to generate fully transparent training preference datasets for Direct Preference Optimization (DPO) and BoNBoN LLM alignment techniques, as shown in…
Figure 6: Code Readability.
    Our model selection prioritizes diversity among selected models while maintaining a small resource footprint. The selection consists of different sizes ranging from 1.3B to 8B parameters, providing insight into how model scale affects alignment effectiveness. We include both general-purpose models (Meta-Llama-3-8B) and code-specialized models (Qwen2.5-Coder, CodeLlama, deepseek-coder), ena…
Original abstract

Large Language Model (LLM) alignment trains an LLM using preference data to produce outputs that better meet established quality standards. While LLM alignment techniques are studied for non-coding tasks, we know little about their usefulness for coding tasks. It is unclear whether LLM code alignment could support both functional requirements (producing executable, correct code) and non-functional requirements (code readability, style, maintainability). It is also unknown whether alignment for a code LLM should begin with the base pretrained version or the finetuned (i.e., instruction-tuned) version of the LLM. In this paper, we offer insights on the above two research questions by conducting an empirical study. We studied five state-of-the-art (SOTA) LLMs using two widely used LLM alignment techniques: Direct Preference Optimization (DPO) and BoNBoN. For each training record, we created a preference pair of accepted and rejected instances by using the SelfCodeAlign pipeline. DPO and BoNBoN are reward-free models, i.e., they eliminate the need for multiple reward scores for output preferences. We tuned each LLM using the two alignment techniques in two settings: pretrained and finetuned versions of an LLM. We evaluated functional requirements using four SOTA benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional requirements using the CODAL benchmark, which evaluates code quality across five dimensions derived from software engineering practices. We find that pretrained-to-aligned pathways achieve larger improvements in the aligned variant over its pretrained variant. But the pretrained variant is generally less accurate than its finetuned variant. However, finetuned-to-aligned offers smaller performance improvements or, in some cases, degradation in the aligned variant relative to its finetuned variant.
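The abstract's notion of a preference pair (one accepted and one rejected instance per training record) can be sketched with a simple execution filter. This is a generic illustration, not the SelfCodeAlign implementation: the helper names, the dict layout, and the rule "first passing candidate is accepted, first failing candidate is rejected" are all assumptions made for the example.

```python
def passes(candidate_src, test_src):
    """Run a candidate and its test in a fresh namespace; True if nothing raises.

    Illustration only: exec() is unsafe on untrusted code and does not guard
    against infinite loops. A real pipeline would use a sandbox with timeouts.
    """
    namespace = {}
    try:
        exec(candidate_src, namespace)
        exec(test_src, namespace)
        return True
    except Exception:
        return False


def build_preference_pair(prompt, candidates, test_src):
    """Return {prompt, chosen, rejected}, or None when no contrast exists."""
    results = [(c, passes(c, test_src)) for c in candidates]
    passing = [c for c, ok in results if ok]
    failing = [c for c, ok in results if not ok]
    if not passing or not failing:
        return None  # all candidates agree, so there is nothing to prefer
    return {"prompt": prompt, "chosen": passing[0], "rejected": failing[0]}


# Usage with two hypothetical candidates for one prompt.
candidates = [
    "def add(a, b):\n    return a - b\n",  # wrong
    "def add(a, b):\n    return a + b\n",  # correct
]
pair = build_preference_pair("Write add(a, b).", candidates, "assert add(2, 3) == 5")
```

A filter like this only captures functional correctness. Preferences on non-functional properties such as readability or style would need a different signal.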

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper conducts an empirical comparison of reward-free alignment (DPO and BoNBoN) on five SOTA LLMs, using SelfCodeAlign to generate preference pairs. It evaluates two pathways—pretrained-to-aligned versus finetuned-to-aligned—on functional correctness benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional code quality (CODAL across five SE dimensions), claiming larger gains from the pretrained starting point despite lower absolute performance.

Significance. If the differential gains are robust, the result would inform whether code alignment pipelines should begin from base pretrained models rather than instruction-tuned checkpoints, with implications for balancing functional correctness against maintainability and style. The reward-free framing and dual evaluation of functional/non-functional requirements are positive aspects.

major comments (3)
  1. [Abstract] Abstract and experimental protocol: the central claim of larger improvements in pretrained-to-aligned versus finetuned-to-aligned rests on reported performance deltas, yet the manuscript supplies no statistical tests, confidence intervals, or error bars on any benchmark scores, leaving open whether observed differences exceed prompt or sampling variance.
  2. [Abstract] SelfCodeAlign pipeline description: the preference pairs used for both DPO and BoNBoN are generated by this pipeline, but the text provides no validation (e.g., correlation with HumanEval+ execution outcomes, inter-rater agreement, or human judgment) that accepted/rejected instances accurately reflect functional correctness or the five CODAL dimensions; without this, differential gains could reflect model sensitivity to pipeline artifacts rather than genuine alignment dynamics.
  3. [Abstract] Benchmark and confound discussion: the study cites five benchmarks but does not address potential overlap between the SelfCodeAlign training data and the evaluation sets, nor does it report ablations on prompt sensitivity or multiple random seeds, both of which are load-bearing for the comparative claim across pretrained and finetuned starting points.
minor comments (1)
  1. [Abstract] The abstract states that "pretrained variant is generally less accurate than its finetuned variant" but does not quantify this baseline gap on each benchmark, which would help readers interpret the magnitude of subsequent alignment gains.
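Major comment 3 raises possible overlap between training prompts and evaluation sets. One simple lexical check is n-gram Jaccard similarity between every training prompt and every benchmark prompt. The sketch below assumes word trigrams and a 0.5 flagging threshold. Both choices and all function names are hypothetical, and the check does not detect paraphrased (semantic) overlap.

```python
import re


def ngrams(text, n=3):
    """Set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def flag_overlaps(train_prompts, eval_prompts, threshold=0.5, n=3):
    """Return (train_index, eval_index, score) for pairs at or above threshold."""
    flagged = []
    for i, t in enumerate(train_prompts):
        for j, e in enumerate(eval_prompts):
            score = jaccard(t, e, n)
            if score >= threshold:
                flagged.append((i, j, score))
    return flagged
```

The pairwise loop is quadratic in the number of prompts, which is fine for benchmark-sized evaluation sets but would need indexing (for example, hashing n-grams) for large training corpora.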

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical rigor, pipeline validation, and potential confounds. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core empirical findings.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental protocol: the central claim of larger improvements in pretrained-to-aligned versus finetuned-to-aligned rests on reported performance deltas, yet the manuscript supplies no statistical tests, confidence intervals, or error bars on any benchmark scores, leaving open whether observed differences exceed prompt or sampling variance.

    Authors: We agree that the absence of statistical tests and error bars weakens the robustness of the comparative claims. In the revised manuscript we will add 95% bootstrap confidence intervals for all reported scores and conduct paired statistical tests (McNemar’s test for pass@1 rates and Wilcoxon signed-rank for other metrics) on the deltas between the two alignment pathways to demonstrate that the larger gains from the pretrained starting point exceed sampling variance. revision: yes

  2. Referee: [Abstract] SelfCodeAlign pipeline description: the preference pairs used for both DPO and BoNBoN are generated by this pipeline, but the text provides no validation (e.g., correlation with HumanEval+ execution outcomes, inter-rater agreement, or human judgment) that accepted/rejected instances accurately reflect functional correctness or the five CODAL dimensions; without this, differential gains could reflect model sensitivity to pipeline artifacts rather than genuine alignment dynamics.

    Authors: The SelfCodeAlign pipeline relies on execution-based filtering for functional correctness and static-analysis rules for the CODAL dimensions; these mechanisms are inherited from the original pipeline paper. We acknowledge that our manuscript does not report explicit validation metrics. We will add a dedicated subsection that computes the correlation between pipeline labels and HumanEval+ pass rates on a 200-example held-out sample, reports inter-rater agreement from our internal human review of 100 pairs, and discusses any residual risk of pipeline artifacts. revision: yes

  3. Referee: [Abstract] Benchmark and confound discussion: the study cites five benchmarks but does not address potential overlap between the SelfCodeAlign training data and the evaluation sets, nor does it report ablations on prompt sensitivity or multiple random seeds, both of which are load-bearing for the comparative claim across pretrained and finetuned starting points.

    Authors: We did not perform explicit overlap checks or multi-seed runs in the original experiments. In revision we will (1) quantify lexical and semantic overlap between SelfCodeAlign training prompts and the five evaluation benchmarks, (2) report all main results averaged over three random seeds, and (3) include a prompt-sensitivity ablation on two representative models (one pretrained, one finetuned) to confirm that the differential gains are not driven by prompt wording. revision: yes
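The analyses named in response 1 can be sketched with the standard library alone. The code below shows a percentile bootstrap confidence interval for a pass@1 rate and a two-sided exact McNemar test on paired per-problem outcomes. It is a sketch of those generic techniques, not the authors' analysis code: the function names, the 2000-resample default, and the fixed seed are assumptions.

```python
import random
from math import comb


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 per-problem outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample problems with replacement and record each resample's pass rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for paired 0/1 outcomes a and b."""
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    n = only_a + only_b  # discordant pairs carry all the evidence
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Here `a` and `b` would be the pass/fail outcomes of two model variants on the same benchmark problems, for example a starting checkpoint and its aligned version. McNemar's test uses only the problems on which the two variants disagree.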

Circularity Check

0 steps flagged

No circularity: empirical head-to-head comparison on external benchmarks

The paper reports an empirical study that applies DPO and BoNBoN to pretrained vs. finetuned LLMs using preference pairs generated by the external SelfCodeAlign pipeline, then measures outcomes on five independent benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval, CODAL). No equations, fitted parameters, or predictions are defined inside the paper; reported improvements are direct differences against those external benchmarks. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained against outside data and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central findings rest on two untested domain assumptions: that SelfCodeAlign preference pairs faithfully capture desired code properties and that the selected benchmarks validly measure both functional correctness and non-functional software quality. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption SelfCodeAlign pipeline produces valid accepted/rejected preference pairs for code quality
    Used to generate training data for both DPO and BoNBoN alignment.
  • domain assumption HumanEval+, MBPP+, EvalPerf, EvoEval and CODAL benchmarks accurately and comprehensively measure functional and non-functional code requirements
    All evaluation conclusions depend on these benchmarks reflecting real software-engineering needs.

Reference graph

Works this paper leans on

62 extracted references · 59 canonical work pages · 26 internal anchors

  1. [1]

    Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. 2025. OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs. arXiv preprint arXiv:2504.04030 (2025). doi:10.48550/arXiv.2504.04030

  2. [2]

    Jomar Thomas Almonte, Santhosh Anitha Boominathan, and Nathalia Nascimento. 2025. Automated Non-Functional Requirements Generation in Software Engineering with Large Language Models: A Comparative Study. arXiv preprint arXiv:2503.15248 (2025). doi:10.48550/arXiv.2503.15248

  3. [3]

    AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card 1, 1 (2024), 4

  4. [4]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021). doi:10.48550/arXiv.2108.07732

  5. [5]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021). doi:10.48550/arXiv.2107.03374

  6. [7]

    Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70, 4 (1968), 213. doi:10.1037/h0026256

  7. [8]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025). doi:10.48550/arXiv.2507.06261

  8. [9]

    Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. 2025. Less is more: Improving llm alignment via preference data selection. arXiv preprint arXiv:2502.14560 (2025). doi:10.48550/arXiv.2502.14560

  9. [10]

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861 (2023). doi:10.48550/arXiv.2308.01861

  10. [11]

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475 (2024). doi:10.48550/arXiv.2404.04475

  11. [12]

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306 (2024). doi:10.48550/arXiv.2402.01306

  12. [13]

    Lovedeep Gondara, Jonathan Simkin, Graham Sayle, Shebnum Devji, Gregory Arbour, and Raymond Ng. 2025. Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare. arXiv preprint arXiv:2504.21191 (2025). doi:10.48550/arXiv.2504.21191

  13. [14]

    Leo A. Goodman. 1961. Snowball Sampling. The Annals of Mathematical Statistics 32, 1 (1961), 148–170. doi:10.1214/aoms/1177705148

  14. [15]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024). doi:10.48550/arXiv.2407.21783

  15. [16]

    Lin Gui, Cristina Garbacea, and Victor Veitch. 2024. BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.52202/079017-0094

  16. [17]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025). doi:10.48550/arXiv.2501.12948

  17. [18]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024). doi:10.48550/arXiv.2401.14196

  18. [19]

    Jiwoo Hong, Noah Lee, and James Thorne. 2024. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691 (2024). doi:10.48550/arXiv.2403.07691

  19. [20]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186 (2024). doi:10.48550/arXiv.2409.12186

  20. [21]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL] doi:10.48550/arXiv.2406.00515

  21. [22]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023). doi:10.4855...

  22. [23]

    Ishan Jindal, Chandana Badrinath, Pranjal Bharti, Lakkidi Vinay, and Sachin Dev Sharma. 2024. Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs. arXiv preprint arXiv:2410.10739 (2024). doi:10.48550/arXiv.2410.10739

  23. [24]

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526. doi:10.1073/pnas.1611835114

  24. [25]

    Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. 2022. The Stack: 3 TB of permissively licensed source code. arXiv:2211.15533 [cs.CL] doi:10.48550/arXiv.2211.15533

  25. [26]

    Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. 2023. Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105 (2023). doi:10.48550/arXiv.2309.10105

  26. [27]

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. 2023. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267 (2023). doi:10.48550/arXiv.2309.00267

  27. [29]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 21558–21...

  28. [30]

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2025. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. arXiv:2308.08747 [cs.CL] doi:10.48550/arXiv.2308.08747

  29. [31]

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568 (2023). doi:10.48550/arXiv.2306.08568

  30. [32]

    Zeyuan Ma, Hongshu Guo, Jiacheng Chen, Guojun Peng, Zhiguang Cao, Yining Ma, and Yue-Jiao Gong. 2024. Llamoco: Instruction tuning of large language models for optimization code generation. arXiv preprint arXiv:2403.01131 (2024). doi:10.48550/arXiv.2403.01131

  31. [33]

    Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica 22, 3 (2012), 276–282

  32. [34]

    Yibo Miao, Bofei Gao, Shanghaoran Quan, Junyang Lin, Daoguang Zan, Jiaheng Liu, Jian Yang, Tianyu Liu, and Zhijie Deng. 2024. Aligning codellms with direct preference optimization. arXiv preprint arXiv:2410.18585 (2024). doi:10.48550/arXiv.2410.18585

  33. [35]

    Sherin Muckatira, Vijeta Deshpande, Vladislav Lialin, and Anna Rumshisky. 2024. Emergent abilities in reduced-scale generative language models. arXiv preprint arXiv:2404.02204 (2024). doi:10.48550/arXiv.2404.02204

  34. [36]

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models. In NeurIPS 2023 workshop on instruction tuning and instruction following. doi:10.48550/arXiv.2308.07124

  35. [37]

    OpenAI, Josh Achiam, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] doi:10.48550/arXiv.2303.08774

  36. [38]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744. doi:10.48550/arXiv.2203.02155

  37. [39]

    Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. 2024. Iterative reasoning preference optimization. Advances in Neural Information Processing Systems 37 (2024), 116617–116637. doi:10.48550/arXiv.2404.19733

  38. [40]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Thirty-seventh Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2305.18290

  39. [41]

    Vinay Venkatesh Ramasesh, Aitor Lewkowycz, and Ethan Dyer. 2021. Effect of scale on catastrophic forgetting in neural networks. In International Conference on Learning Representations. doi:10.48550/arXiv.2010.04495

  40. [42]

    Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. 2024. Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning. arXiv:2402.18865 [cs.LG] doi:10.48550/arXiv.2402.18865

  41. [43]

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023). doi:10.48550/arXiv.2308.12950

  42. [44]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 doi:10.48550/arXiv.1707.06347

  43. [45]

    Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness. arXiv preprint arXiv:2401.15963 (2024). doi:10.48550/arXiv.2401.15963

  44. [46]

    Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. 2025. Overtrained language models are harder to fine-tune. arXiv preprint arXiv:2503.19206 (2025). doi:10.48550/arXiv.2503.19206

  45. [47]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008–3021. doi:10.48550/arXiv.2009.01325

  46. [48]

    Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. 2024. Code less, align more: Efficient llm fine-tuning for code generation with data pruning. arXiv preprint arXiv:2407.05040 (2024). doi:10.48550/arXiv.2407.05040

  47. [49]

    Jonathan Ullrich, Matthias Koch, and Andreas Vogelsang. 2025. From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering. arXiv preprint arXiv:2507.07548 (2025). doi:10.48550/arXiv.2507.07548

  48. [50]

    Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023). doi:10.48550/arXiv.2305.07922

  49. [51]

    Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. 2024. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. arXiv preprint arXiv:2407.16216 (2024). doi:10.48550/arXiv.2407.16216

  50. [52]

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022). doi:10.48550/arXiv.2206.07682

  51. [53]

    Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro Von Werra, Arjun Guha, and Lingming Zhang. 2024. SelfCodeAlign: Self-Alignment for Code Generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2410.24198

  52. [54]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Empowering code generation with oss-instruct. arXiv preprint arXiv:2312.02120 (2023). doi:10.48550/arXiv.2312.02120

  53. [55]

    Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui. 2025. CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences. ACM Trans. Softw. Eng. Methodol. (May 2025). doi:10.1145/3736407 Just Accepted.

  54. [56]

    Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 2024. Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM. In First Conference on Language Modeling. doi:10.48550/arXiv.2403.19114

  55. [57]

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. 2024. Is dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719 (2024). doi:10.48550/arXiv.2404.10719

  56. [58]

    Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, and David Lo. 2024. Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code. arXiv:2403.07506 [cs.SE] doi:10.48550/arXiv.2403.07506

  57. [59]

    Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, and Mingyuan Zhou. 2024. Relative preference optimization: Enhancing llm alignment through contrasting responses across identical and diverse prompts. arXiv preprint arXiv:2402.10958 (2024). doi:10.48550/arXiv.2402.10958

  58. [60]

    Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12. doi:10.48550/arXiv.2302.00288

  59. [61]

    Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. 2024. Codedpo: Aligning code models with self generated and verified source code. arXiv preprint arXiv:2410.05605 (2024). doi:10.48550/arXiv.2410.05605

  60. [62]

    Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. 2024. Gpt4roi: Instruction tuning large language model on region-of-interest. In European Conference on Computer Vision. Springer, 52–70. doi:10.48550/arXiv.2307.03601

  61. [63]

    Yichi Zhang, Zhuo Chen, Yin Fang, Yanxi Lu, Li Fangming, Wen Zhang, and Huajun Chen. 2024. Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thaila...

  62. [64]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623. doi:10.48550/arXiv.2306.05685

Received 2026-02-10; accepted 2026-03-24