Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation

Gias Uddin; Sanjeepan Sivapiran

arxiv: 2606.28998 · v1 · pith:ZAK6A7DInew · submitted 2026-06-27 · 💻 cs.SE · cs.AI

Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation

Gias Uddin , Sanjeepan Sivapiran This is my paper

Pith reviewed 2026-06-30 08:27 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords LLM code alignmentpretrained versus finetunedDPOBoNBoNSelfCodeAligncode generationfunctional and non-functional requirements

0 comments

The pith

Aligning code LLMs from pretrained bases yields larger gains than from finetuned versions, though absolute accuracy stays lower.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether reward-free alignment for code generation works better when applied to base pretrained LLMs or to their already instruction-tuned versions. It applies DPO and BoNBoN to five models after creating preference pairs with SelfCodeAlign, then measures results on four functional benchmarks and one non-functional code-quality benchmark. Pretrained starting points show bigger relative lifts after alignment. Finetuned starting points show smaller lifts or outright drops.

Core claim

Pretrained-to-aligned pathways achieve larger improvements in the aligned variant over its pretrained variant. But the pretrained variant is generally less accurate than its finetuned variant. However, finetuned-to-aligned offers smaller performance improvements or, in some cases, degradation in the aligned variant than its finetuned variant.

What carries the argument

The pretrained-to-aligned versus finetuned-to-aligned comparison using reward-free DPO and BoNBoN on SelfCodeAlign preference pairs.

If this is right

Alignment produces bigger relative gains when started from pretrained models rather than finetuned ones.
Even after alignment, models begun from pretrained checkpoints remain less accurate than those begun from finetuned checkpoints.
Further alignment on already finetuned code models can yield little benefit or can reduce performance.
Reward-free methods can be applied in either pathway, but the size and direction of the effect differ by starting point.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

When building a new code-specialized LLM, starting alignment from the base pretrained checkpoint may be preferable if the priority is maximizing relative improvement.
The trade-off suggests practitioners may need to choose between maximizing final accuracy (favoring finetuned start) and maximizing alignment uplift (favoring pretrained start).
The observed pattern could be tested on non-code tasks to see whether the same starting-point dependence appears outside software engineering.

Load-bearing premise

The SelfCodeAlign pipeline produces high-quality preference pairs that correctly reflect functional and non-functional code quality, and the five chosen benchmarks provide unbiased, comprehensive measurement of both correctness and software-engineering quality dimensions.

What would settle it

Replace the SelfCodeAlign preference pairs with human-annotated ones and re-run the pretrained versus finetuned alignment experiments to check whether the pattern of larger gains from pretrained still appears.

Figures

Figures reproduced from arXiv: 2606.28998 by Gias Uddin, Sanjeepan Sivapiran.

**Figure 2.** Figure 2: RLHF vs DPO LLM alignment employs several techniques to train models using preference data to produce outputs that better reflect human values and preferences. These approaches can be categorized into two main types: reward-based (RLHF [38], RLAIF [27]) and reward-free approaches (DPO [40], KTO [12], ORPO [19], RPO[59]) [57]. Reward-based techniques include Reinforcement Learning from Human Feedback (RLHF)… view at source ↗

**Figure 3.** Figure 3: Best-of N and BoNBoN Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Data Generation Workflow 3.1.1 Functional Requirements in Coding Tasks. Functional requirements specify what the code must accomplish to be executable and produce correct outputs. This includes generating syntactically correct, compilable code in target programming languages that follows established standards, incorporates proper error handling ( [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Error handling Our methodology integrates prior work on LLM alignment and code generation, notably SelfCodeAlign [53] and BoNBoN [16]. We utilize SelfCodeAlign to generate fully transparent training preference datasets for Direct Preference Optimization (DPO) and BoNBoN LLM alignment techniques, as shown in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Code Readability Our model selection prioritizes diversity among selected models while maintaining a small resource footprint. The selection consists of different sizes ranging from 1.3B to 8B parameters, providing insight into how model scale affects alignment effectiveness. We include both generalpurpose models (Meta-Llama-3-8B) and code-specialized models (Qwen2.5-Coder, CodeLlama, deepseek-coder), ena… view at source ↗

read the original abstract

Large Language Model (LLM) alignment trains an LLM using preference data to produce outputs that better meet established quality standards. While LLM alignment techniques are studied for non-coding tasks, we know little about their usefulness for coding tasks. It is unclear whether LLM code alignment could support both functional requirements (producing executable, correct code) and non-functional requirements (code readability, style, maintainability). It is also unknown whether alignment for a code LLM should begin with base pretrained version or the finetuned (i.e., instruction-tuned) version of the LLM. In this paper, we offer insights on the above two research questions by conducting an empirical study. We studied five state-of-the-art (SOTA) LLMs using two widely used LLM alignment techniques: Direct Preference Optimization (DPO) and BoNBoN. For each training record, we created a preference pair as accepted and rejected instances by using the SelfCodeAlign pipeline. DPO and BoNBoN are reward-free models, i.e., they eliminate the need for multiple reward scores for output preferences. We tuned each LLM using the two alignment techniques in two settings: pretrained and finetuned versions of an LLM. We evaluated functional requirements using four SOTA benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional requirements using the CODAL benchmark, which evaluates code quality across five dimensions derived from software engineering practices. We find that pretrained-to-aligned pathways achieve larger improvements in the aligned variant over its pretrained variant. But the pretrained variant is generally less accurate than its finetuned variant. However, finetuned- to-aligned offers smaller performance improvements or, in some cases, degradation in the aligned variant than its finetuned variant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper runs a head-to-head empirical comparison of reward-free alignment on code LLMs starting from pretrained versus finetuned checkpoints and reports larger relative gains from the pretrained side.

read the letter

The main new piece here is the direct pretrained-versus-finetuned comparison for DPO and BoNBoN on code tasks. The authors generate preference pairs with SelfCodeAlign, align five models in both settings, and measure on HumanEval+, MBPP+, EvalPerf, EvoEval for correctness plus CODAL for readability, style, and maintainability. They report that the pretrained-to-aligned route produces bigger lifts over its starting point, while the finetuned-to-aligned route yields smaller gains or occasional drops, even though the finetuned starting points are stronger in absolute terms.

The work is straightforward about applying existing reward-free methods to code and about testing both functional and non-functional dimensions. That scope is useful for anyone who already ships code models and has to decide where to apply alignment.

The soft spots are exactly where the stress-test note flags them. The abstract and setup give no evidence that the SelfCodeAlign pairs track execution outcomes or human quality judgments, no inter-rater numbers, and no correlation checks against the benchmarks. Without that, the differential gains could be an artifact of how sensitive each base model is to whatever labeling noise is present. There are also no error bars, statistical tests, or ablations on prompt sensitivity or benchmark overlap mentioned.

This is for practitioners tuning code LLMs rather than theorists. A reader who needs data points on starting-point choice will find something to consider, but will also want the missing validation before treating the trade-off as settled.

It is worth sending to peer review. The question is practical, the protocol is replicable in principle, and the gaps are fixable with added checks on the preference data and basic statistical reporting.

Referee Report

3 major / 1 minor

Summary. The paper conducts an empirical comparison of reward-free alignment (DPO and BoNBoN) on five SOTA LLMs, using SelfCodeAlign to generate preference pairs. It evaluates two pathways—pretrained-to-aligned versus finetuned-to-aligned—on functional correctness benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional code quality (CODAL across five SE dimensions), claiming larger gains from the pretrained starting point despite lower absolute performance.

Significance. If the differential gains are robust, the result would inform whether code alignment pipelines should begin from base pretrained models rather than instruction-tuned checkpoints, with implications for balancing functional correctness against maintainability and style. The reward-free framing and dual evaluation of functional/non-functional requirements are positive aspects.

major comments (3)

[Abstract] Abstract and experimental protocol: the central claim of larger improvements in pretrained-to-aligned versus finetuned-to-aligned rests on reported performance deltas, yet the manuscript supplies no statistical tests, confidence intervals, or error bars on any benchmark scores, leaving open whether observed differences exceed prompt or sampling variance.
[Abstract] SelfCodeAlign pipeline description: the preference pairs used for both DPO and BoNBoN are generated by this pipeline, but the text provides no validation (e.g., correlation with HumanEval+ execution outcomes, inter-rater agreement, or human judgment) that accepted/rejected instances accurately reflect functional correctness or the five CODAL dimensions; without this, differential gains could reflect model sensitivity to pipeline artifacts rather than genuine alignment dynamics.
[Abstract] Benchmark and confound discussion: the study cites five benchmarks but does not address potential overlap between the SelfCodeAlign training data and the evaluation sets, nor does it report ablations on prompt sensitivity or multiple random seeds, both of which are load-bearing for the comparative claim across pretrained and finetuned starting points.

minor comments (1)

[Abstract] The abstract states that "pretrained variant is generally less accurate than its finetuned variant" but does not quantify this baseline gap on each benchmark, which would help readers interpret the magnitude of subsequent alignment gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical rigor, pipeline validation, and potential confounds. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core empirical findings.

read point-by-point responses

Referee: [Abstract] Abstract and experimental protocol: the central claim of larger improvements in pretrained-to-aligned versus finetuned-to-aligned rests on reported performance deltas, yet the manuscript supplies no statistical tests, confidence intervals, or error bars on any benchmark scores, leaving open whether observed differences exceed prompt or sampling variance.

Authors: We agree that the absence of statistical tests and error bars weakens the robustness of the comparative claims. In the revised manuscript we will add 95% bootstrap confidence intervals for all reported scores and conduct paired statistical tests (McNemar’s test for pass@1 rates and Wilcoxon signed-rank for other metrics) on the deltas between the two alignment pathways to demonstrate that the larger gains from the pretrained starting point exceed sampling variance. revision: yes
Referee: [Abstract] SelfCodeAlign pipeline description: the preference pairs used for both DPO and BoNBoN are generated by this pipeline, but the text provides no validation (e.g., correlation with HumanEval+ execution outcomes, inter-rater agreement, or human judgment) that accepted/rejected instances accurately reflect functional correctness or the five CODAL dimensions; without this, differential gains could reflect model sensitivity to pipeline artifacts rather than genuine alignment dynamics.

Authors: The SelfCodeAlign pipeline relies on execution-based filtering for functional correctness and static-analysis rules for the CODAL dimensions; these mechanisms are inherited from the original pipeline paper. We acknowledge that our manuscript does not report explicit validation metrics. We will add a dedicated subsection that computes the correlation between pipeline labels and HumanEval+ pass rates on a 200-example held-out sample, reports inter-rater agreement from our internal human review of 100 pairs, and discusses any residual risk of pipeline artifacts. revision: yes
Referee: [Abstract] Benchmark and confound discussion: the study cites five benchmarks but does not address potential overlap between the SelfCodeAlign training data and the evaluation sets, nor does it report ablations on prompt sensitivity or multiple random seeds, both of which are load-bearing for the comparative claim across pretrained and finetuned starting points.

Authors: We did not perform explicit overlap checks or multi-seed runs in the original experiments. In revision we will (1) quantify lexical and semantic overlap between SelfCodeAlign training prompts and the five evaluation benchmarks, (2) report all main results averaged over three random seeds, and (3) include a prompt-sensitivity ablation on two representative models (one pretrained, one finetuned) to confirm that the differential gains are not driven by prompt wording. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical head-to-head comparison on external benchmarks

full rationale

The paper reports an empirical study that applies DPO and BoNBoN to pretrained vs. finetuned LLMs using preference pairs generated by the external SelfCodeAlign pipeline, then measures outcomes on five independent benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval, CODAL). No equations, fitted parameters, or predictions are defined inside the paper; reported improvements are direct differences against those external benchmarks. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained against outside data and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central findings rest on two untested domain assumptions: that SelfCodeAlign preference pairs faithfully capture desired code properties and that the selected benchmarks validly measure both functional correctness and non-functional software quality. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption SelfCodeAlign pipeline produces valid accepted/rejected preference pairs for code quality
Used to generate training data for both DPO and BoNBoN alignment.
domain assumption HumanEval+, MBPP+, EvalPerf, EvoEval and CODAL benchmarks accurately and comprehensively measure functional and non-functional code requirements
All evaluation conclusions depend on these benchmarks reflecting real software-engineering needs.

pith-pipeline@v0.9.1-grok · 5858 in / 1467 out tokens · 42627 ms · 2026-06-30T08:27:53.837563+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 59 canonical work pages · 26 internal anchors

[1]

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. 2025. OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs.arXiv preprint arXiv:2504.04030(2025). doi:10.48550/arXiv.2504.04030

work page doi:10.48550/arxiv.2504.04030 2025
[2]

Jomar Thomas Almonte, Santhosh Anitha Boominathan, and Nathalia Nascimento. 2025. Automated Non-Functional Requirements Generation in Software Engineering with Large Language Models: A Comparative Study.arXiv preprint arXiv:2503.15248(2025). doi:10.48550/arXiv.2503.15248

work page doi:10.48550/arxiv.2503.15248 2025
[3]

AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card1, 1 (2024), 4

2024
[4]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732 (2021). doi:10.48550/arXiv.2108.07732

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732 2021
[5]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021). doi:10.48550/arXiv.2107.03374

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021
[7]

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin70, 4 (1968), 213. doi:10.1037/h0026256

work page doi:10.1037/h0026256 1968
[8]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025). doi:10.48550/arXiv.2507. 06261

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507 2025
[9]

Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. 2025. Less is more: Improving llm alignment via preference data selection.arXiv preprint arXiv:2502.14560(2025). doi:10.48550/arXiv.2502.14560

work page doi:10.48550/arxiv.2502.14560 2025
[10]

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861(2023). doi:10.48550/arXiv.2308.01861

work page doi:10.48550/arxiv.2308.01861 2023
[11]

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475(2024). doi:10.48550/arXiv.2404.04475

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.04475 2024
[12]

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306(2024). doi:10.48550/arXiv.2402.01306

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.01306 2024
[13]

Lovedeep Gondara, Jonathan Simkin, Graham Sayle, Shebnum Devji, Gregory Arbour, and Raymond Ng. 2025. Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare.arXiv preprint arXiv:2504.21191(2025). doi:10.48550/arXiv.2504.21191

work page doi:10.48550/arxiv.2504.21191 2025
[14]

Leo A. Goodman. 1961. Snowball Sampling.The Annals of Mathematical Statistics32, 1 (1961), 148–170. doi:10.1214/ aoms/1177705148

work page arXiv 1961
[15]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al . 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024). doi:10.48550/arXiv.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
[16]

Lin Gui, Cristina Garbacea, and Victor Veitch. 2024. BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.52202/ 079017-0094

2024
[17]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025). doi:10.48550/arXiv.2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
[18]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196(2024). doi:10.48550/arXiv.2401.14196

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14196 2024
[19]

Jiwoo Hong, Noah Lee, and James Thorne. 2024. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691(2024). doi:10.48550/arXiv.2403.07691

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.07691 2024
[20]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024). doi:10.48550/arXiv.2409.12186

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024
[21]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL] doi:10.48550/arXiv.2406.00515

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.00515 2024
[22]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770(2023). doi:10. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026. FSE116:22 Sanjeepan Sivapiran and Gias Uddin 4855...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Ishan Jindal, Chandana Badrinath, Pranjal Bharti, Lakkidi Vinay, and Sachin Dev Sharma. 2024. Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs.arXiv preprint arXiv:2410.10739 (2024). doi:10.48550/arXiv.2410.10739

work page doi:10.48550/arxiv.2410.10739 2024
[24]

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences114, 13 (2017), 3521–3526. doi:10.1073/pnas.1611835114

work page doi:10.1073/pnas.1611835114 2017
[25]

Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. 2022. The Stack: 3 TB of permissively licensed source code. arXiv:2211.15533 [cs.CL] doi:10.48550/arXiv.2211.15533

work page doi:10.48550/arxiv.2211.15533 2022
[26]

Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. 2023. Understanding catastrophic forgetting in language models via implicit inference.arXiv preprint arXiv:2309.10105(2023). doi:10.48550/arXiv.2309.10105

work page doi:10.48550/arxiv.2309.10105 2023
[27]

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. 2023. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267(2023). doi:10.48550/arXiv.2309.00267

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.00267 2023
[29]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and LINGMING ZHANG. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 21558–21...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.01210 2023
[30]

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2025. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. arXiv:2308.08747 [cs.CL] doi:10.48550/arXiv.2308. 08747

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308 2025
[31]

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568(2023). doi:10.48550/arXiv.2306.08568

work page doi:10.48550/arxiv.2306.08568 2023
[32]

Zeyuan Ma, Hongshu Guo, Jiacheng Chen, Guojun Peng, Zhiguang Cao, Yining Ma, and Yue-Jiao Gong. 2024. Llamoco: Instruction tuning of large language models for optimization code generation.arXiv preprint arXiv:2403.01131(2024). doi:10.48550/arXiv.2403.01131

work page doi:10.48550/arxiv.2403.01131 2024
[33]

Mary L McHugh. 2012. Interrater reliability: the kappa statistic.Biochemia medica22, 3 (2012), 276–282

2012
[34]

Yibo Miao, Bofei Gao, Shanghaoran Quan, Junyang Lin, Daoguang Zan, Jiaheng Liu, Jian Yang, Tianyu Liu, and Zhijie Deng. 2024. Aligning codellms with direct preference optimization.arXiv preprint arXiv:2410.18585(2024). doi:10.48550/arXiv.2410.18585

work page doi:10.48550/arxiv.2410.18585 2024
[35]

Sherin Muckatira, Vijeta Deshpande, Vladislav Lialin, and Anna Rumshisky. 2024. Emergent abilities in reduced-scale generative language models.arXiv preprint arXiv:2404.02204(2024). doi:10.48550/arXiv.2404.02204

work page doi:10.48550/arxiv.2404.02204 2024
[36]

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models. In NeurIPS 2023 workshop on instruction tuning and instruction following. doi:10.48550/arXiv.2308.07124

work page doi:10.48550/arxiv.2308.07124 2023
[37]

OpenAI, Josh Achiam, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] doi:10.48550/arXiv.2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2024
[38]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems35 (2022), 27730–27744. doi:10.48550/arXiv.2203.02155

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022
[39]

Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. 2024. Iterative reasoning preference optimization.Advances in Neural Information Processing Systems37 (2024), 116617–116637. doi:10.48550/arXiv.2404.19733

work page doi:10.48550/arxiv.2404.19733 2024
[40]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. InThirty-seventh Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2305.18290

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.18290 2023
[41]

Vinay Venkatesh Ramasesh, Aitor Lewkowycz, and Ethan Dyer. 2021. Effect of scale on catastrophic forgetting in neural networks. InInternational conference on learning representations. doi:10.48550/arXiv.2010.04495

work page doi:10.48550/arxiv.2010.04495 2021
[42]

Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. 2024. Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning. arXiv:2402.18865 [cs.LG] doi:10.48550/arXiv.2402.18865

work page doi:10.48550/arxiv.2402.18865 2024
[43]

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950 (2023). doi:10.48550/arXiv.2308.12950 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.12950 2023
[44]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms.CoRRabs/1707.06347 (2017). arXiv:1707.06347 doi:10.48550/arXiv.1707.06347

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1707.06347 2017
[45]

Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness.arXiv preprint arXiv:2401.15963(2024). doi:10.48550/arXiv.2401.15963

work page doi:10.48550/arxiv.2401.15963 2024
[46]

Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. 2025. Overtrained language models are harder to fine-tune.arXiv preprint arXiv:2503.19206(2025). doi:10.48550/arXiv.2503.19206

work page doi:10.48550/arxiv.2503.19206 2025
[47]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback.Advances in neural information processing systems33 (2020), 3008–3021. doi:10.48550/arXiv.2009.01325

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.01325 2020
[48]

Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. 2024. Code less, align more: Efficient llm fine-tuning for code generation with data pruning.arXiv preprint arXiv:2407.05040(2024). doi:10.48550/arXiv.2407.05040

work page doi:10.48550/arxiv.2407.05040 2024
[49]

Jonathan Ullrich, Matthias Koch, and Andreas Vogelsang. 2025. From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering.arXiv preprint arXiv:2507.07548(2025). doi:10.48550/arXiv.2507.07548

work page doi:10.48550/arxiv.2507.07548 2025
[50]

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open code large language models for code understanding and generation.arXiv preprint arXiv:2305.07922(2023). doi:10.48550/arXiv.2305.07922

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.07922 2023
[51]

Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. 2024. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more.arXiv preprint arXiv:2407.16216(2024). doi:10.48550/arXiv.2407.16216

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.16216 2024
[52]

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682 (2022). doi:10.48550/arXiv.2206.07682

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2206.07682 2022
[53]

Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro Von Werra, Arjun Guha, and LINGMING ZHANG. 2024. SelfCodeAlign: Self-Alignment for Code Generation. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2410.24198

work page doi:10.48550/arxiv.2410.24198 2024
[54]

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120(2023). doi:10.48550/arXiv.2312.02120

work page doi:10.48550/arxiv.2312.02120 2023
[55]

Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui. 2025. CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences.ACM Trans. Softw. Eng. Methodol.(May 2025). doi:10.1145/3736407 Just Accepted

work page doi:10.1145/3736407 2025
[56]

Chunqiu Steven Xia, Yinlin Deng, and LINGMING ZHANG. 2024. Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM. InFirst Conference on Language Modeling. doi:10.48550/ arXiv.2403.19114

work page arXiv 2024
[57]

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. 2024. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719(2024). doi:10.48550/ arXiv.2404.10719

work page arXiv 2024
[58]

Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, and David Lo. 2024. Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code. arXiv:2403.07506 [cs.SE] doi:10.48550/ arXiv.2403.07506

work page arXiv 2024
[59]

Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, and Mingyuan Zhou. 2024. Relative preference optimization: Enhancing llm alignment through contrasting responses across identical and diverse prompts.arXiv preprint arXiv:2402.10958(2024). doi:10.48550/arXiv.2402.10958

work page doi:10.48550/arxiv.2402.10958 2024
[60]

Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12. doi:10.48550/arXiv.2302.00288

work page doi:10.48550/arxiv.2302.00288 2024
[61]

Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. 2024. Codedpo: Aligning code models with self generated and verified source code.arXiv preprint arXiv:2410.05605(2024). doi:10.48550/arXiv. 2410.05605

work page internal anchor Pith review doi:10.48550/arxiv 2024
[62]

Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. 2024. Gpt4roi: Instruction tuning large language model on region-of-interest. InEuropean conference on computer vision. Springer, 52–70. doi:10.48550/arXiv.2307.03601

work page doi:10.48550/arxiv.2307.03601 2024
[63]

Yichi Zhang, Zhuo Chen, Yin Fang, Yanxi Lu, Li Fangming, Wen Zhang, and Huajun Chen. 2024. Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thaila...

work page doi:10.18653/v1/2024.findings-acl.52 2024
[64]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems36 (2023), 46595–46623. doi:10.48550/arXiv.2306.05685 Received 2026-02-10; accepted 2026-03-24 Proc. ACM Softw...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023

[1] [1]

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. 2025. OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs.arXiv preprint arXiv:2504.04030(2025). doi:10.48550/arXiv.2504.04030

work page doi:10.48550/arxiv.2504.04030 2025

[2] [2]

Jomar Thomas Almonte, Santhosh Anitha Boominathan, and Nathalia Nascimento. 2025. Automated Non-Functional Requirements Generation in Software Engineering with Large Language Models: A Comparative Study.arXiv preprint arXiv:2503.15248(2025). doi:10.48550/arXiv.2503.15248

work page doi:10.48550/arxiv.2503.15248 2025

[3] [3]

AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card1, 1 (2024), 4

2024

[4] [4]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732 (2021). doi:10.48550/arXiv.2108.07732

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732 2021

[5] [5]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021). doi:10.48550/arXiv.2107.03374

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021

[6] [7]

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin70, 4 (1968), 213. doi:10.1037/h0026256

work page doi:10.1037/h0026256 1968

[7] [8]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025). doi:10.48550/arXiv.2507. 06261

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507 2025

[8] [9]

Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. 2025. Less is more: Improving llm alignment via preference data selection.arXiv preprint arXiv:2502.14560(2025). doi:10.48550/arXiv.2502.14560

work page doi:10.48550/arxiv.2502.14560 2025

[9] [10]

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861(2023). doi:10.48550/arXiv.2308.01861

work page doi:10.48550/arxiv.2308.01861 2023

[10] [11]

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475(2024). doi:10.48550/arXiv.2404.04475

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.04475 2024

[11] [12]

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306(2024). doi:10.48550/arXiv.2402.01306

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.01306 2024

[12] [13]

Lovedeep Gondara, Jonathan Simkin, Graham Sayle, Shebnum Devji, Gregory Arbour, and Raymond Ng. 2025. Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare.arXiv preprint arXiv:2504.21191(2025). doi:10.48550/arXiv.2504.21191

work page doi:10.48550/arxiv.2504.21191 2025

[13] [14]

Leo A. Goodman. 1961. Snowball Sampling.The Annals of Mathematical Statistics32, 1 (1961), 148–170. doi:10.1214/ aoms/1177705148

work page arXiv 1961

[14] [15]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al . 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024). doi:10.48550/arXiv.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024

[15] [16]

Lin Gui, Cristina Garbacea, and Victor Veitch. 2024. BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.52202/ 079017-0094

2024

[16] [17]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025). doi:10.48550/arXiv.2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025

[17] [18]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196(2024). doi:10.48550/arXiv.2401.14196

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14196 2024

[18] [19]

Jiwoo Hong, Noah Lee, and James Thorne. 2024. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691(2024). doi:10.48550/arXiv.2403.07691

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.07691 2024

[19] [20]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024). doi:10.48550/arXiv.2409.12186

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024

[20] [21]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL] doi:10.48550/arXiv.2406.00515

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.00515 2024

[21] [22]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770(2023). doi:10. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026. FSE116:22 Sanjeepan Sivapiran and Gias Uddin 4855...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [23]

Ishan Jindal, Chandana Badrinath, Pranjal Bharti, Lakkidi Vinay, and Sachin Dev Sharma. 2024. Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs.arXiv preprint arXiv:2410.10739 (2024). doi:10.48550/arXiv.2410.10739

work page doi:10.48550/arxiv.2410.10739 2024

[23] [24]

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences114, 13 (2017), 3521–3526. doi:10.1073/pnas.1611835114

work page doi:10.1073/pnas.1611835114 2017

[24] [25]

Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. 2022. The Stack: 3 TB of permissively licensed source code. arXiv:2211.15533 [cs.CL] doi:10.48550/arXiv.2211.15533

work page doi:10.48550/arxiv.2211.15533 2022

[25] [26]

Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. 2023. Understanding catastrophic forgetting in language models via implicit inference.arXiv preprint arXiv:2309.10105(2023). doi:10.48550/arXiv.2309.10105

work page doi:10.48550/arxiv.2309.10105 2023

[26] [27]

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. 2023. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267(2023). doi:10.48550/arXiv.2309.00267

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.00267 2023

[27] [29]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and LINGMING ZHANG. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 21558–21...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.01210 2023

[28] [30]

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2025. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. arXiv:2308.08747 [cs.CL] doi:10.48550/arXiv.2308. 08747

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308 2025

[29] [31]

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568(2023). doi:10.48550/arXiv.2306.08568

work page doi:10.48550/arxiv.2306.08568 2023

[30] [32]

Zeyuan Ma, Hongshu Guo, Jiacheng Chen, Guojun Peng, Zhiguang Cao, Yining Ma, and Yue-Jiao Gong. 2024. Llamoco: Instruction tuning of large language models for optimization code generation.arXiv preprint arXiv:2403.01131(2024). doi:10.48550/arXiv.2403.01131

work page doi:10.48550/arxiv.2403.01131 2024

[31] [33]

Mary L McHugh. 2012. Interrater reliability: the kappa statistic.Biochemia medica22, 3 (2012), 276–282

2012

[32] [34]

Yibo Miao, Bofei Gao, Shanghaoran Quan, Junyang Lin, Daoguang Zan, Jiaheng Liu, Jian Yang, Tianyu Liu, and Zhijie Deng. 2024. Aligning codellms with direct preference optimization.arXiv preprint arXiv:2410.18585(2024). doi:10.48550/arXiv.2410.18585

work page doi:10.48550/arxiv.2410.18585 2024

[33] [35]

Sherin Muckatira, Vijeta Deshpande, Vladislav Lialin, and Anna Rumshisky. 2024. Emergent abilities in reduced-scale generative language models.arXiv preprint arXiv:2404.02204(2024). doi:10.48550/arXiv.2404.02204

work page doi:10.48550/arxiv.2404.02204 2024

[34] [36]

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models. In NeurIPS 2023 workshop on instruction tuning and instruction following. doi:10.48550/arXiv.2308.07124

work page doi:10.48550/arxiv.2308.07124 2023

[35] [37]

OpenAI, Josh Achiam, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] doi:10.48550/arXiv.2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2024

[36] [38]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems35 (2022), 27730–27744. doi:10.48550/arXiv.2203.02155

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022

[37] [39]

Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. 2024. Iterative reasoning preference optimization.Advances in Neural Information Processing Systems37 (2024), 116617–116637. doi:10.48550/arXiv.2404.19733

work page doi:10.48550/arxiv.2404.19733 2024

[38] [40]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. InThirty-seventh Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2305.18290

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.18290 2023

[39] [41]

Vinay Venkatesh Ramasesh, Aitor Lewkowycz, and Ethan Dyer. 2021. Effect of scale on catastrophic forgetting in neural networks. InInternational conference on learning representations. doi:10.48550/arXiv.2010.04495

work page doi:10.48550/arxiv.2010.04495 2021

[40] [42]

Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. 2024. Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning. arXiv:2402.18865 [cs.LG] doi:10.48550/arXiv.2402.18865

work page doi:10.48550/arxiv.2402.18865 2024

[41] [43]

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950 (2023). doi:10.48550/arXiv.2308.12950 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.12950 2023

[42] [44]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms.CoRRabs/1707.06347 (2017). arXiv:1707.06347 doi:10.48550/arXiv.1707.06347

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1707.06347 2017

[43] [45]

Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness.arXiv preprint arXiv:2401.15963(2024). doi:10.48550/arXiv.2401.15963

work page doi:10.48550/arxiv.2401.15963 2024

[44] [46]

Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. 2025. Overtrained language models are harder to fine-tune.arXiv preprint arXiv:2503.19206(2025). doi:10.48550/arXiv.2503.19206

work page doi:10.48550/arxiv.2503.19206 2025

[45] [47]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback.Advances in neural information processing systems33 (2020), 3008–3021. doi:10.48550/arXiv.2009.01325

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.01325 2020

[46] [48]

Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. 2024. Code less, align more: Efficient llm fine-tuning for code generation with data pruning.arXiv preprint arXiv:2407.05040(2024). doi:10.48550/arXiv.2407.05040

work page doi:10.48550/arxiv.2407.05040 2024

[47] [49]

Jonathan Ullrich, Matthias Koch, and Andreas Vogelsang. 2025. From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering.arXiv preprint arXiv:2507.07548(2025). doi:10.48550/arXiv.2507.07548

work page doi:10.48550/arxiv.2507.07548 2025

[48] [50]

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open code large language models for code understanding and generation.arXiv preprint arXiv:2305.07922(2023). doi:10.48550/arXiv.2305.07922

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.07922 2023

[49] [51]

Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. 2024. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more.arXiv preprint arXiv:2407.16216(2024). doi:10.48550/arXiv.2407.16216

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.16216 2024

[50] [52]

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682 (2022). doi:10.48550/arXiv.2206.07682

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2206.07682 2022

[51] [53]

Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro Von Werra, Arjun Guha, and LINGMING ZHANG. 2024. SelfCodeAlign: Self-Alignment for Code Generation. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2410.24198

work page doi:10.48550/arxiv.2410.24198 2024

[52] [54]

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120(2023). doi:10.48550/arXiv.2312.02120

work page doi:10.48550/arxiv.2312.02120 2023

[53] [55]

Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui. 2025. CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences.ACM Trans. Softw. Eng. Methodol.(May 2025). doi:10.1145/3736407 Just Accepted

work page doi:10.1145/3736407 2025

[54] [56]

Chunqiu Steven Xia, Yinlin Deng, and LINGMING ZHANG. 2024. Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM. InFirst Conference on Language Modeling. doi:10.48550/ arXiv.2403.19114

work page arXiv 2024

[55] [57]

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. 2024. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719(2024). doi:10.48550/ arXiv.2404.10719

work page arXiv 2024

[56] [58]

Zhou Yang, Zhensu Sun, Terry Zhuo Yue, Premkumar Devanbu, and David Lo. 2024. Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code. arXiv:2403.07506 [cs.SE] doi:10.48550/ arXiv.2403.07506

work page arXiv 2024

[57] [59]

Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, and Mingyuan Zhou. 2024. Relative preference optimization: Enhancing llm alignment through contrasting responses across identical and diverse prompts.arXiv preprint arXiv:2402.10958(2024). doi:10.48550/arXiv.2402.10958

work page doi:10.48550/arxiv.2402.10958 2024

[58] [60]

Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12. doi:10.48550/arXiv.2302.00288

work page doi:10.48550/arxiv.2302.00288 2024

[59] [61]

Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. 2024. Codedpo: Aligning code models with self generated and verified source code.arXiv preprint arXiv:2410.05605(2024). doi:10.48550/arXiv. 2410.05605

work page internal anchor Pith review doi:10.48550/arxiv 2024

[60] [62]

Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. 2024. Gpt4roi: Instruction tuning large language model on region-of-interest. InEuropean conference on computer vision. Springer, 52–70. doi:10.48550/arXiv.2307.03601

work page doi:10.48550/arxiv.2307.03601 2024

[61] [63]

Yichi Zhang, Zhuo Chen, Yin Fang, Yanxi Lu, Li Fangming, Wen Zhang, and Huajun Chen. 2024. Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thaila...

work page doi:10.18653/v1/2024.findings-acl.52 2024

[62] [64]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems36 (2023), 46595–46623. doi:10.48550/arXiv.2306.05685 Received 2026-02-10; accepted 2026-03-24 Proc. ACM Softw...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023