Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation
Pith reviewed 2026-06-30 08:27 UTC · model grok-4.3
The pith
Aligning code LLMs from pretrained bases yields larger gains than from finetuned versions, though absolute accuracy stays lower.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pretrained-to-aligned pathways achieve larger improvements in the aligned variant over its pretrained variant. But the pretrained variant is generally less accurate than its finetuned variant. However, finetuned-to-aligned offers smaller performance improvements or, in some cases, degradation in the aligned variant than its finetuned variant.
What carries the argument
The pretrained-to-aligned versus finetuned-to-aligned comparison using reward-free DPO and BoNBoN on SelfCodeAlign preference pairs.
If this is right
- Alignment produces bigger relative gains when started from pretrained models rather than finetuned ones.
- Even after alignment, models begun from pretrained checkpoints remain less accurate than those begun from finetuned checkpoints.
- Further alignment on already finetuned code models can yield little benefit or can reduce performance.
- Reward-free methods can be applied in either pathway, but the size and direction of the effect differ by starting point.
Where Pith is reading between the lines
- When building a new code-specialized LLM, starting alignment from the base pretrained checkpoint may be preferable if the priority is maximizing relative improvement.
- The trade-off suggests practitioners may need to choose between maximizing final accuracy (favoring finetuned start) and maximizing alignment uplift (favoring pretrained start).
- The observed pattern could be tested on non-code tasks to see whether the same starting-point dependence appears outside software engineering.
Load-bearing premise
The SelfCodeAlign pipeline produces high-quality preference pairs that correctly reflect functional and non-functional code quality, and the five chosen benchmarks provide unbiased, comprehensive measurement of both correctness and software-engineering quality dimensions.
What would settle it
Replace the SelfCodeAlign preference pairs with human-annotated ones and re-run the pretrained versus finetuned alignment experiments to check whether the pattern of larger gains from pretrained still appears.
Figures
read the original abstract
Large Language Model (LLM) alignment trains an LLM using preference data to produce outputs that better meet established quality standards. While LLM alignment techniques are studied for non-coding tasks, we know little about their usefulness for coding tasks. It is unclear whether LLM code alignment could support both functional requirements (producing executable, correct code) and non-functional requirements (code readability, style, maintainability). It is also unknown whether alignment for a code LLM should begin with base pretrained version or the finetuned (i.e., instruction-tuned) version of the LLM. In this paper, we offer insights on the above two research questions by conducting an empirical study. We studied five state-of-the-art (SOTA) LLMs using two widely used LLM alignment techniques: Direct Preference Optimization (DPO) and BoNBoN. For each training record, we created a preference pair as accepted and rejected instances by using the SelfCodeAlign pipeline. DPO and BoNBoN are reward-free models, i.e., they eliminate the need for multiple reward scores for output preferences. We tuned each LLM using the two alignment techniques in two settings: pretrained and finetuned versions of an LLM. We evaluated functional requirements using four SOTA benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional requirements using the CODAL benchmark, which evaluates code quality across five dimensions derived from software engineering practices. We find that pretrained-to-aligned pathways achieve larger improvements in the aligned variant over its pretrained variant. But the pretrained variant is generally less accurate than its finetuned variant. However, finetuned- to-aligned offers smaller performance improvements or, in some cases, degradation in the aligned variant than its finetuned variant.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical comparison of reward-free alignment (DPO and BoNBoN) on five SOTA LLMs, using SelfCodeAlign to generate preference pairs. It evaluates two pathways—pretrained-to-aligned versus finetuned-to-aligned—on functional correctness benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval) and non-functional code quality (CODAL across five SE dimensions), claiming larger gains from the pretrained starting point despite lower absolute performance.
Significance. If the differential gains are robust, the result would inform whether code alignment pipelines should begin from base pretrained models rather than instruction-tuned checkpoints, with implications for balancing functional correctness against maintainability and style. The reward-free framing and dual evaluation of functional/non-functional requirements are positive aspects.
major comments (3)
- [Abstract] Abstract and experimental protocol: the central claim of larger improvements in pretrained-to-aligned versus finetuned-to-aligned rests on reported performance deltas, yet the manuscript supplies no statistical tests, confidence intervals, or error bars on any benchmark scores, leaving open whether observed differences exceed prompt or sampling variance.
- [Abstract] SelfCodeAlign pipeline description: the preference pairs used for both DPO and BoNBoN are generated by this pipeline, but the text provides no validation (e.g., correlation with HumanEval+ execution outcomes, inter-rater agreement, or human judgment) that accepted/rejected instances accurately reflect functional correctness or the five CODAL dimensions; without this, differential gains could reflect model sensitivity to pipeline artifacts rather than genuine alignment dynamics.
- [Abstract] Benchmark and confound discussion: the study cites five benchmarks but does not address potential overlap between the SelfCodeAlign training data and the evaluation sets, nor does it report ablations on prompt sensitivity or multiple random seeds, both of which are load-bearing for the comparative claim across pretrained and finetuned starting points.
minor comments (1)
- [Abstract] The abstract states that "pretrained variant is generally less accurate than its finetuned variant" but does not quantify this baseline gap on each benchmark, which would help readers interpret the magnitude of subsequent alignment gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on statistical rigor, pipeline validation, and potential confounds. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core empirical findings.
read point-by-point responses
-
Referee: [Abstract] Abstract and experimental protocol: the central claim of larger improvements in pretrained-to-aligned versus finetuned-to-aligned rests on reported performance deltas, yet the manuscript supplies no statistical tests, confidence intervals, or error bars on any benchmark scores, leaving open whether observed differences exceed prompt or sampling variance.
Authors: We agree that the absence of statistical tests and error bars weakens the robustness of the comparative claims. In the revised manuscript we will add 95% bootstrap confidence intervals for all reported scores and conduct paired statistical tests (McNemar’s test for pass@1 rates and Wilcoxon signed-rank for other metrics) on the deltas between the two alignment pathways to demonstrate that the larger gains from the pretrained starting point exceed sampling variance. revision: yes
-
Referee: [Abstract] SelfCodeAlign pipeline description: the preference pairs used for both DPO and BoNBoN are generated by this pipeline, but the text provides no validation (e.g., correlation with HumanEval+ execution outcomes, inter-rater agreement, or human judgment) that accepted/rejected instances accurately reflect functional correctness or the five CODAL dimensions; without this, differential gains could reflect model sensitivity to pipeline artifacts rather than genuine alignment dynamics.
Authors: The SelfCodeAlign pipeline relies on execution-based filtering for functional correctness and static-analysis rules for the CODAL dimensions; these mechanisms are inherited from the original pipeline paper. We acknowledge that our manuscript does not report explicit validation metrics. We will add a dedicated subsection that computes the correlation between pipeline labels and HumanEval+ pass rates on a 200-example held-out sample, reports inter-rater agreement from our internal human review of 100 pairs, and discusses any residual risk of pipeline artifacts. revision: yes
-
Referee: [Abstract] Benchmark and confound discussion: the study cites five benchmarks but does not address potential overlap between the SelfCodeAlign training data and the evaluation sets, nor does it report ablations on prompt sensitivity or multiple random seeds, both of which are load-bearing for the comparative claim across pretrained and finetuned starting points.
Authors: We did not perform explicit overlap checks or multi-seed runs in the original experiments. In revision we will (1) quantify lexical and semantic overlap between SelfCodeAlign training prompts and the five evaluation benchmarks, (2) report all main results averaged over three random seeds, and (3) include a prompt-sensitivity ablation on two representative models (one pretrained, one finetuned) to confirm that the differential gains are not driven by prompt wording. revision: yes
Circularity Check
No circularity: empirical head-to-head comparison on external benchmarks
full rationale
The paper reports an empirical study that applies DPO and BoNBoN to pretrained vs. finetuned LLMs using preference pairs generated by the external SelfCodeAlign pipeline, then measures outcomes on five independent benchmarks (HumanEval+, MBPP+, EvalPerf, EvoEval, CODAL). No equations, fitted parameters, or predictions are defined inside the paper; reported improvements are direct differences against those external benchmarks. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained against outside data and does not reduce to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption SelfCodeAlign pipeline produces valid accepted/rejected preference pairs for code quality
- domain assumption HumanEval+, MBPP+, EvalPerf, EvoEval and CODAL benchmarks accurately and comprehensively measure functional and non-functional code requirements
Reference graph
Works this paper leans on
-
[1]
Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. 2025. OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs.arXiv preprint arXiv:2504.04030(2025). doi:10.48550/arXiv.2504.04030
-
[2]
Jomar Thomas Almonte, Santhosh Anitha Boominathan, and Nathalia Nascimento. 2025. Automated Non-Functional Requirements Generation in Software Engineering with Large Language Models: A Comparative Study.arXiv preprint arXiv:2503.15248(2025). doi:10.48550/arXiv.2503.15248
-
[3]
AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card1, 1 (2024), 4
2024
-
[4]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732 (2021). doi:10.48550/arXiv.2108.07732
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732 2021
-
[5]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021). doi:10.48550/arXiv.2107.03374
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374 2021
-
[7]
Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin70, 4 (1968), 213. doi:10.1037/h0026256
-
[8]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025). doi:10.48550/arXiv.2507. 06261
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507 2025
-
[9]
Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. 2025. Less is more: Improving llm alignment via preference data selection.arXiv preprint arXiv:2502.14560(2025). doi:10.48550/arXiv.2502.14560
-
[10]
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861(2023). doi:10.48550/arXiv.2308.01861
-
[11]
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475(2024). doi:10.48550/arXiv.2404.04475
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.04475 2024
-
[12]
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306(2024). doi:10.48550/arXiv.2402.01306
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.01306 2024
-
[13]
Lovedeep Gondara, Jonathan Simkin, Graham Sayle, Shebnum Devji, Gregory Arbour, and Raymond Ng. 2025. Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare.arXiv preprint arXiv:2504.21191(2025). doi:10.48550/arXiv.2504.21191
- [14]
-
[15]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al . 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024). doi:10.48550/arXiv.2407.21783
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
-
[16]
Lin Gui, Cristina Garbacea, and Victor Veitch. 2024. BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.52202/ 079017-0094
2024
-
[17]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025). doi:10.48550/arXiv.2501.12948
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
-
[18]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196(2024). doi:10.48550/arXiv.2401.14196
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14196 2024
-
[19]
Jiwoo Hong, Noah Lee, and James Thorne. 2024. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691(2024). doi:10.48550/arXiv.2403.07691
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.07691 2024
-
[20]
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024). doi:10.48550/arXiv.2409.12186
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024
-
[21]
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.CL] doi:10.48550/arXiv.2406.00515
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.00515 2024
-
[22]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770(2023). doi:10. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026. FSE116:22 Sanjeepan Sivapiran and Gias Uddin 4855...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Ishan Jindal, Chandana Badrinath, Pranjal Bharti, Lakkidi Vinay, and Sachin Dev Sharma. 2024. Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs.arXiv preprint arXiv:2410.10739 (2024). doi:10.48550/arXiv.2410.10739
-
[24]
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences114, 13 (2017), 3521–3526. doi:10.1073/pnas.1611835114
-
[25]
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. 2022. The Stack: 3 TB of permissively licensed source code. arXiv:2211.15533 [cs.CL] doi:10.48550/arXiv.2211.15533
-
[26]
Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. 2023. Understanding catastrophic forgetting in language models via implicit inference.arXiv preprint arXiv:2309.10105(2023). doi:10.48550/arXiv.2309.10105
-
[27]
Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. 2023. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267(2023). doi:10.48550/arXiv.2309.00267
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.00267 2023
-
[29]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and LINGMING ZHANG. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 21558–21...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.01210 2023
-
[30]
Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2025. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. arXiv:2308.08747 [cs.CL] doi:10.48550/arXiv.2308. 08747
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308 2025
-
[31]
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568(2023). doi:10.48550/arXiv.2306.08568
-
[32]
Zeyuan Ma, Hongshu Guo, Jiacheng Chen, Guojun Peng, Zhiguang Cao, Yining Ma, and Yue-Jiao Gong. 2024. Llamoco: Instruction tuning of large language models for optimization code generation.arXiv preprint arXiv:2403.01131(2024). doi:10.48550/arXiv.2403.01131
-
[33]
Mary L McHugh. 2012. Interrater reliability: the kappa statistic.Biochemia medica22, 3 (2012), 276–282
2012
-
[34]
Yibo Miao, Bofei Gao, Shanghaoran Quan, Junyang Lin, Daoguang Zan, Jiaheng Liu, Jian Yang, Tianyu Liu, and Zhijie Deng. 2024. Aligning codellms with direct preference optimization.arXiv preprint arXiv:2410.18585(2024). doi:10.48550/arXiv.2410.18585
-
[35]
Sherin Muckatira, Vijeta Deshpande, Vladislav Lialin, and Anna Rumshisky. 2024. Emergent abilities in reduced-scale generative language models.arXiv preprint arXiv:2404.02204(2024). doi:10.48550/arXiv.2404.02204
-
[36]
Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2023. Octopack: Instruction tuning code large language models. In NeurIPS 2023 workshop on instruction tuning and instruction following. doi:10.48550/arXiv.2308.07124
-
[37]
OpenAI, Josh Achiam, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] doi:10.48550/arXiv.2303.08774
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2024
-
[38]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems35 (2022), 27730–27744. doi:10.48550/arXiv.2203.02155
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022
-
[39]
Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. 2024. Iterative reasoning preference optimization.Advances in Neural Information Processing Systems37 (2024), 116617–116637. doi:10.48550/arXiv.2404.19733
-
[40]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. InThirty-seventh Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2305.18290
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.18290 2023
-
[41]
Vinay Venkatesh Ramasesh, Aitor Lewkowycz, and Ethan Dyer. 2021. Effect of scale on catastrophic forgetting in neural networks. InInternational conference on learning representations. doi:10.48550/arXiv.2010.04495
-
[42]
Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. 2024. Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning. arXiv:2402.18865 [cs.LG] doi:10.48550/arXiv.2402.18865
-
[43]
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950 (2023). doi:10.48550/arXiv.2308.12950 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE116. Publication date: July 2026. ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.12950 2023
-
[44]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms.CoRRabs/1707.06347 (2017). arXiv:1707.06347 doi:10.48550/arXiv.1707.06347
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1707.06347 2017
-
[45]
Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness.arXiv preprint arXiv:2401.15963(2024). doi:10.48550/arXiv.2401.15963
-
[46]
Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan. 2025. Overtrained language models are harder to fine-tune.arXiv preprint arXiv:2503.19206(2025). doi:10.48550/arXiv.2503.19206
-
[47]
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback.Advances in neural information processing systems33 (2020), 3008–3021. doi:10.48550/arXiv.2009.01325
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.01325 2020
-
[48]
Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. 2024. Code less, align more: Efficient llm fine-tuning for code generation with data pruning.arXiv preprint arXiv:2407.05040(2024). doi:10.48550/arXiv.2407.05040
-
[49]
Jonathan Ullrich, Matthias Koch, and Andreas Vogelsang. 2025. From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering.arXiv preprint arXiv:2507.07548(2025). doi:10.48550/arXiv.2507.07548
-
[50]
Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open code large language models for code understanding and generation.arXiv preprint arXiv:2305.07922(2023). doi:10.48550/arXiv.2305.07922
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.07922 2023
-
[51]
Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. 2024. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more.arXiv preprint arXiv:2407.16216(2024). doi:10.48550/arXiv.2407.16216
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.16216 2024
-
[52]
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682 (2022). doi:10.48550/arXiv.2206.07682
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2206.07682 2022
-
[53]
Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro Von Werra, Arjun Guha, and LINGMING ZHANG. 2024. SelfCodeAlign: Self-Alignment for Code Generation. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. doi:10.48550/arXiv.2410.24198
-
[54]
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120(2023). doi:10.48550/arXiv.2312.02120
-
[55]
Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui. 2025. CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences.ACM Trans. Softw. Eng. Methodol.(May 2025). doi:10.1145/3736407 Just Accepted
- [56]
- [57]
- [58]
-
[59]
Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, and Mingyuan Zhou. 2024. Relative preference optimization: Enhancing llm alignment through contrasting responses across identical and diverse prompts.arXiv preprint arXiv:2402.10958(2024). doi:10.48550/arXiv.2402.10958
-
[60]
Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12. doi:10.48550/arXiv.2302.00288
-
[61]
Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, and Zhi Jin. 2024. Codedpo: Aligning code models with self generated and verified source code.arXiv preprint arXiv:2410.05605(2024). doi:10.48550/arXiv. 2410.05605
work page internal anchor Pith review doi:10.48550/arxiv 2024
-
[62]
Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. 2024. Gpt4roi: Instruction tuning large language model on region-of-interest. InEuropean conference on computer vision. Springer, 52–70. doi:10.48550/arXiv.2307.03601
-
[63]
Yichi Zhang, Zhuo Chen, Yin Fang, Yanxi Lu, Li Fangming, Wen Zhang, and Huajun Chen. 2024. Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering. InFindings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thaila...
-
[64]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems36 (2023), 46595–46623. doi:10.48550/arXiv.2306.05685 Received 2026-02-10; accepted 2026-03-24 Proc. ACM Softw...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.