Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 01:53 UTC · model grok-4.3
The pith
A dataset of over 350k code preference pairs trains multilingual reward models that score code on five criteria instead of correctness alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By compiling Themis-CodePreference with more than 350k preference pairs and training Themis-RM on it, the authors produce multilingual code reward models that perform flexible multi-criteria scoring. The resulting models exhibit positive scaling trends, strong cross-lingual transfer from diverse training preferences, and improved reliability when trained on multiple criteria simultaneously.
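Independent of the paper's specifics, "flexible multi-criteria scoring" can be sketched as per-criterion rewards combined with caller-chosen weights. The criterion names and scores below are illustrative (drawn from the dimensions this review mentions, such as readability, efficiency, and security), not the paper's actual rubric output, and the paper may implement flexibility differently (e.g., criterion-conditioned prompting rather than aggregation):

```python
# Sketch of multi-criteria scoring: per-criterion rewards are combined
# with caller-chosen weights. All names and values are illustrative.
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion reward scores."""
    total_w = sum(weights.get(c, 0.0) for c in scores)
    return sum(s * weights.get(c, 0.0) for c, s in scores.items()) / total_w

scores = {"correctness": 0.9, "readability": 0.4, "efficiency": 0.7,
          "security": 0.8, "style": 0.5}

# Emphasizing security over the other criteria changes the ranking signal.
uniform = aggregate(scores, {c: 1.0 for c in scores})
secure_first = aggregate(scores, {c: 1.0 for c in scores} | {"security": 3.0})
print(round(uniform, 3), round(secure_first, 3))  # → 0.66 0.7
```

The point of the flexibility claim is exactly this: the same underlying scores can serve different downstream preferences without retraining.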
What carries the argument
Themis-CodePreference, a collection of more than 350k code preference pairs spanning eight languages and five criteria, which supplies the training signal for Themis-RM models to learn multi-criteria judgments.
If this is right
- Code generation pipelines can optimize for readability, efficiency, and security in addition to functional correctness.
- Training on preferences from multiple languages produces reward models that generalize across languages without language-specific fine-tuning.
- Larger parameter counts in the Themis-RM suite continue to improve scoring accuracy on multi-criteria tasks.
- Multi-criteria training data is required for reward models to avoid over-optimizing single aspects of code quality.
Where Pith is reading between the lines
- These reward models could be plugged directly into reinforcement learning loops for code LLMs to produce outputs that better match developer preferences.
- The same preference-collection method might be applied to related tasks such as code repair or test generation.
- Hybrid systems that combine the multi-criteria scores with traditional execution feedback could yield more robust alignment signals.
Load-bearing premise
The collected preference pairs accurately reflect what constitutes high-quality code across the five criteria and eight languages.
What would settle it
When Themis-RM scores are used to rank or filter code outputs on a downstream generation task, the selected code shows no measurable gain in human preference or execution success over code ranked by prior single-criterion reward models.
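The settling test above amounts to best-of-n selection: generate candidates, score them with the reward model, keep the top-ranked one, then compare human preference or execution success against a single-criterion baseline. A minimal sketch, in which `reward_score` is a toy stand-in for a trained reward model such as Themis-RM, not the real scorer:

```python
# Minimal best-of-n reranking sketch. `reward_score` is a hypothetical
# stand-in for a reward model; here it counts docstrings and type hints
# as a toy readability proxy.
def reward_score(prompt: str, candidate: str) -> float:
    score = 0.0
    if '"""' in candidate:
        score += 1.0   # toy signal: has a docstring
    if "->" in candidate:
        score += 0.5   # toy signal: has a return annotation
    return score

def best_of_n(prompt: str, candidates: list[str]) -> str:
    # Keep the candidate the reward model scores highest.
    return max(candidates, key=lambda c: reward_score(prompt, c))

candidates = [
    "def add(a, b): return a + b",
    'def add(a: int, b: int) -> int:\n    """Return the sum."""\n    return a + b',
]
print(best_of_n("write add", candidates))
```

If the code selected this way shows no gain over a single-criterion baseline on the downstream task, the core claim is in trouble.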
Original abstract
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Themis-CodeRewardBench, a benchmark evaluating code reward models across five preference dimensions and eight programming languages, profiling over 50 existing RMs. It then presents Themis-CodePreference, an open dataset of more than 350k code preference pairs, used to train Themis-RM models (600M to 32B parameters) for multilingual multi-criteria scoring. Experiments claim positive scaling trends, strong cross-lingual transfer from diverse preferences, and the necessity of multi-criteria training for reliable performance beyond functional correctness.
Significance. If the results hold, this work provides a substantial open resource for code reward modeling, extending beyond execution-based feedback to multi-criteria evaluation in a multilingual setting. The scale of the preference dataset and the suite of trained models represent a clear contribution, with demonstrated scaling and transfer effects offering practical value for post-training alignment in code generation.
major comments (2)
- [Benchmark Construction] (likely §3 or equivalent): Themis-CodeRewardBench is built from the same preference-collection pipeline as Themis-CodePreference. This creates a circularity risk where any systematic biases in the (possibly LLM-generated) preferences could inflate both training performance and benchmark scores. External grounding against human judgments or downstream execution-based metrics for code generation tasks is needed to validate that the benchmark reliably measures real-world RM utility.
- [Experiments and Ablations] (likely §5): While positive scaling trends and benefits of multi-criteria training are reported, the manuscript must provide more granular detail on ablation controls, including the exact comparison setups between multi-criteria and single-criterion models, the statistical significance of the cross-lingual transfer results, and how the five criteria are balanced during training. Without these, the claim that multi-criteria training is 'important for reliable code reward modeling' remains under-supported.
minor comments (2)
- [Abstract/Introduction] The abstract and introduction should explicitly list the five preference dimensions (criteria) rather than referring to them generically, to improve immediate clarity for readers.
- [Model Training] Model size notation (600M, 32B) and parameter counts should be used consistently in tables and text; minor inconsistencies in reporting training hyperparameters across model scales would benefit from standardization.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The two major comments raise important points about potential circularity in benchmark construction and the need for more detailed experimental controls. We address each below and have incorporated revisions to strengthen the paper.
Point-by-point responses
-
Referee: [Benchmark Construction] (likely §3 or equivalent): Themis-CodeRewardBench is built from the same preference-collection pipeline as Themis-CodePreference. This creates a circularity risk where any systematic biases in the (possibly LLM-generated) preferences could inflate both training performance and benchmark scores. External grounding against human judgments or downstream execution-based metrics for code generation tasks is needed to validate that the benchmark reliably measures real-world RM utility.
Authors: We appreciate this concern regarding potential circularity. The benchmark and training dataset do share the same preference collection pipeline, but the benchmark instances were explicitly held out and disjoint from the training pairs to avoid leakage. To address the need for external grounding, we have added new experiments in the revised manuscript comparing Themis-RM scores against human judgments on a sampled subset of the benchmark (with inter-annotator agreement reported) as well as downstream code generation performance using execution-based metrics on tasks like HumanEval and MBPP. These additions help validate that the benchmark captures meaningful RM utility beyond pipeline-specific biases. revision: yes
-
Referee: [Experiments and Ablations] (likely §5): While positive scaling trends and benefits of multi-criteria training are reported, the manuscript must provide more granular detail on ablation controls, including the exact comparison setups between multi-criteria and single-criterion models, the statistical significance of the cross-lingual transfer results, and how the five criteria are balanced during training. Without these, the claim that multi-criteria training is 'important for reliable code reward modeling' remains under-supported.
Authors: We agree that additional granularity is required to fully support the claims. In the revised manuscript, we have expanded Section 5 with: (i) precise ablation setups detailing matched data volumes and training steps for multi-criteria versus single-criterion models; (ii) statistical significance results using paired t-tests and bootstrap confidence intervals for the cross-lingual transfer experiments; and (iii) explicit details on criterion balancing, implemented via equal-proportion sampling across the five dimensions during training. These revisions provide stronger empirical support for the necessity of multi-criteria training. revision: yes
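The bootstrap confidence intervals mentioned in the response can be sketched in a few lines of stdlib Python. The per-example score differences below are invented illustrative numbers, not results from the paper:

```python
import random

def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences
    (e.g., multi-criteria minus single-criterion RM accuracy per example)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

deltas = [0.02, 0.05, -0.01, 0.04, 0.03, 0.00, 0.06, 0.01]  # illustrative
lo, hi = bootstrap_ci(deltas)
print(f"95% CI for mean improvement: [{lo:.3f}, {hi:.3f}]")
```

A CI whose lower bound stays above zero would support the cross-lingual transfer claim; one straddling zero would not.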
Circularity Check
No significant circularity in empirical data collection and training pipeline
full rationale
The paper's core claims rest on compiling a new benchmark (Themis-CodeRewardBench) for profiling existing RMs, then collecting a large independent preference dataset (Themis-CodePreference with >350k pairs) to train new multilingual RMs (Themis-RM) from 600M to 32B parameters. Experiments report scaling trends, cross-lingual transfer, and benefits of multi-criteria training. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation; the benchmark and dataset are presented as distinct artifacts without reducing evaluation metrics to training inputs by construction. This is standard empirical ML work whose results remain falsifiable against external downstream metrics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear
  Matched passage: "We train Themis-RM models ... using the Bradley-Terry reward modeling objective on preference tuples ... L = -E[...] log σ(r_θ(p_c, y_c) - r_θ(p_r, y_r)) + λ·log p_θ(y_c | ...) + μ·(r_θ ...)^2"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear
  Matched passage: "positive scaling trends, strong cross-lingual transfer ... importance of multi-criteria training"
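The Bradley-Terry objective quoted above reduces, per preference pair, to -log σ(r_chosen - r_rejected). A minimal pure-Python sketch of that pairwise term (the λ language-modeling term is dropped and the μ penalty simplified, so this is not the paper's exact loss):

```python
import math

def bt_loss(chosen_rewards, rejected_rewards, mu=0.0):
    """Bradley-Terry pairwise loss: -log sigmoid(r_c - r_r), averaged over
    pairs, plus an optional L2 penalty on reward magnitudes (the mu term)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    total = 0.0
    for rc, rr in zip(chosen_rewards, rejected_rewards):
        total += -math.log(sigmoid(rc - rr)) + mu * (rc * rc + rr * rr)
    return total / len(chosen_rewards)

# A correctly ordered pair (chosen scored above rejected) incurs low loss;
# an inverted pair incurs high loss.
good = bt_loss([2.0], [0.0])   # ≈ 0.127
bad = bt_loss([0.0], [2.0])    # ≈ 2.127
print(good < bad)              # → True
```

The gradient of this loss pushes the model to widen the margin between chosen and rejected responses, which is all a preference-pair dataset like Themis-CodePreference supplies.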
Reference graph
Works this paper leans on
-
[22]
We first ensure that all samples in Themis-GeneralPreference and Themis-CodePreference are no longer than 2560 and 4096 tokens, respectively. Subsequently, we filter out samples with trivial code responses whose syntax tree is shallower than 3 levels deep. Additionally, we ensure that all GitHub commit preference data we train on is sourced no later than March 2019
-
[23]
We next leverage the GlotLID (Kargaran et al., 2023) language classifier to discard samples with non-English prompts, followed by filtering out samples with prompt perplexities greater than 1200, as measured by a KenLM (Heafield, 2011) model trained on the OSCAR EN corpus (Abadji et al., 2022)
-
[24]
Next, we run a dataset-level (i.e., Themis-GeneralPreference and Themis-CodePreference separately) near-deduplication step using a MinHash (Broder, 1997) filter with a shingle size of 20 and a similarity threshold of 0.75. Finally, following prior work (Brown et al., 2020; Elazar et al., 2024), we decontaminate our training data by removing any sample whose...
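The MinHash near-deduplication step quoted above (shingle size 20, similarity threshold 0.75) can be sketched in a few lines. This is an illustrative reimplementation, not the paper's pipeline: it uses character shingles and seeded MD5 hashes in place of true random permutations:

```python
import hashlib

def shingles(text: str, k: int = 20) -> set[str]:
    # Character k-grams; k=20 matches the shingle size quoted above.
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(sh: set[str], num_perm: int = 128) -> list[int]:
    # The min over a seeded hash stands in for a random permutation.
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of agreeing signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = "def add(a, b):\n    return a + b\n" * 4
b = a + "#fix\n"                                        # near-duplicate of a
c = "class Tree:\n    def insert(self, v): ...\n" * 4   # unrelated code

sim_ab = estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b)))
sim_ac = estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(c)))
# A near-duplicate pair scores far higher than an unrelated pair.
print(round(sim_ab, 2), round(sim_ac, 2))
```

In a production pipeline the signatures would be banded into an LSH index so candidate pairs are found without all-pairs comparison; the estimator itself is the same.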
-
[25]
The change doesn't improve the code's {{criteria}} or degrades it overall. The change may or may not also contain unrelated edits that are not specific to {{criteria}}. The change may also introduce other issues or bugs unrelated to {{criteria}}
-
[26]
The code change is unnecessary and does not have any discernible effect on the code's {{criteria}}, but does not degrade its {{criteria}} either. The change may or may not also contain unrelated edits that are not specific to {{criteria}}
-
[27]
The code change makes the code slightly better with respect to {{criteria}} but largely leaves it the same. The change might also contain unnecessary edits unrelated to {{criteria}}, but the majority of the changes are specific to {{criteria}}
-
[28]
The code change makes the code significantly better with respect to {{criteria}}. Sporadic edits unrelated to {{criteria}} may exist, but the majority of the changes are specific to {{criteria}}. The change does not introduce any new issues or bugs unrelated to {{criteria}}
-
[29]
The code change greatly improves the code's {{criteria}}, making it a must-have feature or addition. The change is also well implemented and specific, i.e., not a generic suggestion that could apply to any codebase. The incidence of unnecessary edits that are unrelated to {{criteria}} is minimal or non-existent in the change. The change does not introduce...
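The five graded descriptions above form an ordinal rubric per criterion. One plausible way (assumed here, not confirmed by this summary) to turn such grades into Bradley-Terry-style training pairs is to pair responses whose levels differ by a margin:

```python
# Sketch: converting per-criterion rubric grades (levels 1-5, as in the
# annotation descriptions above) into preference pairs. Field names and
# the margin rule are illustrative, not the paper's schema.
from itertools import combinations

def grades_to_pairs(samples: list[dict], min_margin: int = 2) -> list[dict]:
    """Pair responses whose rubric levels differ by at least `min_margin`;
    the higher-graded response becomes `chosen`, the lower `rejected`."""
    pairs = []
    for x, y in combinations(samples, 2):
        if abs(x["level"] - y["level"]) >= min_margin:
            top, bot = (x, y) if x["level"] > y["level"] else (y, x)
            pairs.append({"chosen": top["code"], "rejected": bot["code"]})
    return pairs

samples = [
    {"code": "v1", "level": 5},  # e.g. "greatly improves the code's {{criteria}}"
    {"code": "v2", "level": 3},  # e.g. "slightly better ... largely the same"
    {"code": "v3", "level": 1},  # e.g. "doesn't improve ... or degrades it"
]
pairs = grades_to_pairs(samples)
print(len(pairs))  # → 3: (v1,v2), (v1,v3), (v2,v3) all differ by >= 2 levels
```

A margin requirement of this kind filters out ambiguous pairs whose grades are adjacent, which tends to yield a cleaner preference signal.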
-
[30]
Is consistent in meaning with the provided reference solutions
-
[31]
Would sufficiently identifiably lead a mid-tier to experienced developer to plausibly converge on either of the reference solutions (EXAMPLE1 or EXAMPLE2) with equal likelihood
-
[32]
Resembles a {{content_style}} in its level of detail, complexity, structure, and style
-
[33]
Is free of any direct or indirect references to the reference solutions or the specific code constructs used in them, and does not copy the description verbatim. Provide the instruction you craft between [INSTRUCTION] and [\INSTRUCTION] tags. Listing 3: The inverse instruction creation prompts for crafting realistic queries for code change pairs mined fro...
-
[34]
It must be a modification of the reference solution that introduces only functional, logical, and algorithmic bugs
-
[35]
The introduction of small syntax and grammatical errors is also allowed. However, you must try to maintain the code's surface-level structure as much as possible
-
[36]
The buggy code must not allude to the original problem or the reference solution in any way. The problem statement and the reference solution are provided to you as part of the input, but you must not use them in your output
-
[37]
The buggy code must not allude to the introduced bugs in any way. Variables, functions, classes, and other identifiers should not be named in a way that suggests the presence of bugs. Similarly, the comments and documentation should not hint at the bugs
-
[38]
The addition of new features or the removal of existing ones is out of scope for this task
-
[39]
The introduction of security vulnerabilities, memory leaks, or other non-functional bugs is out of scope for this task. Below is a validated (PROBLEM, REFERENCE_SOLUTION) pair that you can use to generate the buggy code snippet. The problem is enclosed between the tags [PROBLEM] and [\PROBLEM]. The reference solution is enclosed between the tags [REFERENC...