pith · machine review for the scientific record

arxiv: 2605.00754 · v3 · submitted 2026-05-01 · 💻 cs.SE · cs.LG

Recognition: 2 Lean theorem links

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Glavaš Glavas, Indraneil Paul, Iryna Gurevych

Pith reviewed 2026-05-11 01:53 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords code reward models · multilingual code preferences · multi-criteria scoring · preference dataset · code generation · reward modeling · Themis-RM · code alignment

The pith

A dataset of over 350k code preference pairs trains multilingual reward models that score code on five criteria instead of correctness alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to move code reward models beyond their current focus on functional correctness in executable programs. It does so by first creating a benchmark that tests reward models on five preference dimensions across eight languages, then assembling Themis-CodePreference, the largest open collection of code preferences with more than 350k pairs. These pairs are used to train Themis-RM, a family of models from 600M to 32B parameters. Experiments show positive scaling with size, strong transfer when training data spans languages, and better reliability when multiple criteria are learned together rather than in isolation. A reader would care because reward models guide the alignment and scaling of code-generating language models, so expanding their scope could improve generated code on qualities such as readability, efficiency, and security.
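
The summary does not spell out how a reward model is actually fit to preference pairs; for orientation, here is a minimal sketch of the standard Bradley-Terry pairwise objective commonly used for this kind of training. Illustrative PyTorch, not the authors' code; the criterion-conditioned scorer in the usage comment is a hypothetical interface.

```python
# Minimal sketch of pairwise reward-model training on code preference pairs.
# Standard Bradley-Terry objective; tensor shapes and the scorer are assumptions.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Push the preferred response's scalar score above the rejected one's.

    Both inputs have shape (batch,); the loss is -log sigmoid(s_chosen - s_rejected).
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical usage with a scorer that maps (prompt, code, criterion) -> scalar:
#   s_c = rm(prompts, chosen_code, criterion="readability")
#   s_r = rm(prompts, rejected_code, criterion="readability")
#   loss = pairwise_rm_loss(s_c, s_r)
```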

Core claim

By compiling Themis-CodePreference with more than 350k preference pairs and training Themis-RM on it, the authors produce multilingual code reward models that perform flexible multi-criteria scoring. The resulting models exhibit positive scaling trends, strong cross-lingual transfer from diverse training preferences, and improved reliability when trained on multiple criteria simultaneously.

What carries the argument

Themis-CodePreference, a collection of more than 350k code preference pairs spanning eight languages and five criteria, which supplies the training signal for Themis-RM models to learn multi-criteria judgments.
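
The record layout of these pairs is not given here; below is a hedged sketch of what one entry could look like, with every field name invented for illustration rather than taken from the released dataset.

```python
# Illustrative schema for a single multilingual, multi-criteria preference pair.
# Field names are assumptions, not the dataset's actual format.
from dataclasses import dataclass

@dataclass
class CodePreferencePair:
    prompt: str       # natural-language instruction or code-change request
    chosen: str       # preferred code response
    rejected: str     # dispreferred code response
    language: str     # one of the eight programming languages, e.g. "python"
    criterion: str    # one of the five preference dimensions, e.g. "efficiency"

example = CodePreferencePair(
    prompt="Avoid rescanning the list for every lookup.",
    chosen="seen = set(items)\nhits = [q for q in queries if q in seen]",
    rejected="hits = [q for q in queries if q in items]",
    language="python",
    criterion="efficiency",
)
```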

If this is right

  • Code generation pipelines can optimize for readability, efficiency, and security in addition to functional correctness.
  • Training on preferences from multiple languages produces reward models that generalize across languages without language-specific fine-tuning.
  • Larger parameter counts in the Themis-RM suite continue to improve scoring accuracy on multi-criteria tasks.
  • Multi-criteria training data is required for reward models to avoid over-optimizing single aspects of code quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • These reward models could be plugged directly into reinforcement learning loops for code LLMs to produce outputs that better match developer preferences (see the sketch after this list).
  • The same preference-collection method might be applied to related tasks such as code repair or test generation.
  • Hybrid systems that combine the multi-criteria scores with traditional execution feedback could yield more robust alignment signals.
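
A rough sketch of the first and third extensions above: blending per-criterion reward-model scores into one scalar that an RL loop, or a hybrid with execution feedback, could consume. The scorer interface, the criterion weights, and `run_tests` are hypothetical stand-ins, not anything the paper defines.

```python
# Hedged sketch: combining multi-criteria RM scores (optionally with execution
# feedback) into a single reward for policy optimization. All names are hypothetical.
from typing import Callable, Dict, Optional

def blended_reward(prompt: str, completion: str, reward_model,
                   weights: Dict[str, float],
                   exec_feedback: Optional[Callable[[str, str], float]] = None) -> float:
    """Weighted sum of per-criterion RM scores, plus an optional execution signal."""
    reward = sum(w * reward_model.score(prompt, completion, criterion=c)
                 for c, w in weights.items())
    if exec_feedback is not None:
        reward += exec_feedback(prompt, completion)  # e.g. +1.0 if unit tests pass
    return reward

# e.g. blended_reward(p, y, rm,
#                     {"correctness": 0.5, "security": 0.3, "readability": 0.2},
#                     exec_feedback=lambda p, y: 1.0 if run_tests(p, y) else 0.0)
```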

Load-bearing premise

The collected preference pairs accurately reflect what constitutes high-quality code across the five criteria and eight languages.

What would settle it

When Themis-RM scores are used to rank or filter code outputs on a downstream generation task, the selected code shows no measurable gain in human preference or execution success over code ranked by prior single-criterion reward models.
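
One concrete way to run that test is a best-of-n comparison: sample candidates per prompt, let each reward model pick its favourite, and compare downstream outcomes. The harness below is an illustrative sketch under that framing; `rm_score`, `sampler`, and `passes_tests` are hypothetical callables, not the paper's tooling.

```python
# Illustrative best-of-n harness for the falsification test described above.
from typing import Callable, List

def best_of_n(prompt: str, candidates: List[str],
              rm_score: Callable[[str, str], float]) -> str:
    """Return the candidate the reward model ranks highest for this prompt."""
    return max(candidates, key=lambda code: rm_score(prompt, code))

def pass_rate(prompts: List[str],
              sampler: Callable[[str], List[str]],
              rm_score: Callable[[str, str], float],
              passes_tests: Callable[[str, str], bool]) -> float:
    """Fraction of prompts whose RM-selected candidate passes its tests."""
    wins = sum(int(passes_tests(p, best_of_n(p, sampler(p), rm_score))) for p in prompts)
    return wins / len(prompts)

# The claim would be undermined if pass_rate(..., themis_score, ...) showed no gain
# over pass_rate(..., single_criterion_score, ...) on the same prompts and samples.
```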

Figures

Figures reproduced from arXiv: 2605.00754 by Glavaš Glavas, Indraneil Paul, Iryna Gurevych.

Figure 1: Overview of our pipeline for mining multi-programming-language, multi-criteria code preferences from …
Figure 2: Comparison of Themis-CodeRewardBench against the code subsets of popular existing RM evaluation benchmarks. Themis-CodeRewardBench judges RMs on (a) longer and (b) more complex code responses, over a (c) largely novel distribution of prompts.
Original abstract

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Themis-CodeRewardBench, a benchmark evaluating code reward models across five preference dimensions and eight programming languages, profiling over 50 existing RMs. It then presents Themis-CodePreference, an open dataset of more than 350k code preference pairs, used to train Themis-RM models (600M to 32B parameters) for multilingual multi-criteria scoring. Experiments claim positive scaling trends, strong cross-lingual transfer from diverse preferences, and the necessity of multi-criteria training for reliable performance beyond functional correctness.

Significance. If the results hold, this work provides a substantial open resource for code reward modeling, extending beyond execution-based feedback to multi-criteria evaluation in a multilingual setting. The scale of the preference dataset and the suite of trained models represent a clear contribution, with demonstrated scaling and transfer effects offering practical value for post-training alignment in code generation.

major comments (2)
  1. [Benchmark Construction] (likely §3 or equivalent): Themis-CodeRewardBench is built from the same preference collection pipeline as Themis-CodePreference. This creates a circularity risk where any systematic biases in the (possibly LLM-generated) preferences could inflate both training performance and benchmark scores. External grounding against human judgments or downstream execution-based metrics for code generation tasks is needed to validate that the benchmark reliably measures real-world RM utility.
  2. [Experiments and Ablations] (likely §5): While positive scaling trends and benefits of multi-criteria training are reported, the manuscript must provide more granular details on ablation controls, including exact comparison setups between multi-criteria and single-criterion models, statistical significance of cross-lingual transfer results, and how the five criteria are balanced during training. Without these, the claim that multi-criteria training is 'important for reliable code reward modeling' remains under-supported.
minor comments (2)
  1. [Abstract/Introduction] The abstract and introduction should explicitly list the five preference dimensions (criteria) rather than referring to them generically, to improve immediate clarity for readers.
  2. [Model Training] Model size notation (600M, 32B) and parameter counts should be used consistently in tables and text; minor inconsistencies in reporting training hyperparameters across model scales would benefit from standardization.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The two major comments raise important points about potential circularity in benchmark construction and the need for more detailed experimental controls. We address each below and have incorporated revisions to strengthen the paper.

Point-by-point responses
  1. Referee: [Benchmark Construction] (likely §3 or equivalent): Themis-CodeRewardBench is built from the same preference collection pipeline as Themis-CodePreference. This creates a circularity risk where any systematic biases in the (possibly LLM-generated) preferences could inflate both training performance and benchmark scores. External grounding against human judgments or downstream execution-based metrics for code generation tasks is needed to validate that the benchmark reliably measures real-world RM utility.

    Authors: We appreciate this concern regarding potential circularity. The benchmark and training dataset do share the same preference collection pipeline, but the benchmark instances were explicitly held out and disjoint from the training pairs to avoid leakage. To address the need for external grounding, we have added new experiments in the revised manuscript comparing Themis-RM scores against human judgments on a sampled subset of the benchmark (with inter-annotator agreement reported) as well as downstream code generation performance using execution-based metrics on tasks like HumanEval and MBPP. These additions help validate that the benchmark captures meaningful RM utility beyond pipeline-specific biases. revision: yes

  2. Referee: [Experiments and Ablations] (likely §5): While positive scaling trends and benefits of multi-criteria training are reported, the manuscript must provide more granular details on ablation controls, including exact comparison setups between multi-criteria and single-criterion models, statistical significance of cross-lingual transfer results, and how the five criteria are balanced during training. Without these, the claim that multi-criteria training is 'important for reliable code reward modeling' remains under-supported.

    Authors: We agree that additional granularity is required to fully support the claims. In the revised manuscript, we have expanded Section 5 with: (i) precise ablation setups detailing matched data volumes and training steps for multi-criteria versus single-criterion models; (ii) statistical significance results using paired t-tests and bootstrap confidence intervals for the cross-lingual transfer experiments; and (iii) explicit details on criterion balancing, implemented via equal-proportion sampling across the five dimensions during training. These revisions provide stronger empirical support for the necessity of multi-criteria training. revision: yes
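
To make the statistical-control point concrete, here is a sketch of the kind of paired bootstrap the rebuttal alludes to: a confidence interval on the accuracy gap between two reward models evaluated on identical benchmark items. Purely illustrative; not the authors' evaluation code.

```python
# Paired bootstrap CI over per-item correctness (0/1) for two RMs on the same items.
import numpy as np

def paired_bootstrap_ci(correct_a: np.ndarray, correct_b: np.ndarray,
                        n_resamples: int = 10_000, alpha: float = 0.05,
                        seed: int = 0):
    """Return (lo, hi) bounds on mean(correct_a) - mean(correct_b)."""
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample benchmark items with replacement
        diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
    return (float(np.quantile(diffs, alpha / 2)),
            float(np.quantile(diffs, 1 - alpha / 2)))

# If the interval excludes 0, the multi-criteria model's edge on this benchmark is
# unlikely to be resampling noise.
```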

Circularity Check

0 steps flagged

No significant circularity in empirical data collection and training pipeline

Full rationale

The paper's core claims rest on compiling a new benchmark (Themis-CodeRewardBench) for profiling existing RMs, then collecting a large independent preference dataset (Themis-CodePreference with >350k pairs) to train new multilingual RMs (Themis-RM) from 600M to 32B parameters. Experiments report scaling trends, cross-lingual transfer, and benefits of multi-criteria training. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation; the benchmark and dataset are presented as distinct artifacts without reducing evaluation metrics to training inputs by construction. This is standard empirical ML work whose results remain falsifiable against external downstream metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard supervised preference learning and reward modeling techniques from prior RLHF literature; no new axioms, free parameters fitted to target results, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5540 in / 1225 out tokens · 41553 ms · 2026-05-11T01:53:49.070608+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
