Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 01:53 UTC · model grok-4.3
The pith
A dataset of over 350k code preference pairs trains multilingual reward models that score code on five criteria instead of correctness alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By compiling Themis-CodePreference with more than 350k preference pairs and training Themis-RM on it, the authors produce multilingual code reward models that perform flexible multi-criteria scoring. The resulting models exhibit positive scaling trends, strong cross-lingual transfer from diverse training preferences, and improved reliability when trained on multiple criteria simultaneously.
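Independent of the paper's specifics, "flexible multi-criteria scoring" can be sketched as per-criterion rewards combined with caller-chosen weights. The criterion names and scores below are illustrative (drawn from the dimensions this review mentions, such as readability, efficiency, and security), not the paper's actual rubric output, and the paper may implement flexibility differently (e.g., criterion-conditioned prompting rather than aggregation):

```python
# Sketch of multi-criteria scoring: per-criterion rewards are combined
# with caller-chosen weights. All names and values are illustrative.
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion reward scores."""
    total_w = sum(weights.get(c, 0.0) for c in scores)
    return sum(s * weights.get(c, 0.0) for c, s in scores.items()) / total_w

scores = {"correctness": 0.9, "readability": 0.4, "efficiency": 0.7,
          "security": 0.8, "style": 0.5}

# Emphasizing security over the other criteria changes the ranking signal.
uniform = aggregate(scores, {c: 1.0 for c in scores})
secure_first = aggregate(scores, {c: 1.0 for c in scores} | {"security": 3.0})
print(round(uniform, 3), round(secure_first, 3))  # → 0.66 0.7
```

The point of the flexibility claim is exactly this: the same underlying scores can serve different downstream preferences without retraining.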
What carries the argument
Themis-CodePreference, a collection of more than 350k code preference pairs spanning eight languages and five criteria, which supplies the training signal for Themis-RM models to learn multi-criteria judgments.
If this is right
- Code generation pipelines can optimize for readability, efficiency, and security in addition to functional correctness.
- Training on preferences from multiple languages produces reward models that generalize across languages without language-specific fine-tuning.
- Larger parameter counts in the Themis-RM suite continue to improve scoring accuracy on multi-criteria tasks.
- Multi-criteria training data is required for reward models to avoid over-optimizing single aspects of code quality.
Where Pith is reading between the lines
- These reward models could be plugged directly into reinforcement learning loops for code LLMs to produce outputs that better match developer preferences.
- The same preference-collection method might be applied to related tasks such as code repair or test generation.
- Hybrid systems that combine the multi-criteria scores with traditional execution feedback could yield more robust alignment signals.
Load-bearing premise
The collected preference pairs accurately reflect what constitutes high-quality code across the five criteria and eight languages.
What would settle it
When Themis-RM scores are used to rank or filter code outputs on a downstream generation task, the selected code shows no measurable gain in human preference or execution success over code ranked by prior single-criterion reward models.
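The settling test above amounts to best-of-n selection: generate candidates, score them with the reward model, keep the top-ranked one, then compare human preference or execution success against a single-criterion baseline. A minimal sketch, in which `reward_score` is a toy stand-in for a trained reward model such as Themis-RM, not the real scorer:

```python
# Minimal best-of-n reranking sketch. `reward_score` is a hypothetical
# stand-in for a reward model; here it counts docstrings and type hints
# as a toy readability proxy.
def reward_score(prompt: str, candidate: str) -> float:
    score = 0.0
    if '"""' in candidate:
        score += 1.0   # toy signal: has a docstring
    if "->" in candidate:
        score += 0.5   # toy signal: has a return annotation
    return score

def best_of_n(prompt: str, candidates: list[str]) -> str:
    # Keep the candidate the reward model scores highest.
    return max(candidates, key=lambda c: reward_score(prompt, c))

candidates = [
    "def add(a, b): return a + b",
    'def add(a: int, b: int) -> int:\n    """Return the sum."""\n    return a + b',
]
print(best_of_n("write add", candidates))
```

If the code selected this way shows no gain over a single-criterion baseline on the downstream task, the core claim is in trouble.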
Original abstract
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Themis-CodeRewardBench, a benchmark evaluating code reward models across five preference dimensions and eight programming languages, profiling over 50 existing RMs. It then presents Themis-CodePreference, an open dataset of more than 350k code preference pairs, used to train Themis-RM models (600M to 32B parameters) for multilingual multi-criteria scoring. Experiments claim positive scaling trends, strong cross-lingual transfer from diverse preferences, and the necessity of multi-criteria training for reliable performance beyond functional correctness.
Significance. If the results hold, this work provides a substantial open resource for code reward modeling, extending beyond execution-based feedback to multi-criteria evaluation in a multilingual setting. The scale of the preference dataset and the suite of trained models represent a clear contribution, with demonstrated scaling and transfer effects offering practical value for post-training alignment in code generation.
major comments (2)
- [Benchmark Construction] (likely §3 or equivalent): Themis-CodeRewardBench is built from the same preference-collection pipeline as Themis-CodePreference. This creates a circularity risk where any systematic biases in the (possibly LLM-generated) preferences could inflate both training performance and benchmark scores. External grounding against human judgments or downstream execution-based metrics for code generation tasks is needed to validate that the benchmark reliably measures real-world RM utility.
- [Experiments and Ablations] (likely §5): While positive scaling trends and benefits of multi-criteria training are reported, the manuscript must provide more granular detail on ablation controls, including the exact comparison setups between multi-criteria and single-criterion models, the statistical significance of the cross-lingual transfer results, and how the five criteria are balanced during training. Without these, the claim that multi-criteria training is 'important for reliable code reward modeling' remains under-supported.
minor comments (2)
- [Abstract/Introduction] The abstract and introduction should explicitly list the five preference dimensions (criteria) rather than referring to them generically, to improve immediate clarity for readers.
- [Model Training] Model size notation (600M, 32B) and parameter counts should be used consistently in tables and text; minor inconsistencies in reporting training hyperparameters across model scales would benefit from standardization.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The two major comments raise important points about potential circularity in benchmark construction and the need for more detailed experimental controls. We address each below and have incorporated revisions to strengthen the paper.
Point-by-point responses
-
Referee: [Benchmark Construction] (likely §3 or equivalent): Themis-CodeRewardBench is built from the same preference-collection pipeline as Themis-CodePreference. This creates a circularity risk where any systematic biases in the (possibly LLM-generated) preferences could inflate both training performance and benchmark scores. External grounding against human judgments or downstream execution-based metrics for code generation tasks is needed to validate that the benchmark reliably measures real-world RM utility.
Authors: We appreciate this concern regarding potential circularity. The benchmark and training dataset do share the same preference collection pipeline, but the benchmark instances were explicitly held out and disjoint from the training pairs to avoid leakage. To address the need for external grounding, we have added new experiments in the revised manuscript comparing Themis-RM scores against human judgments on a sampled subset of the benchmark (with inter-annotator agreement reported) as well as downstream code generation performance using execution-based metrics on tasks like HumanEval and MBPP. These additions help validate that the benchmark captures meaningful RM utility beyond pipeline-specific biases. revision: yes
-
Referee: [Experiments and Ablations] (likely §5): While positive scaling trends and benefits of multi-criteria training are reported, the manuscript must provide more granular detail on ablation controls, including the exact comparison setups between multi-criteria and single-criterion models, the statistical significance of the cross-lingual transfer results, and how the five criteria are balanced during training. Without these, the claim that multi-criteria training is 'important for reliable code reward modeling' remains under-supported.
Authors: We agree that additional granularity is required to fully support the claims. In the revised manuscript, we have expanded Section 5 with: (i) precise ablation setups detailing matched data volumes and training steps for multi-criteria versus single-criterion models; (ii) statistical significance results using paired t-tests and bootstrap confidence intervals for the cross-lingual transfer experiments; and (iii) explicit details on criterion balancing, implemented via equal-proportion sampling across the five dimensions during training. These revisions provide stronger empirical support for the necessity of multi-criteria training. revision: yes
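The bootstrap confidence intervals mentioned in the response can be sketched in a few lines of stdlib Python. The per-example score differences below are invented illustrative numbers, not results from the paper:

```python
import random

def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences
    (e.g., multi-criteria minus single-criterion RM accuracy per example)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

deltas = [0.02, 0.05, -0.01, 0.04, 0.03, 0.00, 0.06, 0.01]  # illustrative
lo, hi = bootstrap_ci(deltas)
print(f"95% CI for mean improvement: [{lo:.3f}, {hi:.3f}]")
```

A CI whose lower bound stays above zero would support the cross-lingual transfer claim; one straddling zero would not.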
Circularity Check
No significant circularity in empirical data collection and training pipeline
full rationale
The paper's core claims rest on compiling a new benchmark (Themis-CodeRewardBench) for profiling existing RMs, then collecting a large independent preference dataset (Themis-CodePreference with >350k pairs) to train new multilingual RMs (Themis-RM) from 600M to 32B parameters. Experiments report scaling trends, cross-lingual transfer, and benefits of multi-criteria training. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation; the benchmark and dataset are presented as distinct artifacts without reducing evaluation metrics to training inputs by construction. This is standard empirical ML work whose results remain falsifiable against external downstream metrics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear
  Matched passage: "We train Themis-RM models ... using the Bradley-Terry reward modeling objective on preference tuples ... L = -E[...] log σ(r_θ(p_c, y_c) - r_θ(p_r, y_r)) + λ·log p_θ(y_c | ...) + μ·(r_θ ...)^2"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear
  Matched passage: "positive scaling trends, strong cross-lingual transfer ... importance of multi-criteria training"
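The Bradley-Terry objective quoted above reduces, per preference pair, to -log σ(r_chosen - r_rejected). A minimal pure-Python sketch of that pairwise term (the λ language-modeling term is dropped and the μ penalty simplified, so this is not the paper's exact loss):

```python
import math

def bt_loss(chosen_rewards, rejected_rewards, mu=0.0):
    """Bradley-Terry pairwise loss: -log sigmoid(r_c - r_r), averaged over
    pairs, plus an optional L2 penalty on reward magnitudes (the mu term)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    total = 0.0
    for rc, rr in zip(chosen_rewards, rejected_rewards):
        total += -math.log(sigmoid(rc - rr)) + mu * (rc * rc + rr * rr)
    return total / len(chosen_rewards)

# A correctly ordered pair (chosen scored above rejected) incurs low loss;
# an inverted pair incurs high loss.
good = bt_loss([2.0], [0.0])   # ≈ 0.127
bad = bt_loss([0.0], [2.0])    # ≈ 2.127
print(good < bad)              # → True
```

The gradient of this loss pushes the model to widen the margin between chosen and rejected responses, which is all a preference-pair dataset like Themis-CodePreference supplies.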
Reference graph
Works this paper leans on
-
[22]
We first ensure that all samples in Themis-GeneralPreference and Themis-CodePreference are no longer than 2560 and 4096 tokens, respectively. Subsequently, we filter out samples with trivial code responses whose syntax tree is shallower than 3 levels deep. Additionally, we ensure that all GitHub commit preference data we train on is sourced no later than March 2019
-
[23]
We next leverage the GlotLID (Kargaran et al., 2023) language classifier to discard samples with non-English prompts, followed by filtering out samples with prompt perplexities greater than 1200, as measured by a KenLM (Heafield, 2011) model trained on the OSCAR EN corpus (Abadji et al., 2022)
-
[24]
Next, we run a dataset-level (i.e., Themis-GeneralPreference and Themis-CodePreference separately) near-deduplication step using a MinHash (Broder, 1997) filter with a shingle size of 20 and a similarity threshold of 0.75. Finally, following prior work (Brown et al., 2020; Elazar et al., 2024), we decontaminate our training data by removing any sample whose...
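The MinHash near-deduplication step quoted above (shingle size 20, similarity threshold 0.75) can be sketched in a few lines. This is an illustrative reimplementation, not the paper's pipeline: it uses character shingles and seeded MD5 hashes in place of true random permutations:

```python
import hashlib

def shingles(text: str, k: int = 20) -> set[str]:
    # Character k-grams; k=20 matches the shingle size quoted above.
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(sh: set[str], num_perm: int = 128) -> list[int]:
    # The min over a seeded hash stands in for a random permutation.
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of agreeing signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = "def add(a, b):\n    return a + b\n" * 4
b = a + "#fix\n"                                        # near-duplicate of a
c = "class Tree:\n    def insert(self, v): ...\n" * 4   # unrelated code

sim_ab = estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b)))
sim_ac = estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(c)))
# A near-duplicate pair scores far higher than an unrelated pair.
print(round(sim_ab, 2), round(sim_ac, 2))
```

In a production pipeline the signatures would be banded into an LSH index so candidate pairs are found without all-pairs comparison; the estimator itself is the same.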
-
[25]
The change doesn't improve the code's {{criteria}} or degrades it overall. The change may or may not also contain unrelated edits that are not specific to {{criteria}}. The change may also introduce other issues or bugs unrelated to {{criteria}}
-
[26]
The code change is unnecessary and does not have any discernible effect on the code's {{criteria}}, but does not degrade its {{criteria}} either. The change may or may not also contain unrelated edits that are not specific to {{criteria}}
-
[27]
The code change makes the code slightly better with respect to {{criteria}} but largely leaves it the same. The change might also contain unnecessary edits unrelated to {{criteria}}, but the majority of the changes are specific to {{criteria}}
-
[28]
The code change makes the code significantly better with respect to {{criteria}}. Sporadic edits unrelated to {{criteria}} may exist, but the majority of the changes are specific to {{criteria}}. The change does not introduce any new issues or bugs unrelated to {{criteria}}
-
[29]
The code change greatly improves the code's {{criteria}}, making it a must-have feature or addition. The change is also well implemented and specific, i.e., not a generic suggestion that could apply to any codebase. The incidence of unnecessary edits that are unrelated to {{criteria}} is minimal or non-existent in the change. The change does not introduce...
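The five graded descriptions above form an ordinal rubric per criterion. One plausible way (assumed here, not confirmed by this summary) to turn such grades into Bradley-Terry-style training pairs is to pair responses whose levels differ by a margin:

```python
# Sketch: converting per-criterion rubric grades (levels 1-5, as in the
# annotation descriptions above) into preference pairs. Field names and
# the margin rule are illustrative, not the paper's schema.
from itertools import combinations

def grades_to_pairs(samples: list[dict], min_margin: int = 2) -> list[dict]:
    """Pair responses whose rubric levels differ by at least `min_margin`;
    the higher-graded response becomes `chosen`, the lower `rejected`."""
    pairs = []
    for x, y in combinations(samples, 2):
        if abs(x["level"] - y["level"]) >= min_margin:
            top, bot = (x, y) if x["level"] > y["level"] else (y, x)
            pairs.append({"chosen": top["code"], "rejected": bot["code"]})
    return pairs

samples = [
    {"code": "v1", "level": 5},  # e.g. "greatly improves the code's {{criteria}}"
    {"code": "v2", "level": 3},  # e.g. "slightly better ... largely the same"
    {"code": "v3", "level": 1},  # e.g. "doesn't improve ... or degrades it"
]
pairs = grades_to_pairs(samples)
print(len(pairs))  # → 3: (v1,v2), (v1,v3), (v2,v3) all differ by >= 2 levels
```

A margin requirement of this kind filters out ambiguous pairs whose grades are adjacent, which tends to yield a cleaner preference signal.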
-
[30]
Is consistent in meaning with the provided reference solutions
-
[31]
Would sufficiently identifiably lead a mid-tier to experienced developer to plausibly converge on either of the reference solutions (EXAMPLE1 or EXAMPLE2) with equal likelihood
-
[32]
Resembles a {{content_style}} in its level of detail, complexity, structure, and style
-
[33]
Is free of any direct or indirect references to the reference solutions or the specific code constructs used in them, and does not copy the description verbatim. Provide the instruction you craft between [INSTRUCTION] and [\INSTRUCTION] tags. Listing 3: The inverse instruction creation prompts for crafting realistic queries for code change pairs mined fro...
-
[34]
It must be a modification of the reference solution that introduces only functional, logical, and algorithmic bugs
-
[35]
The introduction of small syntax and grammatical errors is also allowed. However, you must try to maintain the code's surface-level structure as much as possible
-
[36]
The buggy code must not allude to the original problem or the reference solution in any way. The problem statement and the reference solution are provided to you as part of the input, but you must not use them in your output
-
[37]
The buggy code must not allude to the introduced bugs in any way. Variables, functions, classes, and other identifiers should not be named in a way that suggests the presence of bugs. Similarly, the comments and documentation should not hint at the bugs
-
[38]
The addition of new features or the removal of existing ones is out of scope for this task
-
[39]
The introduction of security vulnerabilities, memory leaks, or other non-functional bugs is out of scope for this task. Below is a validated (PROBLEM, REFERENCE_SOLUTION) pair that you can use to generate the buggy code snippet. The problem is enclosed between the tags [PROBLEM] and [\PROBLEM]. The reference solution is enclosed between the tags [REFERENC...