pith.

arXiv: 2605.31058 · v1 · pith:PBFG4PGV · submitted 2026-05-29 · 💻 cs.CL · cs.SE

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Pith reviewed 2026-06-28 23:06 UTC · model grok-4.3

classification 💻 cs.CL cs.SE
keywords: atomic decomposition · code task synthesis · RLVR · combinatorial synthesis · LLM coding · verifiable rewards · data synthesis · reinforcement learning

The pith

Decomposing code tasks into atoms and recombining them generates harder, more original training data for RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Atomic Decomposition and Recombination (ADR) to address the shortage of sufficiently challenging verifiable code tasks, which limits the scaling of Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs. ADR breaks existing tasks into atomic elements while keeping them verifiable, then recombines those elements under controlled rules to form new tasks. Experiments show that the resulting tasks exceed those produced by prior heuristic methods in originality, difficulty, diversity, and test quality. Training LLMs on ADR data produces larger gains in coding performance than training on baseline data, and the gains appear consistently across algorithmic programming, tool usage, and data science.

Core claim

ADR generates verifiable code tasks by decomposing existing tasks into atomic elements and recombining them in a controlled way. This yields genuinely novel and challenging verifiable tasks that, when used for RLVR, deliver greater improvements in code ability across diverse downstream domains.

What carries the argument

Atomic Decomposition and Recombination (ADR), the mechanism that decomposes code tasks into atomic elements while preserving verifiability and then recombines them to produce new tasks.
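A minimal sketch of the combinatorial sampling step, assuming decomposition has already produced a dictionary of element values for each seed task. The slot names mirror the four algorithmic-task elements quoted in the Figure 6 prompt (Core Algorithm Idea, Story Background, Strategy Diversity, Difficulty Level). `Atom`, `build_pools`, `recombine`, and the `min_sources` rule are hypothetical; in ADR the recombination itself is performed by an LLM prompted with sampled elements, so this is not the authors' implementation.

```python
import random
from dataclasses import dataclass

# Slot names follow the four algorithmic-task elements listed in the Figure 6 prompt.
ALGO_SLOTS = ("core_algorithm_idea", "story_background", "strategy_diversity", "difficulty_level")


@dataclass(frozen=True)
class Atom:
    slot: str
    value: str
    source_task: str  # id of the seed task this atom was extracted from


def build_pools(decomposed: dict[str, dict[str, str]]) -> dict[str, list[Atom]]:
    """Regroup {task_id: {slot: value}} into one pool of atoms per slot."""
    pools: dict[str, list[Atom]] = {slot: [] for slot in ALGO_SLOTS}
    for task_id, elements in decomposed.items():
        for slot in ALGO_SLOTS:
            pools[slot].append(Atom(slot, elements[slot], task_id))
    return pools


def recombine(pools: dict[str, list[Atom]], rng: random.Random,
              min_sources: int = 2, max_tries: int = 1000) -> dict[str, Atom]:
    """Draw one atom per slot, keeping only draws that mix at least `min_sources` seeds.

    The `min_sources` rule is a hypothetical stand-in for ADR's controlled recombination.
    """
    for _ in range(max_tries):
        combo = {slot: rng.choice(atoms) for slot, atoms in pools.items()}
        if len({atom.source_task for atom in combo.values()}) >= min_sources:
            return combo
    raise RuntimeError("no valid combination found; add seeds or relax min_sources")
```

Requiring atoms from at least two seeds is one crude way to stop a draw from simply reassembling an existing task. With one pool per slot, the number of possible combinations is the product of the pool sizes, which is where the "combinatorial" scaling comes from.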

If this is right

  • ADR tasks exhibit higher originality, difficulty, diversity, and test quality than tasks from existing baselines.
  • RLVR training on ADR data produces consistently larger gains in code ability than training on baseline data.
  • The gains hold across algorithmic programming, tool usage, and data science domains.
  • The method replaces heuristic seed expansions with systematic combinatorial synthesis for task generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If atomic decomposition generalizes, the same recombination strategy could be applied to other verifiable domains such as mathematical proofs or scientific experiment design.
  • Varying the recombination rules might let practitioners target specific difficulty bands without new human seeds.
  • The approach could be combined with automated difficulty estimators to prioritize tasks near a model's current competence edge.
  • Longer training runs might reveal whether the diversity of ADR tasks reduces overfitting compared with narrower heuristic sets.
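The third bullet above can be made concrete with a small filter. This is a hypothetical sketch of Pith's suggestion, not something the paper describes: it estimates each task's difficulty as the fraction of sampled solutions that pass all tests and keeps tasks inside a band. The function names and the default bounds of 0.1 and 0.7 are illustrative assumptions.

```python
def empirical_pass_rate(outcomes: list[bool]) -> float:
    """Fraction of sampled solutions that passed every test for one task."""
    if not outcomes:
        raise ValueError("need at least one sampled solution")
    return sum(outcomes) / len(outcomes)


def select_edge_tasks(results: dict[str, list[bool]],
                      low: float = 0.1, high: float = 0.7) -> list[str]:
    """Keep task ids whose pass rate lies inside [low, high] (bounds are illustrative)."""
    return sorted(
        task_id for task_id, outcomes in results.items()
        if low <= empirical_pass_rate(outcomes) <= high
    )
```

Tasks the model always solves carry no learning signal under a binary verifiable reward, and tasks it never solves give no positive examples, which is why a band strictly inside (0, 1) is a natural starting point.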

Load-bearing premise

That code tasks can be decomposed into atomic elements while preserving verifiability, and that controlled recombination will reliably produce genuinely novel and more challenging tasks rather than invalid or trivial ones.

What would settle it

A large-scale check showing that most recombined tasks are either unverifiable by the reward function or solved at the same rate as their seed tasks, or that training on them fails to improve downstream RLVR performance, would falsify the central claim.
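The first two falsification criteria reduce to simple counting once each recombined task carries a verifiability flag and solve rates for itself and its seed. The sketch below assumes those fields exist; the `RecombinedTask` record, `falsification_summary`, and the 0.05 tolerance are hypothetical. The third criterion, downstream RLVR gains, needs an actual training comparison and is not modeled.

```python
from dataclasses import dataclass


@dataclass
class RecombinedTask:
    verifiable: bool       # did the reward function run cleanly against the reference solution?
    pass_rate: float       # policy model's solve rate on this recombined task
    seed_pass_rate: float  # solve rate on the seed task(s) it was built from


def falsification_summary(tasks: list[RecombinedTask], tol: float = 0.05) -> dict[str, float]:
    """Fractions of tasks that are unverifiable, or verifiable but no harder than their seeds."""
    n = len(tasks)
    if n == 0:
        raise ValueError("empty task list")
    unverifiable = sum(not t.verifiable for t in tasks)
    # "Not harder": verifiable tasks solved at roughly the seed rate or above.
    not_harder = sum(t.verifiable and t.pass_rate >= t.seed_pass_rate - tol for t in tasks)
    return {
        "unverifiable_frac": unverifiable / n,
        "not_harder_frac": not_harder / n,
        "failing_frac": (unverifiable + not_harder) / n,  # the two groups are disjoint
    }
```

If `failing_frac` exceeded 0.5 on a large sample, the first half of the criterion above would be met.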

Figures

Figures reproduced from arXiv: 2605.31058 by Boxi Cao, Boxi Yu, Hongyu Lin, Jialun Cao, Jiasheng Zheng, Le Sun, Xianpei Han, Yaojie Lu, Yuzhong Zhang.

Figure 1: Overview of the Atomic Decomposition and Recombination (ADR) framework. ADR…
Figure 2: t-SNE visualization of the data density coverage of ADR and KodCode data, both derived from the same seed data. To validate the effectiveness of Info-Guided Element Schema Optimization (ESO) in Step 1, we examine whether optimized element schemas improve synthesized data quality. Specifically, we randomly sample 100 instances from the TACO dataset as seed data. For each iteration, we generate 100 problems…
Figure 3: Pass@8 (%) performance on LCB-v5 for Qwen2.5-Coder-7B-Instruct. Previous synthetic data methods, constrained by heuristic expansions of real-world data, fail to surpass original data performance. For example, on LCB-v5 and LCB-v6, Educational Instruct achieves an average of 20.66% on Qwen2.5-Coder-7B-Instruct, substantially underperforming TACO (23.23%), while the strongest baseline, KodCode, reaches 23.1…
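Figure 3 reports Pass@8. A common way to compute pass@k is the unbiased estimator introduced with the HumanEval benchmark (Chen et al., 2021): draw n ≥ k samples per problem, count the c correct ones, and compute 1 - C(n - c, k) / C(n, k). Whether the paper uses this estimator or simply draws exactly 8 samples is not stated here, so treat the sketch as an assumption; when n = k the two coincide, because the estimator then reduces to "at least one of the k samples is correct".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset of the samples contains a correct one
    # Probability that a random size-k subset contains no correct sample, complemented.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 16 samples of which 4 are correct, `pass_at_k(16, 4, 8)` is 25/26, about 0.96. A benchmark score is the mean of this value over problems.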
Figure 4: RL training dynamics of ADR and baseline datasets based on Qwen2.5-Coder-7B-Instruct.
Figure 5: Prompt Template for Test Quality Metric.
Figure 6: Prompt Template for Step 1: Element Extraction in ADR (algorithmic task). You are an expert in algorithmic problem design. Your task is to analyze the given Story Background and use the three provided combinations as inspiration, then construct one new and coherent set of four elements (Core Algorithm Idea, Story Background, Strategy Diversity, Difficulty Level). The output should follow the format…
Figure 7: Prompt Template for Step 2: Controlled Recombination in ADR (algorithmic task). You are an expert problem setter and algorithmist. Using only the provided Problem Framework, generate a single, self-contained, original, and challenging algorithm problem suitable for programming contests or practice platforms. Follow these rules carefully: 1. **Framework fidelity with creative flexibility**: The p…
Figure 8: Prompt Template for Step 3: Problem Synthesis in ADR (algorithmic task). ## Task You are given an algorithm problem. Your task is to generate both the `solution code` and the `test case generator code` for that algorithm problem. ## Output format <|Solution Begin|> [Solution Code in Python] <|Solution End|> <|Test Case Generator Begin|> [Test Case Generator in Python] <|Test Case Genera…
Figure 9: Prompt Template for Step 4: Execution-grounded Validation in ADR (algorithmic task). You are given a programming problem and a reference correct solution. Your task is to generate exactly 5 distinct **near-miss solutions**. A near-miss solution is: - Logically plausible and well-structured - Likely to pass many simple or random test cases - Incorrect due to subtle flaws (e.g., edge cases, o…
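Figures 8 and 9 outline how the algorithmic pipeline obtains a reference solution plus a test-case generator, and then asks for near-miss solutions that are plausible but subtly wrong. One natural way to use those near-misses, assumed here rather than taken from the paper, is as mutants: a test suite is stronger the more of them it rejects. The sketch runs solutions in-process as plain Python callables (a real pipeline would sandbox them with timeouts), and `build_tests`, `passes`, and `near_miss_kill_rate` are hypothetical names.

```python
from typing import Callable, Sequence

Solution = Callable[[str], str]  # maps a stdin string to a stdout string


def build_tests(reference: Solution, inputs: Sequence[str]) -> list[tuple[str, str]]:
    """Execution-grounded labelling: expected outputs come from running the reference.

    Inputs on which the reference raises are dropped instead of becoming tests.
    """
    tests = []
    for stdin in inputs:
        try:
            tests.append((stdin, reference(stdin)))
        except Exception:
            continue
    return tests


def passes(solution: Solution, tests: Sequence[tuple[str, str]]) -> bool:
    """True if the solution reproduces every expected output without raising."""
    for stdin, expected in tests:
        try:
            if solution(stdin).strip() != expected.strip():
                return False
        except Exception:
            return False
    return True


def near_miss_kill_rate(near_misses: Sequence[Solution],
                        tests: Sequence[tuple[str, str]]) -> float:
    """Fraction of plausible-but-wrong solutions that the test suite rejects."""
    if not near_misses:
        raise ValueError("no near-miss solutions supplied")
    return sum(not passes(s, tests) for s in near_misses) / len(near_misses)
```

For a "print the maximum of the input integers" task, a near-miss that seeds the maximum with 0 survives a test such as `"1 2 3"` but is rejected once an all-negative input like `"-5 -2 -9"` is generated, which is the kind of edge case the near-miss prompt mentions.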
Figure 10: Prompt Template for Step 5: Adversarial Solution Space Refinement in ADR. You are given: - A programming problem - A correct reference solution
Figure 11: Prompt Template for Step 5: Adversarial Solution Space Refinement in ADR. Please analyze the following tool-calling code problem according to the guidelines below. The output should follow the format below (do not add any other notes or explanations): <answer> Computational Objective: Tool Dependency Set: Processing Logic Constraints: Input Interface: Output Specification: </answer> ### Gui…
Figure 12: Prompt Template for Step 1: Element Extraction in ADR (tool usage task). You are an expert in tool-calling code task abstraction and schema-level task design. You are given: - One randomly sampled Computational Objective - One randomly sampled Tool Dependency Set - Three reference sets, each consisting of: - Processing Logic Constraints - Input Interface - Output Specification Your task is to desi…
Figure 13: Prompt Template for Step 2: Controlled Recombination in ADR (tool usage task). You are an expert in **tool-calling code task synthesis**. You are given a complete task specification expressed as **five abstract task elements**: * Computational Objective * Tool Dependency Set * Processing Logic Constraints * Input Interface * Output Specification Your task is to **synthesize a single, complete tool…
Figure 14: Prompt Template for Step 3: Problem Synthesis in ADR (tool usage task). ## Task You are given a tool-calling code problem. Your task is to generate both the `solution code` and the `test code` in pytest for that problem. ## Output format <|Solution Begin|> [Solution Code in Python] <|Solution End|> <|Test Code Begin|> [Test Code in Pytest] <|Test Code End|> ## Example <|Solution Be…
Figure 15: Prompt Template for Step 4: Execution-grounded Validation in ADR (tool usage task). Please analyze the following data science problem according to the guidelines below. The output should follow the format below (do not add any other notes or explanations): <answer> Data Schema: Task Goal: Output Contract: Implementation Environment: Behavioral Constraints: </answer> ### Guidelines 1. Data Sch…
Figure 16: Prompt Template for Step 1: Element Extraction in ADR (data science task). You are an expert in data science code task abstraction and schema-level task design. You are given: - One randomly sampled Task Goal - One randomly sampled Data Schema - Three reference sets, each consisting of: - Output Contract - Implementation Environment - Behavioral Constraints Your task is to design a completely new da…
Figure 17: Prompt Template for Step 2: Controlled Recombination in ADR (data science task). You are an expert in **data science code task synthesis**. You are given a complete task specification expressed as **five abstract task elements**: * Data Schema * Task Goal * Output Contract * Implementation Environment * Behavioral Constraints Your task is to **synthesize a single, complete data science code problem…
Figure 18: Prompt Template for Step 3: Problem Synthesis in ADR (data science task). You are an expert in **data science code task synthesis**. You are given a complete task specification expressed as **five abstract task elements**: * Data Schema * Task Goal * Output Contract * Implementation Environment * Behavioral Constraints
Figure 19: Prompt Template for Step 4: Execution-grounded Validation in ADR (data science task). You are an expert in code task abstraction and problem schema design. You are given: 1. A specified code task type 2. Three concrete example tasks of this type
Figure 20: Prompt Template for Info-Guided Element Schema Optimization (initialize element schema) in ADR. You are an expert in {task_type} design and representation learning. Your task is to improve the ELEMENT SCHEMA used to synthesize {task_type} problems. ### Current element schema {old_element_schema} ### Your goal The Schema has been decomposed into a set of elements, and for each element you are prov…
Figure 21: Prompt Template for Info-Guided Element Schema Optimization (optimize schema based on the information theory metrics) in ADR.
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that lie near the model's edge of competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the size of its synthesis. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers greater improvements in code ability across RLVR in diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Atomic Decomposition and Recombination (ADR), a framework that decomposes existing code tasks into atomic elements and recombines them under controlled conditions to synthesize novel, verifiable code tasks for Reinforcement Learning with Verifiable Rewards (RLVR). It claims that ADR-generated tasks exhibit superior originality, difficulty, diversity, and test quality relative to heuristic baselines, and that RLVR training on these tasks yields larger gains in downstream code abilities across algorithmic programming, tool usage, and data science domains.

Significance. If the experimental results hold, ADR would offer a principled, scalable route to the high-quality, edge-of-competence data that currently limits RLVR for code; the combinatorial paradigm could generalize beyond code and reduce reliance on hand-crafted or heuristically expanded seeds.

major comments (2)
  1. [Abstract] The central empirical claim (that ADR achieves superior originality, difficulty, diversity, and test quality and delivers greater downstream gains) is asserted without any reported metrics, baselines, statistical tests, or sample sizes, which prevents assessing whether the data support the stated improvements.
  2. The weakest assumption (that atomic decomposition preserves verifiability and that recombination reliably yields non-trivial, valid tasks) is load-bearing for the entire synthesis pipeline, yet the provided text supplies no quantitative validation (e.g., rejection rates, human verification scores, or comparison of task solvability distributions) that would confirm the assumption holds at scale.
minor comments (1)
  1. The title refers to 'Combinatorial Synthesis' while the body uses 'Atomic Decomposition and Recombination (ADR)'; a brief clarification of the relationship between the two phrases would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central empirical claim (that ADR achieves superior originality, difficulty, diversity, and test quality and delivers greater downstream gains) is asserted without any reported metrics, baselines, statistical tests, or sample sizes, which prevents assessing whether the data support the stated improvements.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. While the body of the paper reports the relevant metrics, baselines, and comparisons, the abstract currently summarizes at a high level. In revision we will add concise numerical highlights (e.g., relative gains in originality and downstream performance) together with the primary baselines used. revision: yes

  2. Referee: The weakest assumption (that atomic decomposition preserves verifiability and that recombination reliably yields non-trivial, valid tasks) is load-bearing for the entire synthesis pipeline, yet the provided text supplies no quantitative validation (e.g., rejection rates, human verification scores, or comparison of task solvability distributions) that would confirm the assumption holds at scale.

    Authors: We acknowledge the need for explicit quantitative support of this core assumption. The current manuscript contains qualitative discussion and some validity checks, but lacks the requested aggregate statistics. We will add a dedicated analysis section or table reporting rejection rates, any human verification results, and solvability distributions across the generated tasks. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes the ADR framework for generating verifiable code tasks through atomic decomposition and recombination, with all central claims (superior originality, difficulty, diversity, and downstream RLVR improvements) resting on reported experimental comparisons against baselines. No equations, fitted parameters, or self-referential definitions appear in the abstract or described methods; the derivation chain consists of an empirical pipeline whose outputs are externally validated rather than reducing to inputs by construction. Self-citations, if present, are not load-bearing for the core claims, which remain falsifiable via independent replication of the task-generation and RLVR training results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Review based on abstract only; ledger populated from stated premises in the abstract.

axioms (2)
  • domain assumption: Verifiable code tasks admit decomposition into atomic elements that preserve verifiability.
    Foundational premise for the ADR generation process described in the abstract.
  • domain assumption: Controlled recombination of atomic elements produces novel and more challenging tasks.
    Core mechanism claimed to overcome limitations of heuristic synthesis.
invented entities (1)
  • Atomic Decomposition and Recombination (ADR) framework (no independent evidence)
    purpose: Generate novel verifiable code tasks for RLVR
    New synthesis method introduced to address data scarcity.

pith-pipeline@v0.9.1-grok · 5739 in / 1200 out tokens · 21069 ms · 2026-06-28T23:06:10.374771+00:00 · methodology


Reference graph

Works this paper leans on

103 extracted references · 24 canonical work pages · 17 internal anchors

  1. [1]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025

  3. [3]

    A Survey of Reinforcement Learning for Large Reasoning Models

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  5. [5]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

  6. [6]

    Agents in software engineering: Survey, landscape, and vision

    Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. Agents in software engineering: Survey, landscape, and vision. Automated Software Engineering, 32(2):70, 2025

  7. [7]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245, 2025

  8. [8]

    Position: Will we run out of data? Limits of LLM scaling based on human-generated data

    Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? Limits of LLM scaling based on human-generated data. In International Conference on Machine Learning, pages 49523–49544. PMLR, 2024

  9. [9]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

  10. [10]

    STaR: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  11. [11]

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

  12. [12]

    RL grokking recipe: How does RL unlock and transfer new algorithms in LLMs?

    Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, and Dawn Song. RL grokking recipe: How does RL unlock and transfer new algorithms in LLMs? arXiv preprint arXiv:2509.21016, 2025

  13. [13]

    On the interplay of pre-training, mid-training, and RL on reasoning language models

    Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and RL on reasoning language models. arXiv preprint arXiv:2512.07783, 2025

  14. [14]

    RLPR: Extrapolating RLVR to general domains without verifiers

    Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, et al. RLPR: Extrapolating RLVR to general domains without verifiers. arXiv preprint arXiv:2506.18254, 2025

  15. [15]

    OpenCoder: The open cookbook for top-tier code large language models

    Siming Huang, Tianhao Cheng, Jason Klein Liu, Weidi Xu, Jiaran Hao, Liuyihan Song, Yang Xu, Jian Yang, Jiaheng Liu, Chenchen Zhang, et al. OpenCoder: The open cookbook for top-tier code large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33167–33193, 2025

  16. [16]

    Magicoder: Empowering code generation with OSS-Instruct

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-Instruct. In International Conference on Machine Learning, pages 52632–52657, 2024

  17. [17]

    Code Alpaca: An instruction-following LLaMA model for code generation

    Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023

  18. [18]

    ACECODER: Acing coder RL via automated test-case synthesis

    Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. ACECODER: Acing coder RL via automated test-case synthesis. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

  19. [19]

    KodCode: A diverse, challenging, and verifiable synthetic dataset for coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, 2025

  20. [20]

    WizardCoder: Empowering code large language models with Evol-Instruct

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering code large language models with Evol-Instruct. In ICLR, 2024

  21. [21]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025

  22. [22]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  23. [23]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025

  24. [24]

    RLEF: Grounding code LLMs in execution feedback with reinforcement learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. RLEF: Grounding code LLMs in execution feedback with reinforcement learning. In International Conference on Machine Learning, 2025

  25. [25]

    Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains

    Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, and Dong Yu. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025

  26. [26]

    Rubrics as rewards: Reinforcement learning beyond verifiable domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean M Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. In NeurIPS 2025 Workshop on Efficient Reasoning, 2025

  27. [27]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  28. [28]

    AutoCode: LLMs as problem setters for competitive programming

    Shang Zhou, Zihan Zheng, Kaiyuan Liu, Zeyu Shen, Zerui Cheng, Zexing Chen, Hansen He, Jianzhu Yao, Huanzhi Mao, Qiuyang Mang, et al. AutoCode: LLMs as problem setters for competitive programming. arXiv preprint arXiv:2510.12803, 2025

  29. [29]

    UniCode: A framework for generating high quality competitive coding problems

    Xinyue Zheng, Haowei Lin, Shaofei Cai, Zilong Zheng, and Yitao Liang. UniCode: A framework for generating high quality competitive coding problems. arXiv preprint arXiv:2510.17868, 2025

  30. [30]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008

  31. [31]

    TACO: Topics in algorithmic code generation dataset

    Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023

  32. [32]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  33. [33]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  34. [34]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  35. [35]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  36. [36]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  37. [37]

    LiveCodeBench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations

  38. [38]

    BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions

    Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. In The Thirteenth International Conference on Learning Representations

  39. [39]

    DS-1000: A natural and reliable benchmark for data science code generation

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023

  40. [40]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  41. [41]

    Competition-level code generation with AlphaCode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.

  42. [42]

    Codeforces Dataset, 2022

    Jur1cek. Codeforces Dataset, 2022.

    A Additional Experimental Setups

    A.1 Evaluation of Synthetic Data Quality

  43. [43]

    Originality: We select PrimeIntellect/verifiable-coding-problems as the reference dataset R, which contains 144,169 problems spanning diverse sources, including APPS [40], CodeContests [41], Codeforces [42], and TACO [31]. We obtain data representations using the all-MiniLM-L6-v2 embedding model and compute cosine similarity
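The appendix names the embedding model and the similarity measure, but not how per-problem similarities become an originality score. A minimal sketch, assuming (hypothetically) that each synthesized problem is scored as one minus its maximum cosine similarity to any problem in the reference dataset R, with embeddings already computed (for example with `SentenceTransformer("all-MiniLM-L6-v2").encode(...)` from the sentence-transformers library) and passed in as plain lists of floats:

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def originality_scores(candidates: list[list[float]],
                       reference: list[list[float]]) -> list[float]:
    # Assumption: originality = 1 - max cosine similarity to any reference problem.
    # The source only states that cosine similarity is computed.
    return [1.0 - max(cosine_similarity(c, r) for r in reference) for c in candidates]
```

Under this assumed definition, a problem identical to a reference problem scores 0, and larger scores mean the problem is farther from everything in R.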

  44. [44]

    Difficulty: We choose Qwen/Qwen3-4B, Qwen/Qwen3-8B, and Qwen/Qwen3-14B [34] as the representative model set M, and run them in non-thinking mode.
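The model set M and decoding mode are given, but the scoring rule is not shown in this excerpt. A minimal sketch, assuming difficulty is one minus the mean pass rate across the models in M, where a model's pass rate is the fraction of its sampled solutions that pass all tests; the input format (a dict from model name to per-sample pass/fail booleans) is hypothetical:

```python
def difficulty_score(pass_results: dict[str, list[bool]]) -> float:
    """Assumed difficulty: 1 - mean per-model pass rate (not stated in the source)."""
    rates = [sum(results) / len(results) for results in pass_results.values()]
    return 1.0 - sum(rates) / len(rates)


# Hypothetical results: two sampled solutions per model in M.
example = {
    "Qwen/Qwen3-4B": [False, False],
    "Qwen/Qwen3-8B": [True, False],
    "Qwen/Qwen3-14B": [True, True],
}
```

With these illustrative results the per-model pass rates are 0, 0.5, and 1, so `difficulty_score(example)` is 0.5; a problem no model solves scores 1.0.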

  45. [45]

    Diversity: We obtain data representations using the all-MiniLM-L6-v2 embedding model and compute the Euclidean distance between each pair of problems. For each problem, we then identify its nearest neighbor in the embedding space and record the corresponding nearest-neighbor distance. We compute the coefficient of variation (CV) of all nearest-neighbor d...
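The diversity metric can be sketched from the description above: compute each problem's nearest-neighbor Euclidean distance, then take the coefficient of variation (standard deviation divided by mean) of those distances. The excerpt is truncated before it says which standard deviation is used, so the population standard deviation here is an assumption, as is passing embeddings in as plain lists of floats:

```python
import math
import statistics


def nearest_neighbor_cv(embeddings: list[list[float]]) -> float:
    """CV of nearest-neighbor Euclidean distances; needs at least two points."""
    nn_distances = [
        min(math.dist(u, v) for j, v in enumerate(embeddings) if j != i)
        for i, u in enumerate(embeddings)
    ]
    # Assumption: population standard deviation (the source excerpt is truncated).
    return statistics.pstdev(nn_distances) / statistics.mean(nn_distances)
```

Evenly spaced points give a CV of 0, since every nearest-neighbor distance is equal; uneven spacing raises it.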

  46. [46]

    "input": inputs,  // List of stdin strings

    Test Quality: We evaluate test case quality with an LLM-as-a-Judge framework that considers both test case diversity and edge-case coverage. The corresponding prompt is shown in Figure 5.

    A.2 ADR-synthesized Data in RLVR Experiments

    For algorithm tasks, we follow the ADR paradigm and synthesize 5,000 training examples using DeepSeek-V3.2 [35]. Specifically, we sel...

  47. [47]

    Learn the structural patterns, not the content - Extract from the examples their level of detail, reasoning style, and the way elements relate to each other. Do not reuse or merge their specific ideas.

  48. [48]

    Preserve internal coherence - Ensure the four generated elements naturally support each other: - The Story Background should organically introduce the constraints that motivate the Core Algorithm Idea. - The Strategy Diversity must correspond to the algorithmic structure implied by the Core Algorithm Idea.

  49. [49]

    Maintain originality and avoid conflicts - Your output must be a fully new construction (no copying from examples) and must avoid internal contradictions in constraints, methods, or complexity assumptions.

  50. [50]

    You are an expert problem setter and algorithmist.

    Ensure problem-level richness - The new combination should have enough structure and complexity to support a meaningful algorithm problem, with: - Non-trivial decisions or constraints - Multiple plausible solution approaches - Clear reasons for the difficulty classificati...

  51. [51]

    **Framework fidelity with creative flexibility**: The problem must strictly respect the content, constraints, and core setup defined in the Problem Framework. You may introduce new scenarios, examples, or problem twists only within the limits of the framework, in order to make the problem original and engaging. Do not alter o...

  52. [52]

    **Originality and algorithmic challenge**: Within the framework boundaries, design the problem so that it is non-trivial and requires meaningful algorithmic reasoning. The problem should encourage diverse solution strategies and subtle algorithmic thinking.

  53. [53]

    **No meta-comments or reasoning steps**: Exclude all internal deliberation, step-by-step reasoning, and self-referential phrases. Avoid terms such as "wait", "let's", or any similar language that indicates thinking.

  54. [54]

    **Strict output format**: Produce only a fully completed Problem Template block. The model's entire output must be exactly the filled template, with no extra text or commentary outside the template.

  55. [55]

    **Well-specified constraints**: Ensure input/output and formal constraints are complete, and the intended algorithmic complexity is clear. ### Problem Framework {framework} ### Problem Template ``` **Problem Title:** [Your problem title] **Tags:** [Comma-separated topics/tags] **...

  56. [56]

    Each near-miss solution should be written as full executable code.

  57. [57]

    Each solution must fail for at least one non-trivial or adversarial input.

  58. [58]

    The mistakes should be diverse. Avoid repeating the same error pattern.

  59. [59]

    Do NOT explicitly state what the bug is in the code.

  60. [60]

    Do NOT include explanations inside the code.

  61. [61]

    Output the solutions as a numbered list from 1 to 5. ### Problem description: {problem} ### Correct reference solution: ```python {reference_solution} ```

    Figure 10: Prompt Template for Step 5: Adversarial Solution Space Refinement in ADR.

    You are given: - A programming problem - A correct reference sol...
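The Step 5 prompt requires every near-miss solution to fail on at least one non-trivial or adversarial input. A minimal sketch of checking near-miss candidates against a test suite, assuming (hypothetically) that solutions are Python callables on a single input rather than stdin programs: a near-miss that matches the reference on every test survives, which signals a gap in the tests.

```python
from typing import Any, Callable


def surviving_near_misses(
    reference: Callable[[Any], Any],
    near_misses: list[Callable[[Any], Any]],
    test_inputs: list[Any],
) -> list[int]:
    """Return indices of near-miss solutions that no test input distinguishes from the reference."""
    survivors = []
    for idx, candidate in enumerate(near_misses):
        killed = False
        for x in test_inputs:
            expected = reference(x)
            try:
                killed = candidate(x) != expected
            except Exception:
                killed = True  # a crash also counts as failing this input
            if killed:
                break
        if not killed:
            survivors.append(idx)
    return survivors
```

For example, with a reference that returns `max(xs)`, a near-miss that ignores the last element survives any suite in which the maximum never appears last, which is the kind of implicit assumption the test-generator prompt asks to target.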

  62. [62]

    Analyze the common and uncommon weaknesses likely present in the near-miss solutions.

  63. [63]

    Design test cases that specifically target: - Edge cases - Boundary conditions - Rare corner scenarios - Stress limits (size, value ranges, ordering, structure) - Implicit assumptions likely made by incorrect solutions

  64. [64]

    The generator should be general and reusable, not hardcoded for a single bug.

  65. [65]

    Do NOT explicitly reference individual near-miss solutions in the generator logic.

  66. [66]

    Output the improved test case generator as executable code or clear pseudocode. ### Problem description: {problem} ### Correct reference solution: ```python {reference_solution} ``` ### Near-miss solutions: {near_miss_solutions} ### Current test case generator: ```...
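The test-generator prompt asks for a general, reusable generator that targets boundary conditions, stress limits, and ordering assumptions rather than individual bugs. A minimal sketch for a hypothetical problem whose stdin is `n` followed by `n` integers in `[lo, hi]`, returning a list of stdin strings; the problem format, the `generate_inputs` name, its parameters, and the case mix are all illustrative assumptions:

```python
import random


def generate_inputs(max_n: int, lo: int, hi: int, seed: int = 0) -> list[str]:
    """Build stdin strings for a hypothetical problem: line 1 is n, line 2 is n integers."""
    rng = random.Random(seed)

    def fmt(values: list[int]) -> str:
        return f"{len(values)}\n{' '.join(map(str, values))}\n"

    base = [rng.randint(lo, hi) for _ in range(max_n)]
    cases = [
        [lo],                                              # minimum size, minimum value
        [hi],                                              # minimum size, maximum value
        [hi] * max_n,                                      # size limit, all values equal
        sorted(base),                                      # ascending order
        sorted(base, reverse=True),                        # descending order
        [hi if i % 2 == 0 else lo for i in range(max_n)],  # alternating extremes
        base,                                              # random values at the size limit
    ]
    return [fmt(case) for case in cases]
```

Seeding a private `random.Random` instance keeps the generated suite reproducible across runs without touching global random state.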

  67. [67]

    Computational Objective * **Definition:** An abstract specification of the primary computation or transformation to be performed on data, independent of implementation details. * **Role:** Defines *what* the task fundamentally does; without it, the task has no semantic goal.

  68. [68]

    Tool Dependency Set * **Definition:** The set of external libraries, modules, or tools that must be imported and used to accomplish the task. * **Role:** Distinguishes this task type as a *tool-calling* problem; ensures the task demonstrates use of specific utilities beyond core language c...

  69. [69]

    Processing Logic Constraints * **Definition:** High-level rules or required operations that constrain how the computation must be carried out, without prescribing exact code. * **Role:** Ensures the task exercises particular patterns (e.g., shuffling before computing...

  70. [70]

    Input Interface * **Definition:** A formal description of the function inputs, including parameter names, types, defaults, and constraints. * **Role:** Establishes how external data enters the task; necessary for invoking the computation. * **Variation Axes:** * Number of pa...

  71. [71]

    Output Specification * **Definition:** A precise description of the expected output type, structure, and semantic meaning. * **Role:** Defines task completion criteria; without it, correctness cannot be evaluated. * **Variation Axes:** * Output data type (float, dict, list, etc.) * S...

  72. [73]

    **Five-element completeness** You must generate **exactly five elements**, one for each of the following: * Computational Objective * Tool Dependency Set * Processing Logic Constraints * Input Interface * Output Specification

  73. [74]

    **Consistency constraints** * The Computational Objective must be **achievable** using the Tool Dependency Set. * Processing Logic Constraints must **meaningfully constrain** how the tools are used. * Input Interface must provide **sufficient inf...

  74. [75]

    **Novel recombination** * Treat the three reference sets as **design signals**, not templates. * The resulting element set should be plausibly generatable by recombining ideas, but **must not align exactly with any reference along more than one element**.
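The novelty rule above is concrete enough to check mechanically. A minimal sketch, assuming the five elements are stored as hashable values and that "align exactly" means equality of element values (a semantic-similarity check would be a plausible but unstated alternative); the `ElementSet` class and its field names are hypothetical:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ElementSet:
    # Hypothetical container for the five elements of a tool-calling task.
    computational_objective: str
    tool_dependency_set: frozenset[str]
    processing_logic_constraints: str
    input_interface: str
    output_specification: str


def shared_elements(a: ElementSet, b: ElementSet) -> int:
    """Count the elements on which two element sets are exactly equal."""
    return sum(getattr(a, f.name) == getattr(b, f.name) for f in fields(ElementSet))


def is_novel(candidate: ElementSet, references: list[ElementSet]) -> bool:
    """True if the candidate matches every reference on at most one element."""
    return max((shared_elements(candidate, r) for r in references), default=0) <= 1
```

Reusing one element from a reference (for example, the same Computational Objective) is allowed; reusing two from the same reference is not.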

  75. [76]

    **Tool-calling emphasis** * The Tool Dependency Set must play a **non-trivial role** in enabling or shaping the task. * If tools were removed, the task should lose its defining character. ### Given Computational Objective {computational_objective} ### Given Tool Dependency Set {to...

  76. [79]

    **Mandatory Tool Usage** * The task must inherently require the specified tools. * Removing the tools should break the task.
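A necessary (but not sufficient) check for mandatory tool usage is that a reference solution actually imports every library in the Tool Dependency Set. A minimal sketch using Python's standard `ast` module; treating the top-level package name as the tool identifier is an assumption, and confirming that removing a tool *breaks* the task would additionally require running the tests without it:

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Top-level package names imported anywhere in the given Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports such as "from . import x" have module None and are skipped.
            names.add(node.module.split(".")[0])
    return names


def missing_tools(source: str, tool_dependency_set: set[str]) -> set[str]:
    """Tools from the dependency set that the source never imports."""
    return set(tool_dependency_set) - imported_modules(source)
```

A non-empty result flags a task whose reference solution does not touch a required tool, which would contradict the mandatory-usage rule.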

  77. [80]

    **Template Exactness** * Follow the template structure and section order exactly. * Do not rename, reorder, or omit sections.

  78. [81]

    """Return the sum of a and b."""

    **Header Code Authority** * Use the header code verbatim. * All required imports and the function signature must appear there, and only there. ### Output Template **Problem Description:** <Concise natural-language description instantiating the Computational Objective and Processing...

  79. [82]

    Data Schema * **Definition:** A precise, abstract description of the input data the code will consume: its origin (file, DB, in-memory), high-level structure (table/array/tensor/graph), dimensions or shape signatures, named fields or column types, and any distributional or ordering pro...

  80. [83]

    Task Goal * **Definition:** A concise, testable description of the transformation, computation, or analysis to perform on the input data. It describes the expected semantic outcome (e.g., "reorder rows by index list", "normalize each column in-place", "reshape a 1-D sequence into a 2-D matrix"). * **Rol...
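Each example Task Goal quoted above is phrased so that its outcome can be checked. A minimal sketch of all three as small functions on plain Python lists (a real data-science task would more likely use pandas or NumPy); the function names are hypothetical, and min-max scaling to [0, 1] is an assumed reading of "normalize":

```python
def reorder_rows(rows: list[list[float]], order: list[int]) -> list[list[float]]:
    """Reorder rows by index list: output row k is input row order[k]."""
    return [rows[i] for i in order]


def normalize_columns_in_place(rows: list[list[float]]) -> None:
    """Min-max scale each column to [0, 1] in place (assumes a non-empty rectangular matrix)."""
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        span = hi - lo
        for r in rows:
            # A constant column has no spread; map it to 0.0 rather than divide by zero.
            r[j] = 0.0 if span == 0 else (r[j] - lo) / span


def reshape_1d_to_2d(seq: list[float], n_cols: int) -> list[list[float]]:
    """Reshape a 1-D sequence into a row-major 2-D matrix with n_cols columns."""
    if len(seq) % n_cols:
        raise ValueError("sequence length must be divisible by n_cols")
    return [list(seq[i:i + n_cols]) for i in range(0, len(seq), n_cols)]
```

Because each goal has a precise input/output contract, a verifier can compare a candidate's result against these functions directly.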
