Recognition: 2 theorem links
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
Pith reviewed 2026-05-13 02:21 UTC · model grok-4.3
The pith
Training a model to rank its own code candidates improves both its judgment quality and its single-sample generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DuST samples candidate programs from the model's own distribution, executes them in a sandbox, keeps only mixed-success groups, and optimizes the model with GRPO to rank candidates by correctness. The objective is purely discriminative and never directly rewards correct program generation. Across five models from 4B to 30B parameters, this procedure raises judgment NDCG, single-sample pass@1, and Best-of-4 accuracy on LiveCodeBench, with the trained model's single rollout equaling the base model's Best-of-4 performance.
What carries the argument
The dual judgment space, in which the model learns relative correctness structure across its own plausible attempts rather than isolated pass/fail outcomes.
If this is right
- Best-of-4 test-time scaling performance improves consistently across two model families and three scales.
- Judgment quality rises by measurable amounts such as +6.2 NDCG on LiveCodeBench v6.
- Single-sample pass@1 and Best-of-4 accuracy both increase, so one rollout matches the base model's best-of-four.
- SFT on the ranking data improves judgment without lifting generation, isolating the role of on-policy RL.
- The gains appear without any direct reward for producing correct programs.
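NDCG, the judgment metric cited in the list above, can be computed per candidate group as follows. This is a standard binary-relevance NDCG sketch, not the paper's exact evaluation code.

```python
import math

def ndcg(scores, labels):
    """NDCG for one candidate group: rank candidates by the model's
    judgment scores, then compare the discounted gain of that ranking
    against the ideal ranking of the binary pass/fail labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(rel / math.log2(rank + 2)
                for rank, rel in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0
```

A judge that scores every passing candidate above every failing one gets NDCG 1.0; misordering a pass below a fail discounts the gain.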
Where Pith is reading between the lines
- The same dual-training pattern could be tested on other domains that supply automatic verification, such as mathematical proof steps.
- If comparative signals from test-time sampling are sufficient to drive generation gains, models may need less external preference data for alignment.
- Integrating dual judgment training into the pre-training loop might reduce the inference cost of best-of-n sampling at deployment time.
- The separation between judgment gains from SFT and generation gains from RL suggests that future systems could train separate but coupled judgment and generation heads.
Load-bearing premise
That the ranking signal learned from judging multiple self-generated candidates transfers into better generation specifically because on-policy RL is used, not merely because ranking data is available.
What would settle it
An ablation that applies supervised fine-tuning to the identical ranking labels and measures whether single-sample generation improves by the same margin as with GRPO.
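The ablation hinges on GRPO's on-policy credit assignment. Its core step, the group-relative advantage, is simple to state; the sketch below is a generic illustration of that step, not the paper's implementation, and the reward here would be the discriminative ranking reward rather than program correctness.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's core step: normalize each sampled response's reward
    against the mean and standard deviation of its own group, so the
    policy gradient pushes probability toward above-average responses
    and away from below-average ones."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are centered within each group, an SFT baseline trained on the same labels sees the ranking targets but not this on-policy, relative credit signal, which is exactly the difference the ablation would isolate.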
Original abstract
Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DuST, a self-training framework for code generation models that samples candidate programs from the model's distribution, labels them via sandbox execution, retains groups with both successes and failures, and applies GRPO to train the model to rank candidates by relative correctness in a dual judgment space. The central claim is that this purely discriminative training improves both judgment quality (NDCG) and primal generation (pass@1 and Best-of-4). An SFT ablation on identical ranking data indicates that on-policy RL (GRPO) is required to transfer gains to generation. Results are reported across five models on LiveCodeBench, including +6.2 NDCG, +3.1 pass@1, and +4.1 Best-of-4 for Qwen3-30B-Thinking.
Significance. If the results hold with adequate controls and statistical reporting, the work provides evidence that comparative ranking signals derived from test-time scaling can be leveraged for self-improvement in generation without any direct correctness reward, distinguishing it from standard RLHF or SFT approaches. The consistent gains across model scales and the ablation isolating the on-policy mechanism are strengths that could inform future integration of inference-time compute into training.
major comments (2)
- [Experimental Evaluation] Experimental Evaluation section: the reported gains (e.g., +3.1 pass@1 and +4.1 Best-of-4 for Qwen3-30B-Thinking on LiveCodeBench v6) are presented without variance estimates, standard deviations, or results across multiple random seeds, which is load-bearing for assessing whether the improvements are reliable rather than due to run-specific variance.
- [Ablation Studies] Ablation Studies section: while the SFT ablation is used to argue that GRPO (rather than ranking data alone) transfers the dual signal to generation, the manuscript does not detail whether group retention criteria, data filtering, or sampling temperature are held exactly identical between SFT and GRPO runs; any difference would undermine the isolation of the on-policy mechanism.
minor comments (1)
- [Abstract] The abstract states results for 'five models spanning two families and three scales (4B to 30B)' but does not name the models or exact parameter counts; adding this would improve immediate clarity without affecting the central claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on experimental reporting and ablation clarity. These comments help strengthen the manuscript's rigor. We address each point below and commit to revisions that incorporate the suggestions where feasible.
Point-by-point responses
- Referee: [Experimental Evaluation] Experimental Evaluation section: the reported gains (e.g., +3.1 pass@1 and +4.1 Best-of-4 for Qwen3-30B-Thinking on LiveCodeBench v6) are presented without variance estimates, standard deviations, or results across multiple random seeds, which is load-bearing for assessing whether the improvements are reliable rather than due to run-specific variance.
Authors: We agree that variance estimates and multi-seed results would improve confidence in the reported gains. Due to the high computational cost of training and evaluating models up to 30B parameters, our primary results reflect single runs. However, the improvements are consistent in direction and magnitude across five models spanning two families and three scales, which provides supporting evidence of reliability beyond a single run. In the revised manuscript, we will add explicit statements noting the single-run nature for the largest models and include standard deviations from additional multi-seed experiments conducted on the smaller models (4B and 8B variants). This constitutes a partial revision that addresses the core concern without requiring full re-execution of all large-scale experiments.
revision: partial
- Referee: [Ablation Studies] Ablation Studies section: while the SFT ablation is used to argue that GRPO (rather than ranking data alone) transfers the dual signal to generation, the manuscript does not detail whether group retention criteria, data filtering, or sampling temperature are held exactly identical between SFT and GRPO runs; any difference would undermine the isolation of the on-policy mechanism.
Authors: The data generation process is identical for the SFT and GRPO conditions: both use the same base model sampling temperature, the same group retention rule (only groups containing at least one success and one failure), and the same filtering steps to produce the ranking dataset. The sole controlled difference is the training objective (SFT versus GRPO) applied to this shared dataset. We will revise the Ablation Studies section to explicitly document these shared hyperparameters and preprocessing steps, thereby clarifying that the on-policy mechanism is isolated as the variable of interest.
revision: yes
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical self-training method (DuST) that samples programs, labels them via execution, and applies GRPO to learn ranking from dual judgment signals. Central claims rest on held-out benchmark results (LiveCodeBench) and an explicit SFT ablation showing that ranking data alone improves judgment but not generation, while GRPO transfers the signal to primal generation. No derivation, equation, or first-principles claim reduces to fitted parameters or self-citations by construction; the work is a standard empirical ML study whose results are externally measurable and not tautological.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Sandbox execution feedback accurately distinguishes correct from incorrect code candidates
- domain assumption: On-policy RL on ranking objectives transfers dual-space learning into improved primal generation
invented entities (1)
- Dual judgment space: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (relevance unclear). Linked claim: "DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (relevance unclear). Linked claim: "SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation."