Pith · machine review for the scientific record

arxiv: 2605.11299 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.CL · cs.SE

Recognition: 2 theorem links

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

Jiawei Han, Richard Bai, Ronan Collobert, Ruixiang Zhang, Yizhe Zhang, Yizhu Jiao

Pith reviewed 2026-05-13 02:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.SE
keywords dual self-training · test-time scaling · code generation · reinforcement learning · GRPO · LiveCodeBench · judgment quality · self-training

The pith

Self-training on ranking its own code candidates improves both judgment quality and single-sample generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code generation models normally receive only sparse pass-or-fail signals on individual programs. The paper argues that sampling multiple candidates at test time creates comparative information about relative correctness that can be turned into a richer training signal. DuST retains groups containing both correct and incorrect attempts, then trains the model to rank them by execution outcome using on-policy reinforcement learning. This dual-space training raises judgment metrics and also lifts generation performance, so that one sample from the trained model matches the base model's best-of-four accuracy. An ablation shows that supervised fine-tuning on the same ranking data improves judgment but leaves generation unchanged, indicating that the on-policy RL step is what transfers the signal back to primal generation.
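
As a point of reference for the test-time scaling setup described above, here is a minimal judge-selected best-of-n sketch in Python. It is not the authors' code: `generate_candidates`, `judge_rank`, and `passes_tests` are hypothetical stand-ins for the model's sampler, its judgment pass, and the sandbox test runner.

```python
from typing import Callable, List

def best_of_n_accuracy(
    problems: List[dict],
    generate_candidates: Callable[[dict, int], List[str]],  # hypothetical: sample n programs
    judge_rank: Callable[[dict, List[str]], List[int]],      # hypothetical: model ranks its own candidates
    passes_tests: Callable[[dict, str], bool],               # hypothetical: sandbox execution of one program
    n: int = 4,
) -> float:
    """Judge-selected best-of-n: sample n candidates per problem, let the model
    rank them, and score only the top-ranked candidate against the hidden tests."""
    solved = 0
    for problem in problems:
        candidates = generate_candidates(problem, n)
        order = judge_rank(problem, candidates)  # candidate indices, best first
        chosen = candidates[order[0]]
        if passes_tests(problem, chosen):
            solved += 1
    return solved / len(problems)
```

Single-sample pass@1 is the degenerate case n = 1; the paper's claim is that after DuST training one rollout reaches what this Best-of-4 procedure achieves for the base model.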

Core claim

DuST samples candidate programs from the model's own distribution, executes them in a sandbox, keeps only mixed-success groups, and optimizes the model with GRPO to rank candidates by correctness. The objective is purely discriminative and never directly rewards correct program generation. Across five models from 4B to 30B parameters, this procedure raises judgment NDCG, single-sample pass@1, and Best-of-4 accuracy on LiveCodeBench, with the trained model's single rollout equaling the base model's Best-of-4 performance.
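
As a rough sketch of the data-side mechanics in this claim (under assumptions the abstract does not spell out), the snippet below shows the mixed-group retention rule and the group-relative advantage normalization that GRPO applies to per-rollout rewards. The group schema and the function names are hypothetical; only the filtering logic and the normalization are standard.

```python
import statistics
from typing import List, Sequence

def keep_mixed_groups(groups: List[dict]) -> List[dict]:
    """Retain only candidate groups containing at least one passing and one
    failing program, since uniform groups carry no comparative signal.
    Assumed (hypothetical) schema per group:
    {"prompt": ..., "candidates": [...], "labels": [True/False per candidate]}."""
    return [g for g in groups if any(g["labels"]) and not all(g["labels"])]

def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """GRPO-style normalization: each rollout's scalar reward is centered and
    scaled by the statistics of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]
```

In DuST the per-rollout reward would score how well a sampled judgment agrees with the execution labels (for instance an NDCG-style ranking score), with the advantages above entering the usual clipped policy-gradient update; that reward shaping is not specified in the abstract, so treat it as an assumption here.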

What carries the argument

The dual judgment space, in which the model learns relative correctness structure across its own plausible attempts rather than isolated pass/fail outcomes.

If this is right

  • Best-of-4 test-time scaling performance improves consistently across two model families and three scales.
  • Judgment quality rises by measurable amounts such as +6.2 NDCG on LiveCodeBench v6 (a minimal NDCG sketch follows this list).
  • Single-sample pass@1 and Best-of-4 accuracy both increase, so one rollout matches the base model's best-of-four.
  • SFT on the ranking data improves judgment without lifting generation, isolating the role of on-policy RL.
  • The gains appear without any direct reward for producing correct programs.
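
For the NDCG figures quoted in the list above, the sketch below shows one standard variant of the metric with binary execution labels; the paper's exact gain, truncation, and aggregation conventions are not stated in the abstract.

```python
import math
from typing import Sequence

def ndcg(predicted_order: Sequence[int], labels: Sequence[int]) -> float:
    """Normalized discounted cumulative gain for one candidate group.

    predicted_order: candidate indices as ranked by the judge, best first.
    labels: binary execution outcomes per candidate (1 = passes the tests).
    """
    dcg = sum(
        labels[idx] / math.log2(rank + 2)  # rank 0 contributes at log2(2)
        for rank, idx in enumerate(predicted_order)
    )
    ideal = sum(
        rel / math.log2(rank + 2)
        for rank, rel in enumerate(sorted(labels, reverse=True))
    )
    return dcg / ideal if ideal > 0 else 0.0

# The judge ranks candidate 2 first, but only candidates 0 and 3 actually pass.
print(ndcg(predicted_order=[2, 0, 1, 3], labels=[1, 0, 0, 1]))  # ≈ 0.65
```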

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-training pattern could be tested on other domains that supply automatic verification, such as mathematical proof steps.
  • If comparative signals from test-time sampling are sufficient to drive generation gains, models may need less external preference data for alignment.
  • Integrating dual judgment training into the pre-training loop might reduce the inference cost of best-of-n sampling at deployment time.
  • The separation between judgment gains from SFT and generation gains from RL suggests that future systems could train separate but coupled judgment and generation heads.

Load-bearing premise

That the ranking signal learned from judging multiple self-generated candidates transfers into better generation specifically because on-policy RL is used, not merely because ranking data is available.

What would settle it

An ablation that applies supervised fine-tuning to the identical ranking labels and measures whether single-sample generation improves by the same margin as with GRPO.
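
A sketch of how such a controlled comparison could be wired up, with every callable hypothetical: the point is that the ranking dataset is built once, so the SFT and GRPO arms differ only in the training objective.

```python
from typing import Callable, Dict

def run_ablation(
    base_model,
    build_ranking_dataset: Callable,  # hypothetical: sample, execute, keep mixed groups
    train_sft: Callable,              # hypothetical: supervised fine-tuning on gold rankings
    train_grpo: Callable,             # hypothetical: on-policy RL with a ranking reward
    eval_judgment_ndcg: Callable,     # hypothetical: judgment quality on held-out groups
    eval_pass_at_1: Callable,         # hypothetical: single-sample generation accuracy
    problems,
    heldout,
) -> Dict[str, Dict[str, float]]:
    """Build the ranking data once so both arms share labels, retention rules,
    and sampling settings; only the objective differs between the arms."""
    data = build_ranking_dataset(base_model, problems)
    models = {
        "base": base_model,
        "sft": train_sft(base_model, data),
        "grpo": train_grpo(base_model, data),
    }
    return {
        name: {
            "judgment_ndcg": eval_judgment_ndcg(m, heldout),
            "pass@1": eval_pass_at_1(m, heldout),
        }
        for name, m in models.items()
    }
```

The paper's reported pattern would show up here as the SFT arm improving judgment NDCG but not pass@1, while the GRPO arm improves both.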

read the original abstract

Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DuST, a self-training framework for code generation models that samples candidate programs from the model's distribution, labels them via sandbox execution, retains groups with both successes and failures, and applies GRPO to train the model to rank candidates by relative correctness in a dual judgment space. The central claim is that this purely discriminative training improves both judgment quality (NDCG) and primal generation (pass@1 and Best-of-4), with an SFT ablation on identical ranking data showing that on-policy RL (GRPO) is required to transfer gains to generation; results are reported across five models on LiveCodeBench, including +6.2 NDCG, +3.1 pass@1, and +4.1 Best-of-4 for Qwen3-30B-Thinking.

Significance. If the results hold with adequate controls and statistical reporting, the work provides evidence that comparative ranking signals derived from test-time scaling can be leveraged for self-improvement in generation without any direct correctness reward, distinguishing it from standard RLHF or SFT approaches. The consistent gains across model scales and the ablation isolating the on-policy mechanism are strengths that could inform future integration of inference-time compute into training.

major comments (2)
  1. [Experimental Evaluation] The reported gains (e.g., +3.1 pass@1 and +4.1 Best-of-4 for Qwen3-30B-Thinking on LiveCodeBench v6) are presented without variance estimates, standard deviations, or results across multiple random seeds, which is load-bearing for assessing whether the improvements are reliable rather than due to run-specific variance.
  2. [Ablation Studies] While the SFT ablation is used to argue that GRPO (rather than ranking data alone) transfers the dual signal to generation, the manuscript does not detail whether group retention criteria, data filtering, or sampling temperature are held exactly identical between SFT and GRPO runs; any difference would undermine the isolation of the on-policy mechanism.
minor comments (1)
  1. [Abstract] The abstract states results for 'five models spanning two families and three scales (4B to 30B)' but does not name the models or exact parameter counts; adding this would improve immediate clarity without affecting the central claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental reporting and ablation clarity. These comments help strengthen the manuscript's rigor. We address each point below and commit to revisions that incorporate the suggestions where feasible.

read point-by-point responses
  1. Referee: [Experimental Evaluation] The reported gains (e.g., +3.1 pass@1 and +4.1 Best-of-4 for Qwen3-30B-Thinking on LiveCodeBench v6) are presented without variance estimates, standard deviations, or results across multiple random seeds, which is load-bearing for assessing whether the improvements are reliable rather than due to run-specific variance.

    Authors: We agree that variance estimates and multi-seed results would improve confidence in the reported gains. Due to the high computational cost of training and evaluating models up to 30B parameters, our primary results reflect single runs. However, the improvements are consistent in direction and magnitude across five models spanning two families and three scales, which provides supporting evidence of reliability beyond a single run. In the revised manuscript, we will add explicit statements noting the single-run nature for the largest models and include standard deviations from additional multi-seed experiments conducted on the smaller models (4B and 8B variants). This constitutes a partial revision that addresses the core concern without requiring full re-execution of all large-scale experiments. revision: partial

  2. Referee: [Ablation Studies] While the SFT ablation is used to argue that GRPO (rather than ranking data alone) transfers the dual signal to generation, the manuscript does not detail whether group retention criteria, data filtering, or sampling temperature are held exactly identical between SFT and GRPO runs; any difference would undermine the isolation of the on-policy mechanism.

    Authors: The data generation process is identical for the SFT and GRPO conditions: both use the same base model sampling temperature, the same group retention rule (only groups containing at least one success and one failure), and the same filtering steps to produce the ranking dataset. The sole controlled difference is the training objective (SFT versus GRPO) applied to this shared dataset. We will revise the Ablation Studies section to explicitly document these shared hyperparameters and preprocessing steps, thereby clarifying that the on-policy mechanism is isolated as the variable of interest. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical self-training method (DuST) that samples programs, labels them via execution, and applies GRPO to learn ranking from dual judgment signals. Central claims rest on held-out benchmark results (LiveCodeBench) and an explicit SFT ablation showing that ranking data alone improves judgment but not generation, while GRPO transfers the signal to primal generation. No derivation, equation, or first-principles claim reduces to fitted parameters or self-citations by construction; the work is a standard empirical ML study whose results are externally measurable and not tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that sandbox execution provides reliable binary correctness labels and that on-policy ranking training can transfer comparative judgment into generative capability. No explicit free parameters or new physical entities are introduced in the abstract.

axioms (2)
  • domain assumption · Sandbox execution feedback accurately distinguishes correct from incorrect code candidates
    Invoked when labeling sampled programs for training groups.
  • domain assumption · On-policy RL on ranking objectives transfers dual-space learning into improved primal generation
    Central to the claim that GRPO succeeds where SFT does not.
invented entities (1)
  • Dual judgment space · no independent evidence
    purpose: Conceptual space providing relative correctness structure across candidate programs
    Introduced as the richer training signal extracted from test-time scaling.

pith-pipeline@v0.9.0 · 5623 in / 1624 out tokens · 50693 ms · 2026-05-13T02:21:35.391205+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.