
arxiv: 2601.12186 · v3 · pith:V76DHFFEnew · submitted 2026-01-17 · 💻 cs.SE · cs.AI

Aletheia: What Makes RLVR For Code Verifiers Tick?

classification 💻 cs.SE cs.AI
keywords training, code, rlvr, thinking, verifier, verifiers, across, analysis

Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind that of execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary choices along the performance-cost trade-off in RLVR: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifier training recipes across disparate model sizes and covariate shifts across two common verifier application scenarios. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas the thinking budget becomes the most vital factor at larger scales. While leveraging negative samples has a consistent impact on top-1 selection accuracy across sizes, their contribution to ranking reconstruction increases monotonically with scale and plays a key role in stabilizing training at large sizes. Our Pareto optimality analysis demonstrates that eliminating on-policy training at larger model scales yields a verifier that performs comparably to the full RLVR recipe. Furthermore, we find that eschewing thinking traces serves as a compute-efficient strategy at lower budgets, offering a strong trade-off between training cost and verifier accuracy. Ultimately, our work provides the empirical foundation necessary to efficiently deploy robust code verifiers, thereby enabling their wider adoption in post-training pipelines for large code generation models.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

    cs.LG 2026-04 unverdicted novelty 7.0

    NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

  2. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.

  3. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

  4. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 5.0

    On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.