arXiv preprint arXiv:2509.21016

RL Grokking Recipe: How Does RL Unlock, Transfer New Algorithms in LLMs? · 2025 · arXiv 2509.21016

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a contrastive, policy-aware training framework for process reward models. It reduces false positives by 22% on PRMBench and boosts downstream accuracy by up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

ADR generates novel verifiable code tasks via atomic decomposition and recombination, outperforming heuristic baselines in originality, difficulty, and downstream RLVR gains across coding domains.

VeriGate: Verifier-Gated Step-Level Supervision for GRPO

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

VeriGate adds verifier-gated step-level supervision to GRPO via cumulated PRM rewards and group-normalized token advantages, raising accuracy by 20% and 12% on 1.5B and 7B models, respectively, on MATH and six benchmarks.

citing papers explorer

The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning · cs.LG · 2026-06-08 · unverdicted · none · ref 11
VeriGate: Verifier-Gated Step-Level Supervision for GRPO · cs.LG · 2026-05-28 · unverdicted · none · ref 15
