arXiv preprint arXiv:2512.13106
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting
  Reinterprets TTRL gains as mostly sharpening rather than learning, and identifies a correct-answer extinction window that causes net corruption; TTRL-Guard mitigates this with FRS, MPS, and RCSU, improving pass@1.
- GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
  GeoMin fits a geometric distribution model on labeled data to assess how reliable self-rewards are, improving semi-supervised RLVR performance with only 10% of the typical annotations.
- When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
  RLAVR uses the Corrective Advantage Gap metric and the CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance under limited annotation budgets.