arXiv preprint arXiv:2512.13106 , year =

TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning , author= · 2025 · arXiv 2512.13106

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

representative citing papers

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 5.0

GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting cs.LG · 2026-05-19 · unverdicted · none · ref 5
TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.
GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling cs.LG · 2026-06-03 · unverdicted · none · ref 18
GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-25 · unverdicted · none · ref 44
RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.

arXiv preprint arXiv:2512.13106 , year =

fields

years

verdicts

representative citing papers

citing papers explorer