SCOPE uses step-wise confidence and dynamic subgroups to create finer pseudo-labels in test-time RL, delivering 13.1% relative gains on AIME 2025 over majority-voting baselines.
Specifically, we generate 16 responses (4 for models with 32k context) per question using a temperature of 0.6 and a top-p value of 0.95
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CL (1)
years: 2025 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning