The paper introduces Random-Reset Policy Optimization (RRPO) and Self-Reset Policy Optimization (SRPO), two methods that use resets to enable more precise credit assignment in RL for language model reasoning. SRPO outperforms both GRPO and RRPO across benchmarks.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Credit Assignment with Resets in Language Model Reasoning