Why is RLHF alignment shallow? A gradient analysis. arXiv preprint arXiv:2603.04851
2 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: background (1)
polarities: background (1)
citing papers explorer
- HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
  HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
- Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study
  Gradient-based proportional LoRA rank allocation under GRPO reduces accuracy by 4.5 points versus uniform allocation because GRPO gradients are flatter across layers and non-uniform ranks amplify importance differences.