VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
Theoretical guarantees on the best-of-n alignment policy
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.LG 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
citing papers explorer
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.