The choice of closeness measure in diffusion reward alignment determines the computational primitives and tractable reward classes, with linear exponential tilts sufficing for KL with convex rewards and proximal oracles for Wasserstein with concave or low-dimensional Lipschitz rewards.
RL with KL penalties is better viewed as Bayesian inference
6 Pith papers cite this work.
citation-role summary: background 1
citation-polarity summary: background 1
representative citing papers
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model but misspecification drives collapse to concentrated valid outputs instead.
Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.
Post-training reweights a pretrained model's behavior distribution either within its existing accessible support (elicitation) or by expanding that support (creation), with both SFT and RL acting as free-energy minimization under different signals.
One KL-difference identity plus non-negativity of KL derives convexity of the log-partition function, Gibbs variational principle, Pythagorean theorems, and tilting formulas for exponential families.
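The Gibbs variational principle mentioned in the last summary is also what links KL-penalized RL to Bayesian inference: the KL-penalized objective E_q[r] - beta * KL(q || p) is maximized by the exponential tilt q*(x) proportional to p(x) * exp(r(x) / beta), with optimal value beta * log E_p[exp(r / beta)]. The following is a minimal numerical sketch of that fact on a toy discrete distribution. The five-outcome setting, the value of beta, and all function names are assumptions made for illustration and are not taken from any of the listed papers.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * np.log(q / p)))

def tilted(p, r, beta):
    """Exponential tilt q*(x) proportional to p(x) * exp(r(x) / beta)."""
    logw = np.log(p) + r / beta
    logw -= logw.max()  # stabilise before exponentiating
    w = np.exp(logw)
    return w / w.sum()

def objective(q, p, r, beta):
    """KL-penalised objective E_q[r] - beta * KL(q || p)."""
    return float(q @ r) - beta * kl(q, p)

def log_partition(p, r, beta):
    """beta * log E_p[exp(r / beta)], the value predicted by the Gibbs principle."""
    return beta * float(np.log(np.sum(p * np.exp(r / beta))))

# Toy setup (hypothetical): a random reference distribution and reward.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))   # reference ("prior") distribution
r = rng.normal(size=5)          # reward for each outcome
beta = 0.5                      # KL penalty strength

q_star = tilted(p, r, beta)
best = objective(q_star, p, r, beta)
gap = abs(best - log_partition(p, r, beta))

# Compare against random competitor distributions; none should beat q_star.
others = rng.dirichlet(np.ones(5), size=1000)
max_other = max(objective(q, p, r, beta) for q in others)

print(f"objective at tilt: {best:.6f}, gap to log-partition value: {gap:.2e}")
print(f"best random competitor: {max_other:.6f}")
```

The check works because the objective equals beta * log E_p[exp(r / beta)] minus beta * KL(q || q*), so non-negativity of KL gives both the maximizer and the optimal value at once.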
citing papers explorer
- On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective