URL https://arxiv

RL with KL penalties is better viewed as Bayesian inference · 2022 · arXiv 2205.11275

5 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

The choice of closeness measure in diffusion reward alignment determines the computational primitives and tractable reward classes, with linear exponential tilts sufficing for KL with convex rewards and proximal oracles for Wasserstein with concave or low-dimensional Lipschitz rewards.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

Binary Rewards and Reinforcement Learning: Fundamental Challenges

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model but misspecification drives collapse to concentrated valid outputs instead.

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Post-training reweights a pretrained model's behavior distribution either within its existing accessible support (elicitation) or by expanding that support (creation), with both SFT and RL acting as free-energy minimization under different signals.

Exponential families from a single KL identity

cs.LG · 2026-04-30 · accept · novelty 5.0

One KL-difference identity plus non-negativity of KL derives convexity of the log-partition function, Gibbs variational principle, Pythagorean theorems, and tilting formulas for exponential families.

citing papers explorer

Showing 5 of 5 citing papers.

The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives · cs.LG · 2026-05-12 · unverdicted · none · ref 20
Reinforcement Learning via Value Gradient Flow · cs.LG · 2026-04-15 · unverdicted · none · ref 31
Binary Rewards and Reinforcement Learning: Fundamental Challenges · cs.LG · 2026-05-04 · unverdicted · none · ref 11
On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective · cs.AI · 2026-05-08 · unverdicted · none · ref 30
Exponential families from a single KL identity · cs.LG · 2026-04-30 · accept · none · ref 15
