arXiv preprint arXiv:1705.07798
6 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (6) · years: 2026 (6) · verdicts: UNVERDICTED (6)
citing papers explorer
- Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
  TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without a full RL solve per iteration, outperforming prior imitation methods by 2.4x in aggregate IQM and recovering generalizable rewards. (Toy sketch below.)
- Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
  The paper establishes the first Õ(ε^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability, in both tabular and general function approximation settings. (Toy sketch below.)
- Planning in entropy-regularized Markov decision processes and games
  SmoothCruiser achieves Õ(1/ε^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model. (Toy sketch below.)
- A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
  A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence. (Toy sketch below.)
- POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
  POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(√(T·γ_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design. (Toy sketch below.)
- Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
  Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target. (Toy sketch below.)
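The sketches below are minimal, hedged Python illustrations of the ideas in the entries above; every setup, name, and constant is an assumption of ours, not any paper's implementation.

For the TRIRL entry, a sketch of the dual-ascent scheme under assumed conditions: a random tabular MDP, a linear reward r(s) = θ·φ(s), and an expert given only by a fixed state visitation. The reward parameters act as dual variables updated by gradient ascent on the expert-vs-learner feature-expectation gap, while the policy takes one KL trust-region improvement step per iteration instead of a full RL solve.

```python
import numpy as np

# Toy tabular stand-in (assumed, not the paper's setup): linear reward
# r(s) = theta . phi(s); the "expert" is a fixed state visitation.
rng = np.random.default_rng(0)
S, A, F, gamma, eta, alpha = 20, 4, 8, 0.95, 0.5, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] = next-state distribution
phi = rng.normal(size=(S, F))                  # state features
mu_expert = rng.dirichlet(np.ones(S))          # stand-in expert visitation

def q_eval(r, pi, iters=200):
    """Tabular policy evaluation of Q^pi for state reward r."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r[:, None] + gamma * P @ (pi * Q).sum(axis=1)
    return Q

def visitation(pi, horizon=200):
    """Normalized discounted state visitation of pi from a uniform start."""
    d, mu = np.ones(S) / S, np.zeros(S)
    for t in range(horizon):
        mu += gamma**t * d
        d = np.einsum('s,sa,sat->t', d, pi, P)
    return (1 - gamma) * mu

theta = np.zeros(F)
pi = np.ones((S, A)) / A
for _ in range(100):
    Q = q_eval(phi @ theta, pi)
    # Trust-region local policy update: one KL-regularized improvement
    # step, pi_new ∝ pi * exp(eta * Q) (max subtracted for stability),
    # rather than a full RL solve per iteration.
    pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    # Explicit dual ascent on the reward parameters: move theta along
    # the expert-vs-learner feature-expectation gap.
    theta += alpha * phi.T @ (mu_expert - visitation(pi))
```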
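For the forward-KL bandit paper, a sketch of the per-context policy such an objective induces, assuming the form max_π E_π[r] − β·KL(π_ref ‖ π) over a finite action set (our assumption; the paper's exact objective and estimators are not reproduced here). First-order conditions give π(a) = β·π_ref(a)/(λ − r(a)) with the multiplier λ > max_a r(a) fixed by normalization, which bisection finds.

```python
import numpy as np

def forward_kl_policy(r, pi_ref, beta, iters=100):
    """Assumed objective: max_pi E_pi[r] - beta * KL(pi_ref || pi),
    with pi_ref > 0. Stationarity on the simplex gives
    pi(a) = beta * pi_ref(a) / (lam - r(a)); bisect on lam > max(r)
    until the probabilities sum to one."""
    lo = r.max() + 1e-9
    hi = r.max() + beta / pi_ref.min() + (r.max() - r.min()) + 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        mass = (beta * pi_ref / (lam - r)).sum()
        lo, hi = (lam, hi) if mass > 1.0 else (lo, lam)
    return beta * pi_ref / (lam - r)

r = np.array([0.2, 0.9, 0.5])        # offline reward estimates, 3 actions
pi_ref = np.array([0.5, 0.3, 0.2])   # behavior/reference policy
pi = forward_kl_policy(r, pi_ref, beta=0.5)
print(pi, pi.sum())                  # mass tilts toward a=1, still covers a=0,2
```

Note the mass-covering behavior characteristic of forward KL: every action with π_ref(a) > 0 keeps strictly positive probability, unlike the mode-seeking exponential tilting of reverse KL.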
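For the entropy-regularized planning paper, a tabular stand-in for the smoothed Bellman backup V(s) = λ·log Σ_a exp(Q(s,a)/λ) that such planners target; SmoothCruiser's actual sampling-based estimation from a generative model is abstracted away, and all constants are illustrative.

```python
import numpy as np

# Tabular soft (entropy-regularized) value iteration; lam is the
# entropy temperature. A stand-in for the smooth Bellman operator,
# not SmoothCruiser's generative-model algorithm.
rng = np.random.default_rng(1)
S, A, gamma, lam = 10, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))

V = np.zeros(S)
for _ in range(500):
    Q = r + gamma * P @ V
    m = Q.max(axis=1)                                  # stable log-sum-exp
    V = m + lam * np.log(np.exp((Q - m[:, None]) / lam).sum(axis=1))
# As lam -> 0 this "smooth max" recovers the hard Bellman backup; the
# smoothness of the operator is what such sample-complexity analyses exploit.
print(V.round(3))
```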
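For the Pareto-coverage paper, a toy preference-conditioned mirror-descent policy iteration: scalarize two tabular reward objectives by a preference vector w, run KL mirror steps π ∝ π·exp(η·Q_w), and sweep w to trace an approximate Pareto coverage set. This is an illustrative tabular reduction, not the paper's deep single-policy method.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, eta = 8, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(2, S, A))                 # two reward objectives

def q_eval(r, pi, iters=300):
    """Tabular policy evaluation of Q^pi for reward r(s, a)."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r + gamma * P @ (pi * Q).sum(axis=1)
    return Q

def md_policy_iteration(w, k=50):
    """KL mirror-descent policy iteration on the w-scalarized reward;
    the O(1/k) rate in the TLDR refers to this kind of scheme."""
    r_w = np.tensordot(w, R, axes=1)            # r_w = w[0]*R0 + w[1]*R1
    pi = np.ones((S, A)) / A
    for _ in range(k):
        Q = q_eval(r_w, pi)
        pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
        pi /= pi.sum(axis=1, keepdims=True)
    return pi

# Sweeping the preference vector traces an approximate Pareto coverage set.
for w0 in (0.0, 0.25, 0.5, 0.75, 1.0):
    pi_w = md_policy_iteration(np.array([w0, 1.0 - w0]))
```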
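For POETS, a toy of the ensemble-as-posterior idea on a three-armed bandit: each ensemble member is a reward model fit to a bootstrapped data stream, and acting greedily under a uniformly drawn member approximates Thompson sampling. The KL regularization, the LLM policies, and the γ_T information-gain quantity are all abstracted away here.

```python
import numpy as np

rng = np.random.default_rng(3)
true_means = np.array([0.1, 0.4, 0.8])         # unknown arm means
K, A, T = 5, 3, 500                            # ensemble size, arms, rounds
counts = np.zeros((K, A))
sums = np.zeros((K, A))

for t in range(T):
    k = rng.integers(K)                        # draw one ensemble member
    means = np.where(counts[k] > 0,
                     sums[k] / np.maximum(counts[k], 1.0),
                     np.inf)                    # optimistic for unseen arms
    a = int(np.argmax(means))                  # act greedily under member k
    reward = rng.normal(true_means[a], 0.1)
    seen = rng.random(K) < 0.5                 # bootstrap mask: roughly half
    counts[seen, a] += 1                       # the members observe each
    sums[seen, a] += reward                    # sample, keeping them diverse
```

Sampling a member and acting greedily is the ensemble surrogate for sampling from a posterior over reward models, which is what yields Thompson-sampling-style exploration without an explicit Bayesian update.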
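For Reward Score Matching, a one-dimensional toy of "score matching to a value-guided target": a fine-tuned score model is regressed onto s_base + λ·∇r. The target form, the linear parameterization, and every constant are our assumptions for illustration, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 0.5

def s_base(x):                 # score of the N(0, 1) base model
    return -x

def grad_reward(x):            # reward r(x) = -(x - 1.5)^2
    return 2.0 * (1.5 - x)

theta = np.zeros(2)            # fine-tuned score s_theta(x) = theta[0]*x + theta[1]
for _ in range(2000):          # SGD on the score-matching regression loss
    x = rng.normal(size=32)
    target = s_base(x) + lam * grad_reward(x)   # value-guided target score
    err = theta[0] * x + theta[1] - target
    theta -= 0.05 * np.array([(err * x).mean(), err.mean()])

# With these choices the target score is -2x + 1.5, the score of
# N(0.75, 0.5): the base model tilted toward high reward.
print(theta.round(2))          # ~[-2.00, 1.50]
```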