arXiv preprint arXiv:1705.07798
6 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (6) · years: 2026 (6) · verdicts: UNVERDICTED (6)
citing papers explorer
- Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
  TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without a full RL solve per iteration, outperforming prior imitation methods by 2.4x in aggregate IQM and recovering generalizable rewards. (Toy sketch below.)
- Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
  The paper establishes the first Õ(ε^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability, in both tabular and general function approximation settings. (Toy sketch below.)
- Planning in entropy-regularized Markov decision processes and games
  SmoothCruiser achieves Õ(1/ε^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model. (Toy sketch below.)
- A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
  A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence. (Toy sketch below.)
- POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
  POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(√(T·γ_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design. (Toy sketch below.)
- Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
  Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target. (Toy sketch below.)
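The sketches below are minimal, hedged Python illustrations of the ideas in the entries above; every setup, name, and constant is an assumption of ours, not any paper's implementation.

For the TRIRL entry, a sketch of the dual-ascent scheme under assumed conditions: a random tabular MDP, a linear reward r(s) = θ·φ(s), and an expert given only by a fixed state visitation. The reward parameters act as dual variables updated by gradient ascent on the expert-vs-learner feature-expectation gap, while the policy takes one KL trust-region improvement step per iteration instead of a full RL solve.

```python
import numpy as np

# Toy tabular stand-in (assumed, not the paper's setup): linear reward
# r(s) = theta . phi(s); the "expert" is a fixed state visitation.
rng = np.random.default_rng(0)
S, A, F, gamma, eta, alpha = 20, 4, 8, 0.95, 0.5, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] = next-state distribution
phi = rng.normal(size=(S, F))                  # state features
mu_expert = rng.dirichlet(np.ones(S))          # stand-in expert visitation

def q_eval(r, pi, iters=200):
    """Tabular policy evaluation of Q^pi for state reward r."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r[:, None] + gamma * P @ (pi * Q).sum(axis=1)
    return Q

def visitation(pi, horizon=200):
    """Normalized discounted state visitation of pi from a uniform start."""
    d, mu = np.ones(S) / S, np.zeros(S)
    for t in range(horizon):
        mu += gamma**t * d
        d = np.einsum('s,sa,sat->t', d, pi, P)
    return (1 - gamma) * mu

theta = np.zeros(F)
pi = np.ones((S, A)) / A
for _ in range(100):
    Q = q_eval(phi @ theta, pi)
    # Trust-region local policy update: one KL-regularized improvement
    # step, pi_new ∝ pi * exp(eta * Q) (max subtracted for stability),
    # rather than a full RL solve per iteration.
    pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    # Explicit dual ascent on the reward parameters: move theta along
    # the expert-vs-learner feature-expectation gap.
    theta += alpha * phi.T @ (mu_expert - visitation(pi))
```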
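For the forward-KL bandit paper, a sketch of the per-context policy such an objective induces, assuming the form max_π E_π[r] − β·KL(π_ref ‖ π) over a finite action set (our assumption; the paper's exact objective and estimators are not reproduced here). First-order conditions give π(a) = β·π_ref(a)/(λ − r(a)) with the multiplier λ > max_a r(a) fixed by normalization, which bisection finds.

```python
import numpy as np

def forward_kl_policy(r, pi_ref, beta, iters=100):
    """Assumed objective: max_pi E_pi[r] - beta * KL(pi_ref || pi),
    with pi_ref > 0. Stationarity on the simplex gives
    pi(a) = beta * pi_ref(a) / (lam - r(a)); bisect on lam > max(r)
    until the probabilities sum to one."""
    lo = r.max() + 1e-9
    hi = r.max() + beta / pi_ref.min() + (r.max() - r.min()) + 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        mass = (beta * pi_ref / (lam - r)).sum()
        lo, hi = (lam, hi) if mass > 1.0 else (lo, lam)
    return beta * pi_ref / (lam - r)

r = np.array([0.2, 0.9, 0.5])        # offline reward estimates, 3 actions
pi_ref = np.array([0.5, 0.3, 0.2])   # behavior/reference policy
pi = forward_kl_policy(r, pi_ref, beta=0.5)
print(pi, pi.sum())                  # mass tilts toward a=1, still covers a=0,2
```

Note the mass-covering behavior characteristic of forward KL: every action with π_ref(a) > 0 keeps strictly positive probability, unlike the mode-seeking exponential tilting of reverse KL.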
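For the entropy-regularized planning paper, a tabular stand-in for the smoothed Bellman backup V(s) = λ·log Σ_a exp(Q(s,a)/λ) that such planners target; SmoothCruiser's actual sampling-based estimation from a generative model is abstracted away, and all constants are illustrative.

```python
import numpy as np

# Tabular soft (entropy-regularized) value iteration; lam is the
# entropy temperature. A stand-in for the smooth Bellman operator,
# not SmoothCruiser's generative-model algorithm.
rng = np.random.default_rng(1)
S, A, gamma, lam = 10, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))

V = np.zeros(S)
for _ in range(500):
    Q = r + gamma * P @ V
    m = Q.max(axis=1)                                  # stable log-sum-exp
    V = m + lam * np.log(np.exp((Q - m[:, None]) / lam).sum(axis=1))
# As lam -> 0 this "smooth max" recovers the hard Bellman backup; the
# smoothness of the operator is what such sample-complexity analyses exploit.
print(V.round(3))
```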
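For the Pareto-coverage paper, a toy preference-conditioned mirror-descent policy iteration: scalarize two tabular reward objectives by a preference vector w, run KL mirror steps π ∝ π·exp(η·Q_w), and sweep w to trace an approximate Pareto coverage set. This is an illustrative tabular reduction, not the paper's deep single-policy method.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, eta = 8, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(2, S, A))                 # two reward objectives

def q_eval(r, pi, iters=300):
    """Tabular policy evaluation of Q^pi for reward r(s, a)."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r + gamma * P @ (pi * Q).sum(axis=1)
    return Q

def md_policy_iteration(w, k=50):
    """KL mirror-descent policy iteration on the w-scalarized reward;
    the O(1/k) rate in the TLDR refers to this kind of scheme."""
    r_w = np.tensordot(w, R, axes=1)            # r_w = w[0]*R0 + w[1]*R1
    pi = np.ones((S, A)) / A
    for _ in range(k):
        Q = q_eval(r_w, pi)
        pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
        pi /= pi.sum(axis=1, keepdims=True)
    return pi

# Sweeping the preference vector traces an approximate Pareto coverage set.
for w0 in (0.0, 0.25, 0.5, 0.75, 1.0):
    pi_w = md_policy_iteration(np.array([w0, 1.0 - w0]))
```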
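For POETS, a toy of the ensemble-as-posterior idea on a three-armed bandit: each ensemble member is a reward model fit to a bootstrapped data stream, and acting greedily under a uniformly drawn member approximates Thompson sampling. The KL regularization, the LLM policies, and the γ_T information-gain quantity are all abstracted away here.

```python
import numpy as np

rng = np.random.default_rng(3)
true_means = np.array([0.1, 0.4, 0.8])         # unknown arm means
K, A, T = 5, 3, 500                            # ensemble size, arms, rounds
counts = np.zeros((K, A))
sums = np.zeros((K, A))

for t in range(T):
    k = rng.integers(K)                        # draw one ensemble member
    means = np.where(counts[k] > 0,
                     sums[k] / np.maximum(counts[k], 1.0),
                     np.inf)                    # optimistic for unseen arms
    a = int(np.argmax(means))                  # act greedily under member k
    reward = rng.normal(true_means[a], 0.1)
    seen = rng.random(K) < 0.5                 # bootstrap mask: roughly half
    counts[seen, a] += 1                       # the members observe each
    sums[seen, a] += reward                    # sample, keeping them diverse
```

Sampling a member and acting greedily is the ensemble surrogate for sampling from a posterior over reward models, which is what yields Thompson-sampling-style exploration without an explicit Bayesian update.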
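For Reward Score Matching, a one-dimensional toy of "score matching to a value-guided target": a fine-tuned score model is regressed onto s_base + λ·∇r. The target form, the linear parameterization, and every constant are our assumptions for illustration, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 0.5

def s_base(x):                 # score of the N(0, 1) base model
    return -x

def grad_reward(x):            # reward r(x) = -(x - 1.5)^2
    return 2.0 * (1.5 - x)

theta = np.zeros(2)            # fine-tuned score s_theta(x) = theta[0]*x + theta[1]
for _ in range(2000):          # SGD on the score-matching regression loss
    x = rng.normal(size=32)
    target = s_base(x) + lam * grad_reward(x)   # value-guided target score
    err = theta[0] * x + theta[1] - target
    theta -= 0.05 * np.array([(err * x).mean(), err.mean()])

# With these choices the target score is -2x + 1.5, the score of
# N(0.75, 0.5): the base model tilted toward high reward.
print(theta.round(2))          # ~[-2.00, 1.50]
```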