Extreme Q-Learning: MaxEnt RL without Entropy
6 Pith papers cite this work.
fields: cs.LG (6)
roles: background (2)
polarities: background (2)
citing papers
- Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning
  CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.
- Spectral Souping: A Unified Framework for Online Preference Alignment
  Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.
- Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer
  Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.
- Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
  FAN simplifies expressive flow policies and distributional critics in offline RL via single-iteration behavior regularization and single-sample noise conditioning to claim SOTA performance with lower training and inference time.
- Fisher Decorator: Refining Flow Policy via a Local Transport Map
  Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
- IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
  IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.