The in-sample softmax for offline reinforcement learning. arXiv preprint arXiv:2302.14372.
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.
citing papers explorer
- Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning
  CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.
- SPAR: Support-Preserving Action Rectification
  SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.