arxiv: 2605.08202 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

Chen Ye, Guang Chen, Hang Yu, Hongtu Zhou, Junqiao Zhao, Qingjun Wang, Yanping Zhao, Ziqiao Wang

Pith reviewed 2026-05-12 02:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords offline reinforcement learningout-of-distribution detectiondiffusion modelsselective regularizationpolicy optimizationvalue estimationoffline RL benchmarks

0 comments

The pith

Diffusion models detect out-of-distribution actions in offline RL and selectively regularize them instead of applying uniform penalties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework called DOSER to solve overestimation of unseen actions in offline reinforcement learning. It trains two diffusion models on the behavior policy and state distribution, then uses single-step denoising reconstruction error to flag out-of-distribution actions. During optimization, the method evaluates predicted next states to decide which OOD actions to suppress and which to encourage. It proves the approach forms a gamma-contraction with a unique fixed point and bounded values, plus an asymptotic performance bound relative to the optimum under approximation errors. Experiments across standard benchmarks show consistent gains over prior penalization methods, especially when the training data is suboptimal.

Core claim

DOSER trains two diffusion models to capture the behavior policy and state distribution, using single-step denoising reconstruction error as a reliable OOD indicator. During policy optimization, it distinguishes between beneficial and detrimental OOD actions by evaluating predicted transitions, selectively suppressing risky actions while encouraging exploration of high-potential ones. It proves that DOSER is a gamma-contraction admitting a unique fixed point with bounded value estimates and provides an asymptotic performance guarantee relative to the optimal policy under model approximation and OOD detection errors.

What carries the argument

The DOSER framework: dual diffusion models combined with single-step denoising reconstruction error for OOD detection and transition evaluation for selective regularization during policy updates.

If this is right

DOSER is a gamma-contraction and therefore admits a unique fixed point with bounded value estimates.
An asymptotic performance guarantee holds relative to the optimal policy under bounded model approximation and OOD detection errors.
The method attains higher returns than prior penalization approaches across extensive offline RL benchmarks.
Gains are largest on suboptimal datasets where uniform penalization overly restricts exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The selective mechanism may allow offline policies to exploit limited data more effectively than purely conservative methods.
The diffusion-based OOD signal could be adapted to detect distribution shifts in other sequential decision problems.
Combining the selective regularization with model-based planning might further improve sample efficiency in low-data regimes.

Load-bearing premise

The single-step denoising reconstruction error accurately identifies OOD actions and that evaluating predicted transitions can reliably separate beneficial from detrimental OOD actions without introducing new errors that break the contraction or performance bounds.

What would settle it

A benchmark run or controlled experiment in which the denoising reconstruction error shows no reliable correlation with actual out-of-distribution status, or in which DOSER produces lower returns than uniform penalization baselines on standard offline RL datasets.

Figures

Figures reproduced from arXiv: 2605.08202 by Chen Ye, Guang Chen, Hang Yu, Hongtu Zhou, Junqiao Zhao, Qingjun Wang, Yanping Zhao, Ziqiao Wang.

**Figure 2.** Figure 2: Overview of the proposed method: (a) Diffusion-based OOD action detection, (b) Inte [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: OOD detection experiments on 1D navigation task, where a higher OOD detection metric [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of OOD action detection performance between CVAE-based reconstruction [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation study on hyperparameters for halfcheetah tasks. We compare different OOD detection thresholds in [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Toy environment and ground truth Q-function heatmap visualization. [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

**Figure 7.** Figure 7: Diffusion-based reconstruction error distribution across datasets. Diffusion models were [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗

**Figure 8.** Figure 8: Diffusion-based reconstruction error distributions on original ID datasets and synthetic [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗

**Figure 9.** Figure 9: ROC curves for diffusion-based OOD detection under different noise scales. [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗

**Figure 10.** Figure 10: Correlation analysis between diffusion-based reconstruction error and negative log [PITH_FULL_IMAGE:figures/full_fig_p028_10.png] view at source ↗

**Figure 11.** Figure 11: Proportions of different action types during policy optimization. [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗

**Figure 12.** Figure 12: Q-value distributions for different action types. [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗

**Figure 13.** Figure 13: Sensitivity analysis of the dynamics model error. [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

**Figure 14.** Figure 14: Sensitivity analysis of the number of critic networks. [PITH_FULL_IMAGE:figures/full_fig_p031_14.png] view at source ↗

**Figure 15.** Figure 15: Sensitivity analysis of the compensation target weight [PITH_FULL_IMAGE:figures/full_fig_p031_15.png] view at source ↗

**Figure 16.** Figure 16: Sensitivity analysis of the number of sampled in-distribution actions [PITH_FULL_IMAGE:figures/full_fig_p032_16.png] view at source ↗

**Figure 17.** Figure 17: Sensitivity analysis of the value of Qmin. C.6.5 Q-VALUE LOWER BOUND Qmin DOSER employs a lower bound Qmin when penalizing detrimental OOD actions. In our main experiments, this value is not treated as a tunable hyperparameter. Instead, it is derived directly from the environment dynamics as Qmin = Rmin 1−γ , which corresponds to the standard minimum achievable return under the given discount factor γ. Fo… view at source ↗

**Figure 18.** Figure 18: Comparison of DOSER with and without ensemble-guided gating. [PITH_FULL_IMAGE:figures/full_fig_p033_18.png] view at source ↗

**Figure 19.** Figure 19: Proportions of different action types with ensemble-guided gating. [PITH_FULL_IMAGE:figures/full_fig_p033_19.png] view at source ↗

**Figure 20.** Figure 20: Learning curves of the component ablation study on Gym-MuJoCo tasks. [PITH_FULL_IMAGE:figures/full_fig_p035_20.png] view at source ↗

**Figure 21.** Figure 21: Learning curves on Adroit tasks. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_21.png] view at source ↗

**Figure 22.** Figure 22: Learning curves on Gym-MuJoCo expert and random tasks. [PITH_FULL_IMAGE:figures/full_fig_p036_22.png] view at source ↗

read the original abstract

Offline reinforcement learning (RL) faces a critical challenge of overestimating the value of out-of-distribution (OOD) actions. Existing methods mitigate this issue by penalizing unseen samples, yet they fail to accurately identify OOD actions and may suppress beneficial exploration beyond the behavioral support. Although several methods have been proposed to differentiate OOD samples with distinct properties, they typically rely on restrictive assumptions about the data distribution and remain limited in discrimination ability. To address this problem, we propose DOSER (Diffusion-based OOD Detection and Selective Regularization), a novel framework that goes beyond uniform penalization. DOSER trains two diffusion models to capture the behavior policy and state distribution, using single-step denoising reconstruction error as a reliable OOD indicator. During policy optimization, it further distinguishes between beneficial and detrimental OOD actions by evaluating predicted transitions, selectively suppressing risky actions while encouraging exploration of high-potential ones. Theoretically, we prove that DOSER is a $\gamma$-contraction and therefore admits a unique fixed point with bounded value estimates. We further provide an asymptotic performance guarantee relative to the optimal policy under model approximation and OOD detection errors. Across extensive offline RL benchmarks, DOSER consistently attains superior performance to prior methods, especially on suboptimal datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DOSER shifts offline RL OOD handling from uniform penalties to diffusion-based detection plus selective regularization, but the contraction proof and single-step error reliability are the parts that need checking.

read the letter

The main point is that this paper trains two diffusion models—one on the behavior policy and one on the state distribution—then uses single-step denoising reconstruction error to flag OOD actions. During optimization it further splits those OOD actions by looking at predicted transitions, suppressing the risky ones while allowing exploration of the promising ones. That selective step is the actual departure from the penalization literature cited in the abstract. It directly targets the practical failure mode where blanket penalties kill useful actions that sit just outside the dataset support, especially on the suboptimal data that real applications usually provide. The claim of a γ-contraction with unique fixed point and an asymptotic guarantee under model and detection errors is stated cleanly, and the benchmarks are said to show consistent gains over prior methods on those harder datasets. Those are the parts that earn credit. The soft spots sit in the same places the stress-test note flags. Single-step denoising error is a coarse proxy; in high-dimensional or multimodal action spaces it can mislabel actions whose longer-horizon likelihood would correctly mark them OOD. If those misclassifications happen, the selective term injects a perturbation whose effect on the contraction constant is not obviously bounded by the paper’s stated error terms. Without the full derivation it is impossible to see whether the Lipschitz constant of the selective operator is absorbed inside γ or whether the asymptotic guarantee survives realistic detection error rates. The abstract also gives no quantitative ablation numbers or controls, so the empirical superiority is asserted rather than demonstrated here. This is for people working on offline RL reliability and distribution-shift handling. A reader who already follows diffusion models in RL or who needs methods that avoid over-penalizing on suboptimal logs will find the method description and the selective-regularization idea useful. It deserves a serious referee because the core idea is distinct, the problem is central, and the claims are concrete enough to be checked and revised.

Referee Report

2 major / 2 minor

Summary. The paper proposes DOSER, a framework for offline RL that trains two diffusion models (one for the behavior policy and one for the state distribution) and uses single-step denoising reconstruction error as an OOD indicator. During optimization, it evaluates predicted transitions to classify OOD actions as beneficial or detrimental, selectively suppressing risky actions while encouraging high-potential ones. The central claims are that the resulting operator is a γ-contraction (hence has a unique fixed point with bounded values) and that an asymptotic performance guarantee holds relative to the optimal policy under model approximation and OOD detection errors; empirically, DOSER outperforms prior methods on offline RL benchmarks, especially suboptimal datasets.

Significance. If the contraction and performance guarantees hold with the stated error bounds, DOSER would offer a substantive improvement over uniform penalization approaches by allowing controlled exploration of beneficial OOD actions. The diffusion-based OOD detection and selective regularization are technically novel for the offline RL setting and could be impactful on datasets with partial coverage. The empirical superiority claim, if supported by ablations and quantitative results, would strengthen the case for moving beyond simple conservatism.

major comments (2)

[Abstract and theoretical analysis] Abstract and theoretical analysis section: The claim that DOSER defines a γ-contraction (and therefore admits a unique fixed point) requires that the selective regularization term—driven by single-step denoising reconstruction errors and predicted-transition evaluations—does not increase the Lipschitz constant beyond γ. The manuscript does not appear to derive an explicit bound showing that misclassification errors from the single-step proxy (which can be coarse in multimodal or high-dimensional spaces) are absorbed within the original contraction factor; without this, the perturbation remains uncontrolled even under the paper's stated model/OOD error assumptions. This is load-bearing for both the fixed-point existence and the asymptotic guarantee.
[Theoretical analysis] Theoretical analysis section (performance guarantee): The asymptotic guarantee relative to the optimal policy is stated to hold under model approximation and OOD detection errors, but the derivation appears to treat the selective term's effect on value estimates as bounded without showing how the sign-dependent regularization (suppress vs. encourage) interacts with the error terms. If the OOD indicator can flip the sign of the perturbation on a non-negligible fraction of actions, the guarantee may not follow from standard error-propagation arguments.

minor comments (2)

[Abstract] The abstract asserts 'superior performance' and 'extensive benchmarks' but provides no quantitative deltas, dataset list, or ablation controls; these details should be summarized with effect sizes and statistical significance to allow readers to assess the strength of the empirical claim.
[Method and notation] Notation for the two diffusion models and the precise definition of the selective regularization operator (including how predicted transitions are computed and thresholded) should be introduced earlier and used consistently in the method and theory sections to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of DOSER's selective regularization over uniform penalization. We address the two major comments on the theoretical analysis below, providing clarifications from the manuscript and indicating where revisions will strengthen the exposition.

read point-by-point responses

Referee: The claim that DOSER defines a γ-contraction (and therefore admits a unique fixed point) requires that the selective regularization term—driven by single-step denoising reconstruction errors and predicted-transition evaluations—does not increase the Lipschitz constant beyond γ. The manuscript does not appear to derive an explicit bound showing that misclassification errors from the single-step proxy (which can be coarse in multimodal or high-dimensional spaces) are absorbed within the original contraction factor; without this, the perturbation remains uncontrolled even under the paper's stated model/OOD error assumptions.

Authors: We appreciate the referee's careful scrutiny of the contraction proof. The theoretical analysis shows that the selective regularization term contributes a perturbation whose sup-norm is bounded by the OOD detection error ε (via the single-step denoising reconstruction error and transition prediction). Under the paper's assumptions, this perturbation is absorbed such that the composite operator remains a γ-contraction when ε < (1-γ)/2, following standard arguments for approximate Bellman operators. To make the absorption of misclassification errors explicit—particularly for the single-step proxy in multimodal settings—we will add a supporting lemma deriving the Lipschitz bound on the selective term and its dependence on the stated error assumptions. revision: partial
Referee: The asymptotic guarantee relative to the optimal policy is stated to hold under model approximation and OOD detection errors, but the derivation appears to treat the selective term's effect on value estimates as bounded without showing how the sign-dependent regularization (suppress vs. encourage) interacts with the error terms. If the OOD indicator can flip the sign of the perturbation on a non-negligible fraction of actions, the guarantee may not follow from standard error-propagation arguments.

Authors: We agree that the interaction of sign-dependent regularization with error terms merits clearer derivation. The performance guarantee proof bounds the total perturbation via the triangle inequality after separating beneficial (encouraged) and detrimental (suppressed) OOD actions, with the sign determined by the predicted-transition evaluation whose accuracy is controlled by the model approximation error. The regularization magnitude is scaled by the denoising error, limiting the effect of any sign flips. We will revise the proof to insert an intermediate step explicitly showing that the propagated error remains O(ε + δ) (where δ is the OOD detection error) without the sign dependence invalidating the bound. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper trains diffusion models on behavior policy and state distribution data, defines an OOD indicator via single-step denoising error, applies selective regularization during policy optimization, and then proves the resulting operator is a γ-contraction with unique fixed point plus an asymptotic guarantee expressed in terms of model and detection errors. These steps follow standard RL contraction arguments once the operator is explicitly defined; the error terms are treated as exogenous bounds rather than quantities fitted inside the same equations. No self-definitional reduction, fitted input renamed as prediction, or load-bearing self-citation chain appears in the abstract or claimed theoretical results. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on two domain assumptions about the diffusion models and the transition predictor; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption Diffusion models trained on the behavior policy and state distribution yield a single-step denoising reconstruction error that reliably flags OOD actions.
This is the core OOD indicator used throughout training and policy optimization.
domain assumption Evaluating the predicted next state under an OOD action can distinguish beneficial from detrimental actions without introducing errors that break the contraction property.
This assumption enables the selective (rather than uniform) regularization step.

pith-pipeline@v0.9.0 · 5542 in / 1484 out tokens · 51969 ms · 2026-05-12T02:09:16.920628+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Theorem 1 (Contraction mapping property). ... ||T_DOSER Q1 − T_DOSER Q2||_∞ ≤ γ||Q1 − Q2||_∞. By the Banach fixed-point theorem...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
using single-step denoising reconstruction error as a reliable OOD indicator

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 10 internal anchors

[1]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Offline reinforcement learning: Tutorial, review, and perspectives on open problems , author=. arXiv preprint arXiv:2005.01643 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2005
[2]

International conference on machine learning , pages=

Off-policy deep reinforcement learning without exploration , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[3]

Advances in neural information processing systems , volume=

Stabilizing off-policy q-learning via bootstrapping error reduction , author=. Advances in neural information processing systems , volume=

work page
[4]

Behavior Regularized Offline Reinforcement Learning

Behavior regularized offline reinforcement learning , author=. arXiv preprint arXiv:1911.11361 , year=

work page internal anchor Pith review arXiv 1911
[5]

Advances in neural information processing systems , volume=

A minimalist approach to offline reinforcement learning , author=. Advances in neural information processing systems , volume=

work page
[6]

Offline Reinforcement Learning with Implicit Q-Learning

Offline reinforcement learning with implicit q-learning , author=. arXiv preprint arXiv:2110.06169 , year=

work page internal anchor Pith review arXiv
[7]

Auto-Encoding Variational Bayes

Auto-encoding variational bayes , author=. arXiv preprint arXiv:1312.6114 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193,

Diffusion policies as an expressive policy class for offline reinforcement learning , author=. arXiv preprint arXiv:2208.06193 , year=

work page arXiv
[9]

Advances in neural information processing systems , volume=

Conservative q-learning for offline reinforcement learning , author=. Advances in neural information processing systems , volume=

work page
[10]

arXiv preprint arXiv:2105.08140 , year=

Uncertainty weighted actor-critic for offline reinforcement learning , author=. arXiv preprint arXiv:2105.08140 , year=

work page arXiv
[11]

arXiv preprint arXiv:2202.11566 , year=

Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning , author=. arXiv preprint arXiv:2202.11566 , year=

work page arXiv
[12]

Advances in Neural Information Processing Systems , volume=

Supported value regularization for offline reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[13]

arXiv preprint arXiv:2212.04607 , year=

Confidence-conditioned value functions for offline reinforcement learning , author=. arXiv preprint arXiv:2212.04607 , year=

work page arXiv
[14]

IEEE Transactions on Neural Networks and Learning Systems , year=

De-Pessimism Offline Reinforcement Learning via Value Compensation , author=. IEEE Transactions on Neural Networks and Learning Systems , year=

work page
[15]

IEEE Transactions on Neural Networks and Learning Systems , year=

ACL-QL: Adaptive Conservative Level in Q -Learning for Offline Reinforcement Learning , author=. IEEE Transactions on Neural Networks and Learning Systems , year=

work page
[16]

1998 , publisher=

Reinforcement learning: An introduction , author=. 1998 , publisher=

work page 1998
[17]

Reinforcement learning: State-of-the-art , pages=

Batch reinforcement learning , author=. Reinforcement learning: State-of-the-art , pages=. 2012 , publisher=

work page 2012
[18]

International conference on machine learning , pages=

Deep unsupervised learning using nonequilibrium thermodynamics , author=. International conference on machine learning , pages=. 2015 , organization=

work page 2015
[19]

Advances in neural information processing systems , volume=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

work page
[20]

Score-Based Generative Modeling through Stochastic Differential Equations

Score-based generative modeling through stochastic differential equations , author=. arXiv preprint arXiv:2011.13456 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2011
[21]

Advances in neural information processing systems , volume=

Elucidating the design space of diffusion-based generative models , author=. Advances in neural information processing systems , volume=

work page
[22]

Neural computation , volume=

A connection between score matching and denoising autoencoders , author=. Neural computation , volume=. 2011 , publisher=

work page 2011
[23]

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Idql: Implicit q-learning as an actor-critic method with diffusion policies , author=. arXiv preprint arXiv:2304.10573 , year=

work page internal anchor Pith review arXiv
[24]

International Conference on Machine Learning , pages=

Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[25]

Ofﬂine reinforcement learning via high-ﬁdelity generative behavior modeling

Offline reinforcement learning via high-fidelity generative behavior modeling , author=. arXiv preprint arXiv:2209.14548 , year=

work page arXiv
[26]

arXiv preprint arXiv:2310.07297 , year=

Score regularized policy optimization through diffusion behavior , author=. arXiv preprint arXiv:2310.07297 , year=

work page arXiv
[27]

Advances in Neural Information Processing Systems , volume=

Diffusion policies creating a trust region for offline reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[28]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

D4rl: Datasets for deep data-driven reinforcement learning , author=. arXiv preprint arXiv:2004.07219 , year=

work page internal anchor Pith review arXiv 2004
[29]

Advances in neural information processing systems , volume=

Likelihood ratios for out-of-distribution detection , author=. Advances in neural information processing systems , volume=

work page
[30]

Advances in neural information processing systems , volume=

Glow: Generative flow with invertible 1x1 convolutions , author=. Advances in neural information processing systems , volume=

work page
[31]

Deep anomaly detection with outlier exposure.arXiv preprint arXiv:1812.04606, 2018

Deep anomaly detection with outlier exposure , author=. arXiv preprint arXiv:1812.04606 , year=

work page arXiv
[32]

Do Deep Generative Models Know What They Don't Know?

Do deep generative models know what they don't know? , author=. arXiv preprint arXiv:1810.09136 , year=

work page Pith review arXiv
[33]

arXiv preprint arXiv:1909.11480 , year=

Input complexity and out-of-distribution detection with likelihood-based generative models , author=. arXiv preprint arXiv:1909.11480 , year=

work page arXiv 1909
[34]

W., and Lakshminarayanan, B.: Detecting out-of-distribution inputs to deep generative models using typicality, arXiv preprint arXiv:1906.02994,

Detecting out-of-distribution inputs to deep generative models using typicality , author=. arXiv preprint arXiv:1906.02994 , year=

work page arXiv 1906
[35]

arXiv preprint arXiv:1812.02765 (2018)

Improving reconstruction autoencoder out-of-distribution detection with mahalanobis distance , author=. arXiv preprint arXiv:1812.02765 , year=

work page arXiv
[36]

International conference on learning representations , year=

Deep autoencoding gaussian mixture model for unsupervised anomaly detection , author=. International conference on learning representations , year=

work page
[37]

Outlier detection using autoencoders , author=

work page
[38]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Denoising diffusion models for out-of-distribution detection , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[39]

arXiv preprint arXiv:2412.03258 , year=

Learning on one mode: Addressing multi-modality in offline reinforcement learning , author=. arXiv preprint arXiv:2412.03258 , year=

work page arXiv
[40]

Advances in neural information processing systems , volume=

Simple and scalable predictive uncertainty estimation using deep ensembles , author=. Advances in neural information processing systems , volume=

work page
[41]

Advances in Neural Information Processing Systems , volume=

Mopo: Model-based offline policy optimization , author=. Advances in Neural Information Processing Systems , volume=

work page
[42]

international conference on machine learning , pages=

Dropout as a bayesian approximation: Representing model uncertainty in deep learning , author=. international conference on machine learning , pages=. 2016 , organization=

work page 2016
[43]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

DARL: distance-aware uncertainty estimation for offline reinforcement learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[44]

arXiv preprint arXiv:2405.20555 , year=

Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning , author=. arXiv preprint arXiv:2405.20555 , year=

work page arXiv
[45]

Adam: A Method for Stochastic Optimization

A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

SGDR: Stochastic Gradient Descent with Warm Restarts

Sgdr: Stochastic gradient descent with warm restarts , author=. arXiv preprint arXiv:1608.03983 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Sur les op

Banach, Stefan , journal=. Sur les op. 1922 , publisher=

work page 1922
[48]

Soft Actor-Critic Algorithms and Applications

Soft actor-critic algorithms and applications , author=. arXiv preprint arXiv:1812.05905 , year=

work page internal anchor Pith review arXiv
[49]

Proceedings 2001 international conference on image processing (Cat

One-class SVM for learning in image retrieval , author=. Proceedings 2001 international conference on image processing (Cat. No. 01CH37205) , volume=. 2001 , organization=

work page 2001
[50]

International conference on machine learning , pages=

Is pessimism provably efficient for offline rl? , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[51]

International conference on machine learning , pages=

Deep structured energy based models for anomaly detection , author=. International conference on machine learning , pages=. 2016 , organization=

work page 2016
[52]

arXiv preprint arXiv:2005.02359 , year=

Classification-based anomaly detection for general data , author=. arXiv preprint arXiv:2005.02359 , year=

work page arXiv 2005
[53]

Machine Learning , volume=

Regularisation of neural networks by enforcing lipschitz continuity , author=. Machine Learning , volume=. 2021 , publisher=

work page 2021
[54]

SIAM Journal on Control and Optimization , volume=

Finite linear programming approximations of constrained discounted Markov decision processes , author=. SIAM Journal on Control and Optimization , volume=. 2013 , publisher=

work page 2013
[55]

Uncertainty in Artificial Intelligence , pages=

Deterministic policy gradient: Convergence analysis , author=. Uncertainty in Artificial Intelligence , pages=. 2022 , organization=

work page 2022
[56]

International conference on machine learning , pages=

Policy regularization with dataset constraint for offline reinforcement learning , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023
[57]

arXiv preprint arXiv:2405.19909 , year=

Adaptive advantage-guided policy regularization for offline reinforcement learning , author=. arXiv preprint arXiv:2405.19909 , year=

work page arXiv