pith. machine review for the scientific record.

arXiv:2604.17919 · v2 · submitted 2026-04-20 · 💻 cs.LG · cs.RO

Recognition: unknown

Fisher Decorator: Refining Flow Policy via a Local Transport Map

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:23 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords: offline reinforcement learning · flow matching · local transport map · Fisher information matrix · policy refinement · anisotropic regularization · KL-constrained objective

The pith

Modeling flow policy refinement as a local transport map with a Fisher-information quadratic yields controllable error near the optimal solution in offline RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-based offline reinforcement learning methods parameterize policies via flow matching yet suffer from a geometric mismatch: their L2 regularization is isotropic and ignores the anisotropic structure of the behavioral policy manifold. The paper reframes policy improvement as an initial flow policy plus a small residual displacement, which acts as a local transport map between distributions. Differentiating the density under this map produces a quadratic approximation to the KL-regularized objective whose curvature is supplied by the Fisher information matrix. The flow velocity already encodes the required score function, turning the problem into a tractable anisotropic quadratic program. If the claim holds, the optimality gap left by prior isotropic bounds shrinks to a provable, controllable size inside a neighborhood of the true optimum.
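
In symbols, a schematic rendering of that reading (our notation, for a constant displacement $\delta$ at a fixed state $s$; the paper's own parameterization may differ): the refined policy is the pushforward of the behavior policy under $T_s(a) = a + \delta$, and the second-order expansion of the KL divergence is

$$D_{\mathrm{KL}}\big(\pi_\beta(\cdot \mid s) \,\big\|\, (T_s)_\#\, \pi_\beta(\cdot \mid s)\big) = \tfrac{1}{2}\, \delta^\top F(s)\, \delta + O(\|\delta\|^3), \qquad F(s) = \mathbb{E}_{a \sim \pi_\beta}\!\big[\nabla_a \log \pi_\beta(a \mid s)\, \nabla_a \log \pi_\beta(a \mid s)^\top\big].$$

The score $\nabla_a \log \pi_\beta$ is exactly the quantity the flow velocity is said to encode, which is why $F(s)$ comes for free rather than requiring a separate density model.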

Core claim

The optimality gap in earlier flow policies arises from their isotropic L2 upper bound on the 2-Wasserstein distance. In contrast, the local transport map formulation induces a density transformation whose first-order effect is captured exactly by a Fisher-information quadratic form. Optimizing under the corresponding quadratic constraint keeps the solution inside a neighborhood where the approximation error remains controllable, directly addressing the misalignment between isotropic regularization and the anisotropic data geometry.
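
To see the misalignment concretely, here is a minimal numerical contrast of the two regularizers (a sketch with hypothetical numbers, not the paper's construction; `Sigma`, `g`, and `lam` are illustrative stand-ins):

    import numpy as np

    # Behavior density: an elongated Gaussian standing in for an anisotropic
    # action manifold; for N(0, Sigma) the Fisher matrix is inv(Sigma).
    Sigma = np.array([[4.0, 0.0], [0.0, 0.04]])
    F = np.linalg.inv(Sigma)
    g = np.array([1.0, 1.0])   # critic gradient grad_a Q at the current action
    lam = 1.0                  # regularization weight

    # Isotropic L2 penalty: argmax_d  g.d - (lam/2)||d||^2  ->  step along g.
    delta_iso = g / lam
    # Fisher quadratic: argmax_d  g.d - (lam/2) d'Fd  ->  natural-gradient step.
    delta_fisher = np.linalg.solve(F, g) / lam

    print(delta_iso)      # [1.   1.  ]  spends equal budget on both axes
    print(delta_fisher)   # [4.   0.04]  moves along the high-density axis

The isotropic step treats both action dimensions identically; the Fisher step concentrates motion along the direction in which the behavioral density actually extends, which is the gradient misalignment the claim formalizes.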

What carries the argument

The local transport map formed by a base flow policy augmented by a residual displacement, whose effect on the induced density is approximated quadratically by the Fisher information matrix extracted from the flow's score function.
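
The phrase "the score embedded in the flow's velocity" has a concrete form for Gaussian probability paths. A minimal 1D check, assuming the common linear path $x_t = (1-t)x_0 + t\,x_1$ with noise $x_0 \sim \mathcal{N}(0,1)$ (a standard flow-matching convention; the paper's parameterization may differ), for which $\nabla \log p_t(x) = \big(t\, v_t(x) - x\big)/(1-t)$:

    import numpy as np

    # Closed-form toy: data x1 ~ N(mu, s2), noise x0 ~ N(0, 1),
    # linear path x_t = (1 - t) * x0 + t * x1, so p_t = N(t*mu, var_t).
    mu, s2, t = 2.0, 0.25, 0.7
    var_t = (1.0 - t) ** 2 + t ** 2 * s2

    def score_true(x):                  # grad_x log p_t(x) for N(t*mu, var_t)
        return -(x - t * mu) / var_t

    def velocity(x):                    # v_t(x) = E[x1 - x0 | x_t = x]
        e_x1 = mu + (t * s2 / var_t) * (x - t * mu)
        e_x0 = ((1.0 - t) / var_t) * (x - t * mu)
        return e_x1 - e_x0

    x = np.linspace(-2.0, 5.0, 8)
    score_from_v = (t * velocity(x) - x) / (1.0 - t)
    print(np.max(np.abs(score_from_v - score_true(x))))   # ~1e-15

The identity means a trained velocity field doubles as a score estimate, so the Fisher quadratic can be assembled from the flow itself.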

If this is right

  • Optimization directions become density-sensitive and aligned with the behavioral manifold rather than isotropic.
  • The approximation error is bounded and controllable by restricting updates to the small-residual neighborhood.
  • The resulting policy achieves state-of-the-art performance on standard offline RL benchmarks.
  • The framework explains why previous L2-based flow methods systematically misalign gradients.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The local-transport perspective could be applied to refine other score-based or flow-based generative models outside RL.
  • Tracking residual size during training offers a built-in diagnostic for when the quadratic approximation begins to degrade.
  • Density-aware quadratic constraints of this form may improve regularization in broader generative modeling tasks.

Load-bearing premise

The local quadratic approximation to the KL objective stays accurate whenever the residual displacement is small enough that higher-order density changes remain negligible.

What would settle it

On a simple benchmark, measure the gap between the quadratic approximation and the true KL divergence while systematically increasing the size of the residual displacement; the gap should remain bounded only inside the predicted small-displacement neighborhood.
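
A minimal version of that experiment (our sketch; the bimodal base density and constant displacement are illustrative choices, not the paper's setup — a Gaussian base would make the quadratic exact and hide the gap):

    import numpy as np

    # Bimodal base density: a non-Gaussian toy where the Fisher quadratic
    # is only locally accurate, so the gap becomes visible as delta grows.
    def log_p(x):
        c1 = np.exp(-0.5 * (x + 1.5) ** 2)
        c2 = np.exp(-0.5 * (x - 1.5) ** 2)
        return np.log(0.5 * (c1 + c2)) - 0.5 * np.log(2.0 * np.pi)

    xs = np.linspace(-12.0, 12.0, 200001)
    dx = xs[1] - xs[0]
    p = np.exp(log_p(xs))

    # Fisher information of the location family p(x - delta).
    score = np.gradient(log_p(xs), dx)
    fisher = np.sum(p * score ** 2) * dx

    for delta in (0.05, 0.1, 0.2, 0.4, 0.8):
        kl = np.sum(p * (log_p(xs) - log_p(xs - delta))) * dx   # true KL
        quad = 0.5 * fisher * delta ** 2                        # quadratic model
        print(f"delta={delta:4.2f}  KL={kl:.5f}  quad={quad:.5f}")

If the paper is right, the two columns should agree to third order for small displacements and diverge once delta leaves the predicted neighborhood.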

Figures

Figures reproduced from arXiv:2604.17919 by Haoyu Wang, Li Zeng, Wenxuan Yuan, Xiaoyuan Cheng, Zhuo Sun, Ziyan Wang, Zonghao Chen.

Figure 1. Geometric interpretation of offline policy optimization.
Figure 2. Comparison of flow policy refinement paradigms.
Figure 3. Isotropic vs. anisotropic policy refinement.
Figure 4. Offline-to-online fine-tuning performance.
Figure 5. Ablations on the perturbed time $t_\varepsilon$. The caption argues this hyperparameter can be determined from first principles rather than heuristic tuning: analyzing the trade-off between approximation bias and numerical error (Appendix C.3) yields an optimal scaling $\varepsilon^* \sim O(\delta_{\mathrm{FP32}}^{1/4})$, which depends on both machine precision and …
Figure 6. Overview of benchmark tasks; the evaluation spans a diverse set of environments.
Figure 7. Additional examples of isotropic and anisotropic policy refinement.
Original abstract

Recent advances in flow-based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade-offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the $L_2$ regularization as an upper bound of the 2-Wasserstein distance ($W_2$), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the $L_2$ (or upper bound of $W_2$) regularization is isotropic and density-insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state-of-the-art performance across diverse offline RL benchmarks. See project page: https://github.com/ARC0127/Fisher-Decorator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Fisher Decorator method for refining flow-based policies in offline RL. It formulates policy improvement as augmenting an initial flow policy with a residual displacement (a local transport map), analyzes the induced density transformation to obtain a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, and leverages the embedded score function of the flow velocity to derive a corresponding quadratic constraint. This is positioned as addressing the geometric mismatch between isotropic L2 regularization (an upper bound on W2) and the anisotropic behavioral policy manifold, with the claim that the resulting optimality gap is controllable within a provable neighborhood of the optimum. Experiments report state-of-the-art results on standard offline RL benchmarks.

Significance. If the local quadratic approximation is shown to have controllable error with an explicit neighborhood, the work would provide a geometrically principled alternative to isotropic regularization in flow-matching offline RL. The technical device of extracting the quadratic constraint directly from the flow velocity's score function is a strength that could improve both efficiency and alignment with the policy manifold's anisotropy. This could influence subsequent work on regularized flow policies by emphasizing density-sensitive, Fisher-based local models over L2 penalties.

major comments (2)
  1. [Abstract] The central claim that 'our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution' is load-bearing for the contrast with prior isotropic methods, yet the derivation supplies only the second-order Fisher-matrix term, without an explicit remainder bound (e.g., via third-derivative Lipschitz constants of the log-density or score function) or a concrete radius expressed in the Fisher metric. Without this, it is impossible to verify that the residual displacements chosen to improve the policy remain inside the region where the quadratic model is faithful.
  2. [Method derivation] The local quadratic approximation of the KL objective (governed by the Fisher matrix induced by the flow velocity's score) is presented as following from density transformation analysis, but the manuscript does not state the precise conditions under which the cubic and higher terms are negligible relative to the quadratic term for finite displacements; this leaves open whether the operating regime exits the valid neighborhood.
minor comments (2)
  1. The abstract refers to a project page for code; the main text should include at least a brief reproducibility statement or pseudocode for the quadratic-constraint optimization step (a hedged sketch follows this list).
  2. Notation for the residual displacement and the induced density transformation should be introduced with a clear diagram or equation reference early in the method section to aid readability.
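
On minor comment 1: one plausible shape for that pseudocode, as a hedged sketch (the closed form of $\max_\delta\, g^\top \delta$ subject to $\tfrac{1}{2}\delta^\top F \delta \le \epsilon$ is a rescaled natural-gradient step; `refine_action`, `eps`, and `ridge` are our hypothetical names, not the authors' code):

    import numpy as np

    def refine_action(a, grad_q, fisher, eps=0.01, ridge=1e-6):
        """Maximize grad_q . delta subject to 0.5 * delta' F delta <= eps."""
        F = fisher + ridge * np.eye(len(a))      # stabilize the solve
        direction = np.linalg.solve(F, grad_q)   # F^{-1} grad_q
        quad = 0.5 * direction @ F @ direction   # constraint value at direction
        scale = np.sqrt(eps / quad)              # rescale onto the boundary
        return a + scale * direction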

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our paper. We address the major concerns regarding the theoretical guarantees of our local quadratic approximation below. We will revise the manuscript to include explicit bounds and conditions as suggested.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution' is load-bearing for the contrast with prior isotropic methods, yet the derivation supplies only the second-order Fisher-matrix term, without an explicit remainder bound (e.g., via third-derivative Lipschitz constants of the log-density or score function) or a concrete radius expressed in the Fisher metric. Without this, it is impossible to verify that the residual displacements chosen to improve the policy remain inside the region where the quadratic model is faithful.

    Authors: We agree with the referee that an explicit remainder bound would make the claim more rigorous. The derivation in the paper uses the second-order Taylor expansion of the transformed density, leading to the Fisher quadratic term. Under the assumption that the log-density has bounded third derivatives (Lipschitz continuous Hessian), the remainder can be bounded using standard Taylor remainder theorems. We will add a new proposition in the method section that provides this bound and specifies the radius in the Fisher metric. This will clarify the provable neighborhood and ensure residual displacements stay within it. We will also update the abstract if necessary to reference this (a sketch of the bound's standard form follows these responses). Revision: yes.

  2. Referee: [Method derivation] The local quadratic approximation of the KL objective (governed by the Fisher matrix induced by the flow velocity's score) is presented as following from density transformation analysis, but the manuscript does not state the precise conditions under which the cubic and higher terms are negligible relative to the quadratic term for finite displacements; this leaves open whether the operating regime exits the valid neighborhood.

    Authors: We acknowledge that the precise conditions for neglecting higher-order terms are not explicitly stated. The analysis assumes small residual displacements for the local map. To address this, we will include a discussion in the method section on the validity conditions, such as requiring the displacement norm in the Fisher metric to be sufficiently small relative to the inverse of the Lipschitz constant of the third derivatives of the log-density. This ensures the cubic terms remain negligible compared to the quadratic term. We will also provide guidance on how the optimization procedure keeps the displacements within this regime. Revision: yes.
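
For concreteness, the standard Taylor-remainder form such a proposition would take (our sketch, under the Lipschitz-Hessian assumption both responses invoke; the authors' final statement may differ): if $\nabla_a^2 \log \pi_\beta$ is $L_3$-Lipschitz in $a$, then for a displacement $\delta$,

$$\Big|\, D_{\mathrm{KL}}\big(\pi_\beta \,\|\, T_\#\pi_\beta\big) - \tfrac{1}{2}\,\delta^\top F\,\delta \,\Big| \;\le\; \tfrac{L_3}{6}\,\mathbb{E}_{\pi_\beta}\big[\|\delta\|^3\big],$$

so the quadratic model is trustworthy precisely on the ball where the cubic term stays dominated, e.g. $\|\delta\|_F := \sqrt{\delta^\top F\,\delta} \le r$ for a radius $r$ set by $L_3$ and the smallest eigenvalue of $F$.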

Circularity Check

0 steps flagged

No circularity: derivation uses standard density transformation and Fisher quadratic without reduction to inputs

Full rationale

The paper derives its local quadratic approximation of the KL objective from an analysis of the induced density transformation under a residual-displacement transport map, using the score function embedded in the flow velocity to obtain the Fisher-governed form. This follows a standard second-order Taylor expansion around the base policy and does not reduce by construction to a fitted parameter, a self-definition, or a self-citation chain. No equation in the abstract or description equates the final result to its inputs tautologically, renames a known pattern, or smuggles in an ansatz via prior self-work. The controllable-approximation-error claim rests on the local-neighborhood assumption rather than circular logic, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; the ledger is populated from the stated geometric assumptions and approximation steps. No explicit free parameters or invented entities are named, but the local quadratic regime is an unverified modeling choice.

axioms (2)
  • domain assumption: the behavioral policy manifold is inherently anisotropic
    Invoked to motivate the mismatch with isotropic L2 regularization.
  • ad hoc to paper: the local quadratic approximation of the KL objective governed by the Fisher matrix is valid near the current policy
    The central modeling step that enables the tractable anisotropic formulation.

pith-pipeline@v0.9.0 · 5577 in / 1331 out tokens · 23659 ms · 2026-05-10T05:23:52.016616+00:00 · methodology


Reference graph

Works this paper leans on

84 extracted references · 38 canonical work pages · 16 internal anchors

  1. [1] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  2. [2] Yuwei Fu, Di Wu, and Benoit Boulet. A closer look at offline RL agents. Advances in Neural Information Processing Systems, 35:8591–8604, 2022.
  3. [3] Rafael Figueiredo Prudencio, Marcos ROA Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, 2023.
  4. [4] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
  5. [5] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. arXiv preprint arXiv:2410.20092, 2024.
  6. [6] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
  7. [7] Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score regularized policy optimization through diffusion behavior. arXiv preprint arXiv:2310.07297, 2023.
  8. [8] Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
  9. [9] Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning. arXiv preprint arXiv:2503.04975, 2025.
  10. [10] Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In International Conference on Machine Learning, pages 22825–22855. PMLR, 2023.
  11. [11] Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. Advances in Neural Information Processing Systems, 36:67195–67212, 2023.
  12. [12] Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via Q-weighted variational policy optimization. Advances in Neural Information Processing Systems, 37:53945–53968, 2024.
  13. [13] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
  14. [14] Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. arXiv preprint arXiv:2309.16984, 2023.
  15. [15] Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? Advances in Neural Information Processing Systems, 37:79029–79056, 2024.
  16. [16] Nicolas Espinosa-Dice, Kiante Brantley, and Wen Sun. Expressive value learning for scalable offline reinforcement learning. arXiv preprint arXiv:2510.08218, 2025.
  17. [17] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  18. [18] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025.
  19. [19] Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning, 2025.
  20. [20] Songyuan Zhang, Oswin So, HM Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, and Chuchu Fan. Reform: Reflected flows for on-support offline RL via noise manipulation. arXiv preprint arXiv:2602.05051, 2026.
  21. [21] Thanh Nguyen and Chang D. Yoo. One-step flow Q-learning: Addressing the diffusion policy bottleneck in offline reinforcement learning, 2026.
  22. [22] Cédric Villani. Optimal transport: Old and new, volume 338. Springer, 2009.
  23. [23] Zhancun Mu. DeFlow: Decoupling manifold modeling and value maximization for offline policy extraction. arXiv preprint arXiv:2601.10471, 2026.
  24. [24] Zikai Shen, Zonghao Chen, Dimitri Meunier, Ingo Steinwart, Arthur Gretton, and Zhu Li. Nonparametric instrumental variable regression with observed covariates. arXiv preprint arXiv:2511.19404, 2025.
  25. [25] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
  26. [26] Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Conditional flow matching: Simulation-free dynamic optimal transport. arXiv preprint arXiv:2302.00482, 2(3), 2023.
  27. [27] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  28. [28] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
  29. [29] Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
  30. [30] Serge Lang. Differential and Riemannian manifolds. Springer Science & Business Media, 2012.
  31. [31] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.
  32. [32] Shun-ichi Amari. Information geometry and its applications. Springer, 2016.
  33. [33] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet. Statistical optimal transport. Springer, 2025.
  34. [34] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
  35. [35] Terence Tao. Analysis, volume 1. Springer, 2006.
  36. [36] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 2006.
  37. [37] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36:11592–11620, 2023.
  38. [38] Rafael Rafailov, Kyle Hatch, Anikait Singh, Laura Smith, Aviral Kumar, Ilya Kostrikov, Philippe Hansen-Estruch, Victor Kolev, Philip Ball, Jiajun Wu, et al. D5RL: Diverse datasets for data-driven deep reinforcement learning. arXiv preprint arXiv:2408.08441, 2024.
  39. [39] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  40. [40] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.
  41. [41] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. CoRR, abs/2006.09359, 2020.
  42. [42] Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269, 2023.
  43. [43] Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
  44. [44] Harshit Sikchi, Qinqing Zheng, Amy Zhang, and Scott Niekum. Dual RL: Unification and new methods for reinforcement and imitation learning. arXiv preprint arXiv:2302.08560, 2023.
  45. [45] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
  46. [46] Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: Research-oriented deep offline reinforcement learning library. Advances in Neural Information Processing Systems, 36:30997–31020, 2023.
  47. [47] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  48. [48] Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline RL with no OOD actions: In-sample learning via implicit value regularization. arXiv preprint arXiv:2303.15810, 2023.
  49. [49] Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme Q-learning: MaxEnt RL without entropy. arXiv preprint arXiv:2301.02328, 2023.
  50. [50] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  51. [51] Alexander Nikulin, Vladislav Kurenkov, Denis Tarasov, and Sergey Kolesnikov. Anti-exploration by random network distillation. In International Conference on Machine Learning, pages 26228–26244. PMLR, 2023.
  52. [52] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34:1273–1286, 2021.
  53. [53] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
  54. [54] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022.
  55. [55] Yuda Song, Yifei Zhou, Ayush Sekhari, J Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint arXiv:2210.06718, 2022.
  56. [56] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  57. [57] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  58. [58] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  59. [59] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  60. [60] Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, and Glen Berseth. Reasoning with latent diffusion in offline reinforcement learning. arXiv preprint arXiv:2309.06599, 2023.
  61. [61] Xiaoyuan Cheng, Xiaohang Tang, and Yiming Yang. Safe and stable control via Lyapunov-guided diffusion models. arXiv preprint arXiv:2509.25375, 2025.
  62. [62] Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy. arXiv preprint arXiv:2502.00361, 2025.
  63. [63] Xiaoyi Dong, Jian Cheng, and Xi Sheryl Zhang. Maximum entropy reinforcement learning with diffusion policy. arXiv preprint arXiv:2502.11612, 2025.
  64. [64] Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, and Ahmed Touati. Temporal difference flows. arXiv preprint arXiv:2503.09817, 2025.
  65. [65] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024.
  66. [66] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.
  67. [67] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  68. [68] Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
  69. [69] Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From imitation to refinement: residual RL for precise assembly. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 01–08. IEEE, 2025.
  70. [70] Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi Fan, et al. Self-improving vision-language-action models with data generation via residual RL. arXiv preprint arXiv:2511.00091, 2025.
  71. [71] Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630, 2024.
  72. [72] Alfréd Rényi. Probability theory. Courier Corporation, 2007.
  73. [73] Alice Guionnet and Bogusław Zegarliński. Lectures on logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXVI, pages 1–134. Springer, 2004.
  74. [74] Herbert Federer. Geometric measure theory. Springer, 2014.
  75. [75] John Neuberger. Sobolev gradients and differential equations. Springer Science & Business Media, 2009.
  76. [76] Haim Brezis. Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011.
  77. [77] Alan Macdonald. Vector and geometric calculus, volume 12. CreateSpace Independent Publishing Platform, Scotts Valley, CA, USA, 2012.
  78. [78] Zonghao Chen, Atsushi Nitanda, Arthur Gretton, and Taiji Suzuki. Towards a unified analysis of neural networks in nonparametric instrumental variable regression: Optimization and generalization. arXiv preprint arXiv:2511.14710, 2025.
  79. [79] Internal anchor (appendix): Global cancellation of density curvature. We first establish that the total curvature of any valid probability density integrates to zero over its support. Let $\mathcal{A} \subseteq \mathbb{R}^d$ be the action space. By the definition of the Laplacian operator, $\nabla_a^2 \pi_\beta = \nabla \cdot (\nabla_a \pi_\beta)$. According to the Divergence Theorem (or Stokes' Theorem) [77], the integral over $\mathcal{A}$ can be converted …
  80. [80] Internal anchor (appendix): Physical insight, avoiding distributional shift. In our residual policy setting, we refine the behavior policy $\pi_\beta$ via a displacement field $\delta(s, a)$. The curvature term $\nabla_a^2 \pi_\beta / \pi_\beta$ represents local volumetric deformation, the degree to which the action space is locally compressed or stretched. Equation (43) shows that the expectation of this curvature over the e…

Showing first 80 references.