pith. machine review for the scientific record.

arXiv:2604.17919 · v2 · submitted 2026-04-20 · 💻 cs.LG · cs.RO

Recognition: unknown

Fisher Decorator: Refining Flow Policy via a Local Transport Map

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:23 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords: offline reinforcement learning · flow matching · local transport map · Fisher information matrix · policy refinement · anisotropic regularization · KL-constrained objective

The pith

Modeling flow policy refinement as a local transport map with a Fisher-information quadratic yields controllable error near the optimal solution in offline RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-based offline reinforcement learning methods parameterize policies via flow matching yet suffer from a geometric mismatch: their L2 regularization is isotropic and ignores the anisotropic structure of the behavioral policy manifold. The paper reframes policy improvement as an initial flow policy plus a small residual displacement, which acts as a local transport map between distributions. Differentiating the density under this map produces a quadratic approximation to the KL-regularized objective whose curvature is supplied by the Fisher information matrix. The flow velocity already encodes the required score function, turning the problem into a tractable anisotropic quadratic program. If the claim holds, the optimality gap left by prior isotropic bounds shrinks to a provable, controllable size inside a neighborhood of the true optimum.
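
In symbols, a schematic rendering of that reading (our notation, for a constant displacement $\delta$ at a fixed state $s$; the paper's own parameterization may differ): the refined policy is the pushforward of the behavior policy under $T_s(a) = a + \delta$, and the second-order expansion of the KL divergence is

$$D_{\mathrm{KL}}\big(\pi_\beta(\cdot \mid s) \,\big\|\, (T_s)_\#\, \pi_\beta(\cdot \mid s)\big) = \tfrac{1}{2}\, \delta^\top F(s)\, \delta + O(\|\delta\|^3), \qquad F(s) = \mathbb{E}_{a \sim \pi_\beta}\!\big[\nabla_a \log \pi_\beta(a \mid s)\, \nabla_a \log \pi_\beta(a \mid s)^\top\big].$$

The score $\nabla_a \log \pi_\beta$ is exactly the quantity the flow velocity is said to encode, which is why $F(s)$ comes for free rather than requiring a separate density model.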

Core claim

The optimality gap in earlier flow policies arises from their isotropic L2 upper bound on the 2-Wasserstein distance. In contrast, the local transport map formulation induces a density transformation whose first-order effect is captured exactly by a Fisher-information quadratic form. Optimizing under the corresponding quadratic constraint keeps the solution inside a neighborhood where the approximation error remains controllable, directly addressing the misalignment between isotropic regularization and the anisotropic data geometry.
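
To see the misalignment concretely, here is a minimal numerical contrast of the two regularizers (a sketch with hypothetical numbers, not the paper's construction; `Sigma`, `g`, and `lam` are illustrative stand-ins):

    import numpy as np

    # Behavior density: an elongated Gaussian standing in for an anisotropic
    # action manifold; for N(0, Sigma) the Fisher matrix is inv(Sigma).
    Sigma = np.array([[4.0, 0.0], [0.0, 0.04]])
    F = np.linalg.inv(Sigma)
    g = np.array([1.0, 1.0])   # critic gradient grad_a Q at the current action
    lam = 1.0                  # regularization weight

    # Isotropic L2 penalty: argmax_d  g.d - (lam/2)||d||^2  ->  step along g.
    delta_iso = g / lam
    # Fisher quadratic: argmax_d  g.d - (lam/2) d'Fd  ->  natural-gradient step.
    delta_fisher = np.linalg.solve(F, g) / lam

    print(delta_iso)      # [1.   1.  ]  spends equal budget on both axes
    print(delta_fisher)   # [4.   0.04]  moves along the high-density axis

The isotropic step treats both action dimensions identically; the Fisher step concentrates motion along the direction in which the behavioral density actually extends, which is the gradient misalignment the claim formalizes.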

What carries the argument

The local transport map formed by a base flow policy augmented by a residual displacement, whose effect on the induced density is approximated quadratically by the Fisher information matrix extracted from the flow's score function.
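
The phrase "the score embedded in the flow's velocity" has a concrete form for Gaussian probability paths. A minimal 1D check, assuming the common linear path $x_t = (1-t)x_0 + t\,x_1$ with noise $x_0 \sim \mathcal{N}(0,1)$ (a standard flow-matching convention; the paper's parameterization may differ), for which $\nabla \log p_t(x) = \big(t\, v_t(x) - x\big)/(1-t)$:

    import numpy as np

    # Closed-form toy: data x1 ~ N(mu, s2), noise x0 ~ N(0, 1),
    # linear path x_t = (1 - t) * x0 + t * x1, so p_t = N(t*mu, var_t).
    mu, s2, t = 2.0, 0.25, 0.7
    var_t = (1.0 - t) ** 2 + t ** 2 * s2

    def score_true(x):                  # grad_x log p_t(x) for N(t*mu, var_t)
        return -(x - t * mu) / var_t

    def velocity(x):                    # v_t(x) = E[x1 - x0 | x_t = x]
        e_x1 = mu + (t * s2 / var_t) * (x - t * mu)
        e_x0 = ((1.0 - t) / var_t) * (x - t * mu)
        return e_x1 - e_x0

    x = np.linspace(-2.0, 5.0, 8)
    score_from_v = (t * velocity(x) - x) / (1.0 - t)
    print(np.max(np.abs(score_from_v - score_true(x))))   # ~1e-15

The identity means a trained velocity field doubles as a score estimate, so the Fisher quadratic can be assembled from the flow itself.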

If this is right

  • Optimization directions become density-sensitive and aligned with the behavioral manifold rather than isotropic.
  • The approximation error is bounded and controllable by restricting updates to the small-residual neighborhood.
  • The resulting policy achieves state-of-the-art performance on standard offline RL benchmarks.
  • The framework explains why previous L2-based flow methods systematically misalign gradients.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The local-transport perspective could be applied to refine other score-based or flow-based generative models outside RL.
  • Tracking residual size during training offers a built-in diagnostic for when the quadratic approximation begins to degrade.
  • Density-aware quadratic constraints of this form may improve regularization in broader generative modeling tasks.

Load-bearing premise

The local quadratic approximation to the KL objective stays accurate whenever the residual displacement is small enough that higher-order density changes remain negligible.

What would settle it

On a simple benchmark, measure the gap between the quadratic approximation and the true KL divergence while systematically increasing the size of the residual displacement; the gap should remain bounded only inside the predicted small-displacement neighborhood.
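
A minimal version of that experiment (our sketch; the bimodal base density and constant displacement are illustrative choices, not the paper's setup — a Gaussian base would make the quadratic exact and hide the gap):

    import numpy as np

    # Bimodal base density: a non-Gaussian toy where the Fisher quadratic
    # is only locally accurate, so the gap becomes visible as delta grows.
    def log_p(x):
        c1 = np.exp(-0.5 * (x + 1.5) ** 2)
        c2 = np.exp(-0.5 * (x - 1.5) ** 2)
        return np.log(0.5 * (c1 + c2)) - 0.5 * np.log(2.0 * np.pi)

    xs = np.linspace(-12.0, 12.0, 200001)
    dx = xs[1] - xs[0]
    p = np.exp(log_p(xs))

    # Fisher information of the location family p(x - delta).
    score = np.gradient(log_p(xs), dx)
    fisher = np.sum(p * score ** 2) * dx

    for delta in (0.05, 0.1, 0.2, 0.4, 0.8):
        kl = np.sum(p * (log_p(xs) - log_p(xs - delta))) * dx   # true KL
        quad = 0.5 * fisher * delta ** 2                        # quadratic model
        print(f"delta={delta:4.2f}  KL={kl:.5f}  quad={quad:.5f}")

If the paper is right, the two columns should agree to third order for small displacements and diverge once delta leaves the predicted neighborhood.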

Figures

Figures reproduced from arXiv:2604.17919 by Haoyu Wang, Li Zeng, Wenxuan Yuan, Xiaoyuan Cheng, Zhuo Sun, Ziyan Wang, Zonghao Chen.

Figure 1. Geometric interpretation of offline policy optimization.
Figure 2. Comparison of flow policy refinement paradigms.
Figure 3. Isotropic vs. anisotropic policy refinement.
Figure 4. Offline-to-online fine-tuning performance.
Figure 5. Ablations on the perturbed time $t_\varepsilon$. The caption argues this hyperparameter can be determined from first principles rather than heuristic tuning: analyzing the trade-off between approximation bias and numerical error (Appendix C.3) yields an optimal scaling $\varepsilon^* \sim O(\delta_{\mathrm{FP32}}^{1/4})$, which depends on both machine precision and …
Figure 6. Overview of benchmark tasks; the evaluation spans a diverse set of environments.
Figure 7. Additional examples of isotropic and anisotropic policy refinement.
Original abstract

Recent advances in flow-based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade-offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the $L_2$ regularization as an upper bound of the 2-Wasserstein distance ($W_2$), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the $L_2$ (or upper bound of $W_2$) regularization is isotropic and density-insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state-of-the-art performance across diverse offline RL benchmarks. See project page: https://github.com/ARC0127/Fisher-Decorator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Fisher Decorator method for refining flow-based policies in offline RL. It formulates policy improvement as augmenting an initial flow policy with a residual displacement (a local transport map), analyzes the induced density transformation to obtain a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, and leverages the embedded score function of the flow velocity to derive a corresponding quadratic constraint. This is positioned as addressing the geometric mismatch between isotropic L2 regularization (an upper bound on W2) and the anisotropic behavioral policy manifold, with the claim that the resulting optimality gap is controllable within a provable neighborhood of the optimum. Experiments report state-of-the-art results on standard offline RL benchmarks.

Significance. If the local quadratic approximation is shown to have controllable error with an explicit neighborhood, the work would provide a geometrically principled alternative to isotropic regularization in flow-matching offline RL. The technical device of extracting the quadratic constraint directly from the flow velocity's score function is a strength that could improve both efficiency and alignment with the policy manifold's anisotropy. This could influence subsequent work on regularized flow policies by emphasizing density-sensitive, Fisher-based local models over L2 penalties.

major comments (2)
  1. [Abstract] The central claim that 'our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution' is load-bearing for the contrast with prior isotropic methods, yet the derivation supplies only the second-order Fisher-matrix term, without an explicit remainder bound (e.g., via third-derivative Lipschitz constants of the log-density or score function) or a concrete radius expressed in the Fisher metric. Without this, it is impossible to verify that the residual displacements chosen to improve the policy remain inside the region where the quadratic model is faithful.
  2. [Method derivation] The local quadratic approximation of the KL objective (governed by the Fisher matrix induced by the flow velocity's score) is presented as following from density transformation analysis, but the manuscript does not state the precise conditions under which the cubic and higher terms are negligible relative to the quadratic term for finite displacements; this leaves open whether the operating regime exits the valid neighborhood.
minor comments (2)
  1. The abstract refers to a project page for code; the main text should include at least a brief reproducibility statement or pseudocode for the quadratic-constraint optimization step (a hedged sketch follows this list).
  2. Notation for the residual displacement and the induced density transformation should be introduced with a clear diagram or equation reference early in the method section to aid readability.
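
On minor comment 1: one plausible shape for that pseudocode, as a hedged sketch (the closed form of $\max_\delta\, g^\top \delta$ subject to $\tfrac{1}{2}\delta^\top F \delta \le \epsilon$ is a rescaled natural-gradient step; `refine_action`, `eps`, and `ridge` are our hypothetical names, not the authors' code):

    import numpy as np

    def refine_action(a, grad_q, fisher, eps=0.01, ridge=1e-6):
        """Maximize grad_q . delta subject to 0.5 * delta' F delta <= eps."""
        F = fisher + ridge * np.eye(len(a))      # stabilize the solve
        direction = np.linalg.solve(F, grad_q)   # F^{-1} grad_q
        quad = 0.5 * direction @ F @ direction   # constraint value at direction
        scale = np.sqrt(eps / quad)              # rescale onto the boundary
        return a + scale * direction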

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our paper. We address the major concerns regarding the theoretical guarantees of our local quadratic approximation below. We will revise the manuscript to include explicit bounds and conditions as suggested.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution' is load-bearing for the contrast with prior isotropic methods, yet the derivation supplies only the second-order Fisher-matrix term, without an explicit remainder bound (e.g., via third-derivative Lipschitz constants of the log-density or score function) or a concrete radius expressed in the Fisher metric. Without this, it is impossible to verify that the residual displacements chosen to improve the policy remain inside the region where the quadratic model is faithful.

    Authors: We agree with the referee that an explicit remainder bound would make the claim more rigorous. The derivation in the paper uses the second-order Taylor expansion of the transformed density, leading to the Fisher quadratic term. Under the assumption that the log-density has bounded third derivatives (Lipschitz continuous Hessian), the remainder can be bounded using standard Taylor remainder theorems. We will add a new proposition in the method section that provides this bound and specifies the radius in the Fisher metric. This will clarify the provable neighborhood and ensure residual displacements stay within it. We will also update the abstract if necessary to reference this (a sketch of the bound's standard form follows these responses). Revision: yes.

  2. Referee: [Method derivation] The local quadratic approximation of the KL objective (governed by the Fisher matrix induced by the flow velocity's score) is presented as following from density transformation analysis, but the manuscript does not state the precise conditions under which the cubic and higher terms are negligible relative to the quadratic term for finite displacements; this leaves open whether the operating regime exits the valid neighborhood.

    Authors: We acknowledge that the precise conditions for neglecting higher-order terms are not explicitly stated. The analysis assumes small residual displacements for the local map. To address this, we will include a discussion in the method section on the validity conditions, such as requiring the displacement norm in the Fisher metric to be sufficiently small relative to the inverse of the Lipschitz constant of the third derivatives of the log-density. This ensures the cubic terms remain negligible compared to the quadratic term. We will also provide guidance on how the optimization procedure keeps the displacements within this regime. Revision: yes.
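
For concreteness, the standard Taylor-remainder form such a proposition would take (our sketch, under the Lipschitz-Hessian assumption both responses invoke; the authors' final statement may differ): if $\nabla_a^2 \log \pi_\beta$ is $L_3$-Lipschitz in $a$, then for a displacement $\delta$,

$$\Big|\, D_{\mathrm{KL}}\big(\pi_\beta \,\|\, T_\#\pi_\beta\big) - \tfrac{1}{2}\,\delta^\top F\,\delta \,\Big| \;\le\; \tfrac{L_3}{6}\,\mathbb{E}_{\pi_\beta}\big[\|\delta\|^3\big],$$

so the quadratic model is trustworthy precisely on the ball where the cubic term stays dominated, e.g. $\|\delta\|_F := \sqrt{\delta^\top F\,\delta} \le r$ for a radius $r$ set by $L_3$ and the smallest eigenvalue of $F$.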

Circularity Check

0 steps flagged

No circularity: derivation uses standard density transformation and Fisher quadratic without reduction to inputs

Full rationale

The paper derives its local quadratic approximation of the KL objective from an analysis of the induced density transformation under a residual-displacement transport map, using the score function embedded in the flow velocity to obtain the Fisher-governed form. This follows a standard second-order Taylor expansion around the base policy and does not reduce by construction to a fitted parameter, a self-definition, or a self-citation chain. No equation in the abstract or description equates the final result to its inputs tautologically, renames a known pattern, or smuggles in an ansatz via prior self-work. The controllable-approximation-error claim rests on the local-neighborhood assumption rather than circular logic, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; the ledger is populated from the stated geometric assumptions and approximation steps. No explicit free parameters or invented entities are named, but the local quadratic regime is an unverified modeling choice.

axioms (2)
  • domain assumption: the behavioral policy manifold is inherently anisotropic
    Invoked to motivate the mismatch with isotropic L2 regularization.
  • ad hoc to paper: the local quadratic approximation of the KL objective governed by the Fisher matrix is valid near the current policy
    The central modeling step that enables the tractable anisotropic formulation.

pith-pipeline@v0.9.0 · 5577 in / 1331 out tokens · 23659 ms · 2026-05-10T05:23:52.016616+00:00 · methodology


Reference graph

Works this paper leans on

84 extracted references · 38 canonical work pages · 16 internal anchors

  1. [1] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  2. [2] Yuwei Fu, Di Wu, and Benoit Boulet. A closer look at offline RL agents. Advances in Neural Information Processing Systems, 35:8591–8604, 2022.
  3. [3] Rafael Figueiredo Prudencio, Marcos ROA Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, 2023.
  4. [4] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
  5. [5] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. arXiv preprint arXiv:2410.20092, 2024.
  6. [6] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
  7. [7] Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score regularized policy optimization through diffusion behavior. arXiv preprint arXiv:2310.07297, 2023.
  8. [8] Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
  9. [9] Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning. arXiv preprint arXiv:2503.04975, 2025.
  10. [10] Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In International Conference on Machine Learning, pages 22825–22855. PMLR, 2023.
  11. [11] Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. Advances in Neural Information Processing Systems, 36:67195–67212, 2023.
  12. [12] Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via Q-weighted variational policy optimization. Advances in Neural Information Processing Systems, 37:53945–53968, 2024.
  13. [13] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
  14. [14] Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. arXiv preprint arXiv:2309.16984, 2023.
  15. [15] Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? Advances in Neural Information Processing Systems, 37:79029–79056, 2024.
  16. [16] Nicolas Espinosa-Dice, Kiante Brantley, and Wen Sun. Expressive value learning for scalable offline reinforcement learning. arXiv preprint arXiv:2510.08218, 2025.
  17. [17] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  18. [18] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025.
  19. [19] Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning, 2025.
  20. [20] Songyuan Zhang, Oswin So, HM Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, and Chuchu Fan. Reform: Reflected flows for on-support offline RL via noise manipulation. arXiv preprint arXiv:2602.05051, 2026.
  21. [21] Thanh Nguyen and Chang D. Yoo. One-step flow Q-learning: Addressing the diffusion policy bottleneck in offline reinforcement learning, 2026.
  22. [22] Cédric Villani. Optimal transport: Old and new, volume 338. Springer, 2009.
  23. [23] Zhancun Mu. DeFlow: Decoupling manifold modeling and value maximization for offline policy extraction. arXiv preprint arXiv:2601.10471, 2026.
  24. [24] Zikai Shen, Zonghao Chen, Dimitri Meunier, Ingo Steinwart, Arthur Gretton, and Zhu Li. Nonparametric instrumental variable regression with observed covariates. arXiv preprint arXiv:2511.19404, 2025.
  25. [25] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
  26. [26] Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Conditional flow matching: Simulation-free dynamic optimal transport. arXiv preprint arXiv:2302.00482, 2(3), 2023.
  27. [27] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  28. [28] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
  29. [29] Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
  30. [30] Serge Lang. Differential and Riemannian manifolds. Springer Science & Business Media, 2012.
  31. [31] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.
  32. [32] Shun-ichi Amari. Information geometry and its applications. Springer, 2016.
  33. [33] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet. Statistical optimal transport. Springer, 2025.
  34. [34] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
  35. [35] Terence Tao. Analysis, volume 1. Springer, 2006.
  36. [36] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 2006.
  37. [37] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36:11592–11620, 2023.
  38. [38] Rafael Rafailov, Kyle Hatch, Anikait Singh, Laura Smith, Aviral Kumar, Ilya Kostrikov, Philippe Hansen-Estruch, Victor Kolev, Philip Ball, Jiajun Wu, et al. D5RL: Diverse datasets for data-driven deep reinforcement learning. arXiv preprint arXiv:2408.08441, 2024.
  39. [39] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  40. [40] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.
  41. [41] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. CoRR, abs/2006.09359, 2020.
  42. [42] Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269, 2023.
  43. [43] Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
  44. [44] Harshit Sikchi, Qinqing Zheng, Amy Zhang, and Scott Niekum. Dual RL: Unification and new methods for reinforcement and imitation learning. arXiv preprint arXiv:2302.08560, 2023.
  45. [45] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
  46. [46] Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: Research-oriented deep offline reinforcement learning library. Advances in Neural Information Processing Systems, 36:30997–31020, 2023.
  47. [47] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  48. [48] Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline RL with no OOD actions: In-sample learning via implicit value regularization. arXiv preprint arXiv:2303.15810, 2023.
  49. [49] Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme Q-learning: MaxEnt RL without entropy. arXiv preprint arXiv:2301.02328, 2023.
  50. [50] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  51. [51] Alexander Nikulin, Vladislav Kurenkov, Denis Tarasov, and Sergey Kolesnikov. Anti-exploration by random network distillation. In International Conference on Machine Learning, pages 26228–26244. PMLR, 2023.
  52. [52] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34:1273–1286, 2021.
  53. [53] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
  54. [54] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022.
  55. [55] Yuda Song, Yifei Zhou, Ayush Sekhari, J Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint arXiv:2210.06718, 2022.
  56. [56] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  57. [57] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  58. [58] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  59. [59] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  60. [60] Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, and Glen Berseth. Reasoning with latent diffusion in offline reinforcement learning. arXiv preprint arXiv:2309.06599, 2023.
  61. [61] Xiaoyuan Cheng, Xiaohang Tang, and Yiming Yang. Safe and stable control via Lyapunov-guided diffusion models. arXiv preprint arXiv:2509.25375, 2025.
  62. [62] Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy. arXiv preprint arXiv:2502.00361, 2025.
  63. [63] Xiaoyi Dong, Jian Cheng, and Xi Sheryl Zhang. Maximum entropy reinforcement learning with diffusion policy. arXiv preprint arXiv:2502.11612, 2025.
  64. [64] Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, and Ahmed Touati. Temporal difference flows. arXiv preprint arXiv:2503.09817, 2025.
  65. [65] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024.
  66. [66] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.
  67. [67] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  68. [68] Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
  69. [69] Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From imitation to refinement: residual RL for precise assembly. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 01–08. IEEE, 2025.
  70. [70] Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi Fan, et al. Self-improving vision-language-action models with data generation via residual RL. arXiv preprint arXiv:2511.00091, 2025.
  71. [71] Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630, 2024.
  72. [72] Alfréd Rényi. Probability theory. Courier Corporation, 2007.
  73. [73] Alice Guionnet and Bogusław Zegarliński. Lectures on logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXVI, pages 1–134. Springer, 2004.
  74. [74] Herbert Federer. Geometric measure theory. Springer, 2014.
  75. [75] John Neuberger. Sobolev gradients and differential equations. Springer Science & Business Media, 2009.
  76. [76] Haim Brezis. Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011.
  77. [77] Alan Macdonald. Vector and geometric calculus, volume 12. CreateSpace Independent Publishing Platform, Scotts Valley, CA, USA, 2012.
  78. [78] Zonghao Chen, Atsushi Nitanda, Arthur Gretton, and Taiji Suzuki. Towards a unified analysis of neural networks in nonparametric instrumental variable regression: Optimization and generalization. arXiv preprint arXiv:2511.14710, 2025.
  79. [79] Internal anchor (appendix): Global cancellation of density curvature. We first establish that the total curvature of any valid probability density integrates to zero over its support. Let $\mathcal{A} \subseteq \mathbb{R}^d$ be the action space. By the definition of the Laplacian operator, $\nabla_a^2 \pi_\beta = \nabla \cdot (\nabla_a \pi_\beta)$. According to the Divergence Theorem (or Stokes' Theorem) [77], the integral over $\mathcal{A}$ can be converted …
  80. [80] Internal anchor (appendix): Physical insight, avoiding distributional shift. In our residual policy setting, we refine the behavior policy $\pi_\beta$ via a displacement field $\delta(s, a)$. The curvature term $\nabla_a^2 \pi_\beta / \pi_\beta$ represents local volumetric deformation, the degree to which the action space is locally compressed or stretched. Equation (43) shows that the expectation of this curvature over the e…

Showing first 80 references.