Fisher Decorator: Refining Flow Policy via a Local Transport Map
Pith reviewed 2026-05-10 05:23 UTC · model grok-4.3
The pith
Modeling flow policy refinement as a local transport map with a Fisher-information quadratic yields controllable error near the optimal solution in offline RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The optimality gap in earlier flow policies arises from their isotropic L2 upper bound on the 2-Wasserstein distance. In contrast, the local transport map formulation induces a density transformation whose first-order effect is captured exactly by a Fisher-information quadratic form. Optimizing under the corresponding quadratic constraint keeps the solution inside a neighborhood where the approximation error remains controllable, directly addressing the misalignment between isotropic regularization and the anisotropic data geometry.
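For reference, the Fisher quadratic invoked here is the standard second-order expansion of the KL divergence under a small perturbation (a textbook identity, restated here rather than taken from the paper's own derivation):

```latex
\mathrm{KL}\!\left(\pi_{\theta+\delta}\,\|\,\pi_{\theta}\right)
  = \tfrac{1}{2}\,\delta^{\top} F(\theta)\,\delta + O(\|\delta\|^{3}),
\qquad
F(\theta) = \mathbb{E}_{a\sim\pi_{\theta}}\!\left[\nabla_{\theta}\log\pi_{\theta}(a)\,\nabla_{\theta}\log\pi_{\theta}(a)^{\top}\right].
```

The paper applies the same expansion to the residual displacement of the transport map (perturbing actions rather than parameters), which is why the score function embedded in the flow velocity suffices to build the constraint.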
What carries the argument
The local transport map formed by a base flow policy augmented by a residual displacement, whose effect on the induced density is approximated quadratically by the Fisher information matrix extracted from the flow's score function.
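To make the object concrete, here is a minimal numerical sketch (all of it illustrative: a 2-D anisotropic Gaussian stands in for the flow policy's output density, so the score and Fisher matrix are available in closed form; none of these names come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical behavior density: an anisotropic Gaussian N(0, cov), standing
# in for the flow policy's output distribution over actions.
cov = np.array([[1.0, 0.0], [0.0, 0.04]])
cov_inv = np.linalg.inv(cov)

def score(a):
    """Closed-form score of N(0, cov): grad_a log pi(a) = -cov^{-1} a."""
    return -a @ cov_inv

def transport(a, delta):
    """Local transport map: base action plus a residual displacement."""
    return a + delta

# Fisher matrix of the displacement, estimated from score samples as
# E[score score^T]; for this Gaussian it equals cov^{-1} exactly.
actions = rng.multivariate_normal(np.zeros(2), cov, size=200_000)
s = score(actions)
F_hat = s.T @ s / len(actions)

# Fisher quadratic for a residual shift; for a Gaussian shift this is the
# exact KL, and it charges the low-variance axis 25x more than the other.
delta = np.array([0.1, 0.1])
kl_quadratic = 0.5 * delta @ F_hat @ delta
```

The same isotropic displacement costs 0.5 * (0.01 * 1) along the high-variance axis but 0.5 * (0.01 * 25) along the low-variance one, which is the geometric mismatch the paper attributes to isotropic L2 regularization.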
If this is right
- Optimization directions become density-sensitive and aligned with the behavioral manifold rather than isotropic.
- The approximation error is bounded and controllable by restricting updates to the small-residual neighborhood.
- The resulting policy achieves state-of-the-art performance on standard offline RL benchmarks.
- The framework explains why previous L2-based flow methods systematically misalign gradients.
Where Pith is reading between the lines
- The local-transport perspective could be applied to refine other score-based or flow-based generative models outside RL.
- Tracking residual size during training offers a built-in diagnostic for when the quadratic approximation begins to degrade.
- Density-aware quadratic constraints of this form may improve regularization in broader generative modeling tasks.
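The residual-size diagnostic suggested above can be sketched directly (a hypothetical helper, not from the paper: the Fisher matrix is estimated from score samples and the residual's squared Fisher norm is compared to a fixed budget):

```python
import numpy as np

def fisher_residual_norm(delta, scores):
    """Squared Fisher norm delta^T F delta, with F estimated as the
    empirical second moment of score samples, E[s s^T]."""
    F_hat = scores.T @ scores / len(scores)
    return float(delta @ F_hat @ delta)

# Toy usage: scores from a standard normal, so F is approximately the
# identity, and a residual that grows over "training steps".
rng = np.random.default_rng(1)
scores = rng.standard_normal((100_000, 3))
budget = 0.05  # trust budget for the quadratic model (hypothetical value)
for step in range(5):
    delta = 0.05 * (step + 1) * np.ones(3)
    sq_norm = fisher_residual_norm(delta, scores)
    status = "ok" if sq_norm <= budget else "quadratic model suspect"
```

Once the norm crosses the budget, updates have left the neighborhood where the second-order model of the KL can be trusted, which is exactly the failure mode the load-bearing premise flags.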
Load-bearing premise
The local quadratic approximation to the KL objective stays accurate whenever the residual displacement is small enough that higher-order density changes remain negligible.
What would settle it
On a simple benchmark, measure the gap between the quadratic approximation and the true KL divergence while systematically increasing the residual displacement; the gap should stay small inside the predicted small-displacement neighborhood and grow rapidly once the displacement leaves it.
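A minimal version of this check can be run on a 1-D toy density (everything here is illustrative: a bimodal Gaussian mixture stands in for the behavior policy, and the Fisher information of a pure location shift is computed on a grid):

```python
import numpy as np

# 1-D stand-in for the behavior density: a bimodal Gaussian mixture.
x = np.linspace(-10.0, 10.0, 20_001)
dx = x[1] - x[0]

def gauss(x, mu, sd):
    return np.exp(-(x - mu) ** 2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))

def mixture(x, shift=0.0):
    return 0.5 * gauss(x, -1.5 + shift, 0.6) + 0.5 * gauss(x, 1.5 + shift, 0.6)

p = mixture(x)

# Fisher information of a pure location shift: I = integral of (p')^2 / p dx.
dp = np.gradient(p, dx)
fisher = float((dp**2 / p).sum() * dx)

def true_kl(shift):
    """KL( p(x - shift) || p(x) ), computed numerically on the grid."""
    q = mixture(x, shift)
    return float((q * np.log(q / p)).sum() * dx)

# Gap between the true KL and the Fisher quadratic as the displacement grows.
gaps = {}
for delta in (0.05, 0.1, 0.2, 0.4, 0.8):
    gaps[delta] = abs(true_kl(delta) - 0.5 * fisher * delta**2)
```

Because this density is symmetric, the gap scales like the fourth power of the shift near zero and then departs sharply, matching the predicted neighborhood behavior.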
Original abstract
Recent advances in flow-based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade-offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the $L_2$ regularization as an upper bound of the 2-Wasserstein distance ($W_2$), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the $L_2$ (or upper bound of $W_2$) regularization is isotropic and density-insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state-of-the-art performance across diverse offline RL benchmarks. See project page: https://github.com/ARC0127/Fisher-Decorator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Fisher Decorator method for refining flow-based policies in offline RL. It formulates policy improvement as augmenting an initial flow policy with a residual displacement (a local transport map), analyzes the induced density transformation to obtain a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, and leverages the embedded score function of the flow velocity to derive a corresponding quadratic constraint. This is positioned as addressing the geometric mismatch between isotropic L2 regularization (an upper bound on W2) and the anisotropic behavioral policy manifold, with the claim that the resulting optimality gap is controllable within a provable neighborhood of the optimum. Experiments report state-of-the-art results on standard offline RL benchmarks.
Significance. If the local quadratic approximation is shown to have controllable error with an explicit neighborhood, the work would provide a geometrically principled alternative to isotropic regularization in flow-matching offline RL. The technical device of extracting the quadratic constraint directly from the flow velocity's score function is a strength that could improve both efficiency and alignment with the policy manifold's anisotropy. This could influence subsequent work on regularized flow policies by emphasizing density-sensitive, Fisher-based local models over L2 penalties.
major comments (2)
- [Abstract] The central claim that 'our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution' is load-bearing for the contrast with prior isotropic methods, yet the derivation supplies only the second-order Fisher-matrix term, without an explicit remainder bound (e.g., via third-derivative Lipschitz constants of the log-density or score function) or a concrete radius expressed in the Fisher metric. Without these, it is impossible to verify that the residual displacements chosen to improve the policy remain inside the region where the quadratic model is faithful.
- [Method derivation] The local quadratic approximation of the KL objective (governed by the Fisher matrix induced by the flow velocity's score) is presented as following from density transformation analysis, but the manuscript does not state the precise conditions under which the cubic and higher terms are negligible relative to the quadratic term for finite displacements; this leaves open whether the operating regime exits the valid neighborhood.
minor comments (2)
- The abstract refers to a project page for code; the main text should include at least a brief reproducibility statement or pseudocode for the quadratic-constraint optimization step.
- Notation for the residual displacement and the induced density transformation should be introduced with a clear diagram or equation reference early in the method section to aid readability.
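On the reproducibility point: the paper's own pseudocode is not reproduced here, but the generic quadratic-constraint step it describes (maximize a linear improvement subject to a Fisher quadratic trust region) has a closed-form solution, sketched below with hypothetical names; whether this matches the paper's exact procedure is an assumption.

```python
import numpy as np

def quadratic_constrained_step(grad, fisher, eps, damping=1e-6):
    """Solve  max_d  grad^T d   s.t.   0.5 * d^T F d <= eps.

    Stationarity of the Lagrangian gives d proportional to F^{-1} grad
    (a natural-gradient direction); the scale makes the constraint tight.
    """
    F = fisher + damping * np.eye(len(grad))
    d = np.linalg.solve(F, grad)                  # unscaled direction
    scale = np.sqrt(2.0 * eps / max(float(d @ F @ d), 1e-12))
    return scale * d

# Toy check with an anisotropic Fisher: the step is shrunk along the
# high-curvature (low-variance) axis relative to a plain gradient step.
g = np.array([1.0, 1.0])
F = np.diag([1.0, 25.0])
d = quadratic_constrained_step(g, F, eps=0.01)
```

The damping term guards against a singular Fisher estimate; the returned step satisfies the quadratic constraint with equality.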
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our paper. We address the major concerns regarding the theoretical guarantees of our local quadratic approximation below. We will revise the manuscript to include explicit bounds and conditions as suggested.
Point-by-point responses
Referee: [Abstract] The central claim that 'our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution' is load-bearing for the contrast with prior isotropic methods, yet the derivation supplies only the second-order Fisher-matrix term, without an explicit remainder bound (e.g., via third-derivative Lipschitz constants of the log-density or score function) or a concrete radius expressed in the Fisher metric. Without these, it is impossible to verify that the residual displacements chosen to improve the policy remain inside the region where the quadratic model is faithful.
Authors: We agree with the referee that an explicit remainder bound would make the claim more rigorous. The derivation in the paper uses the second-order Taylor expansion of the transformed density, leading to the Fisher quadratic term. Under the assumption that the log-density has bounded third derivatives (Lipschitz continuous Hessian), the remainder can be bounded using standard Taylor remainder theorems. We will add a new proposition in the method section that provides this bound and specifies the radius in the Fisher metric. This will clarify the provable neighborhood and ensure residual displacements stay within it. We will also update the abstract if necessary to reference this. Revision: yes.
Referee: [Method derivation] The local quadratic approximation of the KL objective (governed by the Fisher matrix induced by the flow velocity's score) is presented as following from density transformation analysis, but the manuscript does not state the precise conditions under which the cubic and higher terms are negligible relative to the quadratic term for finite displacements; this leaves open whether the operating regime exits the valid neighborhood.
Authors: We acknowledge that the precise conditions for neglecting higher-order terms are not explicitly stated. The analysis assumes small residual displacements for the local map. To address this, we will include a discussion in the method section on the validity conditions, such as requiring the displacement norm in the Fisher metric to be sufficiently small relative to the inverse of the Lipschitz constant of the third derivatives of the log-density. This ensures the cubic terms remain negligible compared to the quadratic term. We will also provide guidance on how the optimization procedure keeps the displacements within this regime. Revision: yes.
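In standard Taylor-remainder form, the bound the authors commit to adding would read something like the following (our reconstruction under the stated Lipschitz-Hessian assumption, not the paper's actual proposition; $L_{3}$ bounds the third derivatives of the log-density):

```latex
\left|\, \mathrm{KL}\!\left(\pi_{\beta,\delta} \,\middle\|\, \pi_{\beta}\right)
  - \tfrac{1}{2}\,\delta^{\top} F\, \delta \,\right|
  \;\le\; \frac{L_{3}}{6}\, \lVert \delta \rVert^{3},
\qquad \lVert \delta \rVert \le r .
```

The cubic remainder is then negligible relative to the quadratic term precisely when $\lVert\delta\rVert \ll 3\,\lambda_{\min}(F)/L_{3}$, which matches the smallness condition stated in the response.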
Circularity Check
No circularity: the derivation uses a standard density transformation and the Fisher quadratic form, and does not reduce to its own inputs.
Full rationale
The paper derives its local quadratic approximation of the KL objective from an analysis of the induced density transformation under a residual-displacement transport map, using the score function embedded in the flow velocity to obtain the Fisher-governed form. This follows a standard second-order Taylor expansion around the base policy and does not reduce by construction to a fitted parameter, a self-definition, or a self-citation chain. No equation in the abstract or description tautologically equates the final result to its inputs, renames a known pattern, or smuggles in an ansatz via prior self-work. The claim of controllable approximation error rests on the local-neighborhood assumption rather than on circular logic, so the derivation is self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: the behavioral policy manifold is inherently anisotropic.
- Ad hoc to this paper: the local quadratic approximation of the KL objective, governed by the Fisher matrix, is valid near the current policy.
Reference graph
Works this paper leans on
- [1] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- [2] Yuwei Fu, Di Wu, and Benoit Boulet. A closer look at offline RL agents. Advances in Neural Information Processing Systems, 35:8591–8604, 2022.
- [3] Rafael Figueiredo Prudencio, Marcos ROA Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, 2023.
- [4] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
- [5] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. arXiv preprint arXiv:2410.20092, 2024.
- [6] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
- [7] Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score regularized policy optimization through diffusion behavior. arXiv preprint arXiv:2310.07297, 2023.
- [8] Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
- [9] (entry missing in the extracted list)
- [10] Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In International Conference on Machine Learning, pages 22825–22855. PMLR, 2023.
- [11] Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. Advances in Neural Information Processing Systems, 36:67195–67212, 2023.
- [12] Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via Q-weighted variational policy optimization. Advances in Neural Information Processing Systems, 37:53945–53968, 2024.
- [13] Zhendong Wang, Jonathan J. Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
- [14] Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. arXiv preprint arXiv:2309.16984, 2023.
- [15] Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? Advances in Neural Information Processing Systems, 37:79029–79056, 2024.
- [16] Nicolas Espinosa-Dice, Kiante Brantley, and Wen Sun. Expressive value learning for scalable offline reinforcement learning. arXiv preprint arXiv:2510.08218, 2025.
- [17] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- [18] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025.
- [19] Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning, 2025.
- [20] Songyuan Zhang, Oswin So, HM Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, and Chuchu Fan. ReForm: Reflected flows for on-support offline RL via noise manipulation. arXiv preprint arXiv:2602.05051, 2026.
- [21] Thanh Nguyen and Chang D. Yoo. One-step flow Q-learning: Addressing the diffusion policy bottleneck in offline reinforcement learning, 2026.
- [22] Cédric Villani et al. Optimal Transport: Old and New, volume 338. Springer, 2009.
- [23] Zhancun Mu. DeFlow: Decoupling manifold modeling and value maximization for offline policy extraction. arXiv preprint arXiv:2601.10471, 2026.
- [24] Zikai Shen, Zonghao Chen, Dimitri Meunier, Ingo Steinwart, Arthur Gretton, and Zhu Li. Nonparametric instrumental variable regression with observed covariates. arXiv preprint arXiv:2511.19404, 2025.
- [25] Richard S. Sutton, Andrew G. Barto, et al. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
- [26] Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Conditional flow matching: Simulation-free dynamic optimal transport. arXiv preprint arXiv:2302.00482, 2(3), 2023.
- [27] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [28] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
- [29] Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.
- [30] Serge Lang. Differential and Riemannian Manifolds. Springer Science & Business Media, 2012.
- [31] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.
- [32] Shun-ichi Amari. Information Geometry and Its Applications. Springer, 2016.
- [33] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet. Statistical Optimal Transport. Springer, 2025.
- [34] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
- [35] Terence Tao. Analysis, volume 1. Springer, 2006.
- [36] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2006.
- [37] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36:11592–11620, 2023.
- [38] Rafael Rafailov, Kyle Hatch, Anikait Singh, Laura Smith, Aviral Kumar, Ilya Kostrikov, Philippe Hansen-Estruch, Victor Kolev, Philip Ball, Jiajun Wu, et al. D5RL: Diverse datasets for data-driven deep reinforcement learning. arXiv preprint arXiv:2408.08441, 2024.
- [39] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- [40] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.
- [41] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. CoRR, abs/2006.09359, 2020.
- [42] Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269, 2023.
- [43] Philip J. Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
- [44] Harshit Sikchi, Qinqing Zheng, Amy Zhang, and Scott Niekum. Dual RL: Unification and new methods for reinforcement and imitation learning. arXiv preprint arXiv:2302.08560, 2023.
- [45] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
- [46] Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: Research-oriented deep offline reinforcement learning library. Advances in Neural Information Processing Systems, 36:30997–31020, 2023.
- [47] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- [48] Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline RL with no OOD actions: In-sample learning via implicit value regularization. arXiv preprint arXiv:2303.15810, 2023.
- [49] Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme Q-learning: MaxEnt RL without entropy. arXiv preprint arXiv:2301.02328, 2023.
- [50] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y. Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
- [51] Alexander Nikulin, Vladislav Kurenkov, Denis Tarasov, and Sergey Kolesnikov. Anti-exploration by random network distillation. In International Conference on Machine Learning, pages 26228–26244. PMLR, 2023.
- [52] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34:1273–1286, 2021.
- [53] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
- [54] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning, pages 1702–1712. PMLR, 2022.
- [55] Yuda Song, Yifei Zhou, Ayush Sekhari, J. Andrew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint arXiv:2210.06718, 2022.
- [56] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- [57] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [58] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [59] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [60] Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, and Glen Berseth. Reasoning with latent diffusion in offline reinforcement learning. arXiv preprint arXiv:2309.06599, 2023.
- [61] Xiaoyuan Cheng, Xiaohang Tang, and Yiming Yang. Safe and stable control via Lyapunov-guided diffusion models. arXiv preprint arXiv:2509.25375, 2025.
- [62] Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy. arXiv preprint arXiv:2502.00361, 2025.
- [63] Xiaoyi Dong, Jian Cheng, and Xi Sheryl Zhang. Maximum entropy reinforcement learning with diffusion policy. arXiv preprint arXiv:2502.11612, 2025.
- [64] Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, and Ahmed Touati. Temporal difference flows. arXiv preprint arXiv:2503.09817, 2025.
- [65] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024.
- [66] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025.
- [67] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [68] Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
- [69] Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From imitation to refinement: Residual RL for precise assembly. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2025.
- [70] Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi Fan, et al. Self-improving vision-language-action models with data generation via residual RL. arXiv preprint arXiv:2511.00091, 2025.
- [71] Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630, 2024.
- [72] Alfréd Rényi. Probability Theory. Courier Corporation, 2007.
- [73] Alice Guionnet and Bogusław Zegarlinski. Lectures on logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXVI, pages 1–134. Springer, 2004.
- [74] Herbert Federer. Geometric Measure Theory. Springer, 2014.
- [75] John Neuberger. Sobolev Gradients and Differential Equations. Springer Science & Business Media, 2009.
- [76] Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations, volume 2. Springer, 2011.
- [77] Alan Macdonald. Vector and Geometric Calculus, volume 12. CreateSpace Independent Publishing Platform, Scotts Valley, CA, USA, 2012.
- [78] Zonghao Chen, Atsushi Nitanda, Arthur Gretton, and Taiji Suzuki. Towards a unified analysis of neural networks in nonparametric instrumental variable regression: Optimization and generalization. arXiv preprint arXiv:2511.14710, 2025.
Appendix excerpts
- Global cancellation of density curvature: the total curvature of any valid probability density integrates to zero over its support. Let A ⊆ R^d be the action space. By the definition of the Laplacian operator, ∇²_a π_β = ∇ · (∇_a π_β); by the Divergence Theorem (or Stokes' Theorem) [77], the integral over A can be converted ...
- Physical insight, avoiding distributional shift: in the residual policy setting, the behavior policy π_β is refined via a displacement field δ(s, a). The curvature term ∇²_a π_β / π_β represents local volumetric deformation, the degree to which the action space is locally compressed or stretched. Equation (43) shows that the expectation of this curvature over the e...
discussion (0)