pith. sign in

arxiv: 2605.26552 · v2 · pith:GSYTBFMCnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

Pith reviewed 2026-06-29 19:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords generative model alignmentfew-step modelsStein variational gradient descentsample-based variational inferencerobotic policy alignmentimage generator fine-tuningreward-tilted sampling
0
0 comments X

The pith

FAV aligns few-step generative models by amortizing Stein variational gradient descent particle updates into the generator parameters via fixed-point regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FAV as a general framework for aligning few-step generative models that requires only sample access to the generator and reference distribution. It recasts alignment as sampling from a reward-tilted target distribution and uses Stein variational gradient descent to produce particle updates toward that target. Those updates are then amortized into the generator by training it to reproduce the moves through fixed-point regression. The method is applied to robotic policy alignment and to fine-tuning image generators of several architectures.

Core claim

Alignment of few-step generative models can be performed by casting the problem as sampling from a reward-tilted distribution anchored to a reference, running Stein variational gradient descent to move particles, and regressing the generator parameters so that its next samples match the particle updates, thereby amortizing the inference steps without needing tractable likelihoods, specific ODE solvers, or model-family restrictions.

What carries the argument

Fixed-point regression that amortizes the particle updates produced by Stein variational gradient descent on the reward-tilted distribution into the parameters of the generator.

If this is right

  • On robotic manipulation, the aligned policies outperform prevailing policy extraction baselines on 56 offline and 30 offline-to-online reinforcement learning tasks.
  • The same procedure fine-tunes GANs, drifting models, consistency models, and flow maps for image generation.
  • The approach scales from ImageNet-256 to 1024 squared text-to-image synthesis.
  • Alignment succeeds using only sample access rather than requiring tractable likelihoods or particular dynamics solvers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The amortization step could be tested for stability when the reward function changes rapidly between alignment rounds.
  • The same regression idea might extend to amortizing other particle-based inference methods beyond Stein variational gradient descent.
  • If the fixed-point mapping holds, the aligned generator could serve as an approximate sampler for new reward functions without re-running particle optimization from scratch.

Load-bearing premise

The particle updates from Stein variational gradient descent on the reward-tilted distribution can be faithfully recovered by training the generator parameters through fixed-point regression without introducing bias or instability.

What would settle it

A direct comparison experiment in which samples drawn from the aligned generator after FAV training are shown to have a reward distribution that differs measurably from the distribution obtained by running many steps of Stein variational gradient descent directly on the same tilted target.

Figures

Figures reproduced from arXiv: 2605.26552 by Dohyun Kim, Hyeongyu Kang, Jaewoo Lee, Jeongjae Lee, Jinkyoo Park, Jong Chul Ye, Kyuil Sim, Minsu Kim, Sanghyeok Choi, Tabitha Edith Lee, Taeyoung Yun, Woocheol Shin.

Figure 1
Figure 1. Figure 1: Left: Sampling from the Q-tilted distribution via FAV yields state-of-the-art performance on offline and offline-to-online RL tasks. Right: Sampling from a human-preference-tilted distribution through FAV improves image quality of a high-resolution text-to-image generator. where pref denotes a reference distribution, such as a distribution induced by a pretrained generator or an empirical data distribution… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of SVGD transport. shared bandwidth σ yields the following approximated transport field: ϕˆ∗ qθ,q∗ σ (x) = E x ′∼qθ x ref∼pref " kσ(x ′ , x) [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Sampling from a reward-tilted distribution. On the 8 Gaussian p(x) with the reward function shown on the left, we target to sample from p(x) exp(r(x)). Regularized REINFORCE and Adjoint Matching are not applicable to 1-step generators, whereas FAV applies uniformly across all architectures. For MeanFlow from 2 to 16 sampling steps, FAV consistently yields samples that better match the reward-tilted target … view at source ↗
Figure 4
Figure 4. Figure 4: Target reward vs evaluation metrics. Aesthetic Score is the target reward; (a) HPSv2, (b) ImageReward, (c) DreamSim diversity and (d) CLIP diversity evaluate quality and diversity to indicate reward overoptimization. FAV achieves the best Pareto frontier [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: FAV-B on black-box reward functions. FAV outperforms baselines in few-step regimes [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: FAV with non-differentiable reward. (a) For aesthetic-score optimization, all methods are trained for 200 steps. (b),(c) For compressibility and incompressibility, FAV-B is trained for 40 steps, while Flow-GRPO+KL is run for the same wall-clock time, corresponding to 120 steps. Wall-clock time is measured on 4 NVIDIA RTX 3090 GPUs. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training curves for offline-to-online RL. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation analysis for each components of FAV. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Sensitivity analysis for temperature parameter β. 0 50 100 150 200 Step 5 6 7 8 Aesthetic Score (a) Step vs Aesthetic 0 50 100 150 200 Step 0.4 0.5 0.6 0.7 0.8 DreamSim Div (b) Step vs DreamSim Div = 0.1 = 0.3 (default) = 0.5 = 1 [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: , increasing τ generally improves reward optimization but reduces sample diversity. We attribute this trade-off to the locality of kernel interactions. When τ is small, samples mainly interact with nearby reference points, preserving the multi-modal structure of the prior and maintaining diversity. However, such local updates limit reward improvement because samples are unlikely to move toward higher-rewa… view at source ↗
Figure 11
Figure 11. Figure 11: Training dynamics of each alignment method. NSFW classifier as the target reward for (a),(b); HPS as the target reward for (c),(d). 35 [PITH_FULL_IMAGE:figures/full_fig_p035_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: shows qualitative comparisons on ImageNet 256, where each method aligns the iMeanFlow (8 steps) model with the aesthetic score as the target reward. Base (iMF 8-step) FAV (Ours) DRaFT Adjoint Matching Flow-GRPO +KL Best-of-256 ReNO-50 [PITH_FULL_IMAGE:figures/full_fig_p036_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: shows qualitative samples from Sana-Sprint 1.6B (4-step) fine-tuned with each alignment method. The first four rows show samples from models fine-tuned on DrawBench prompts to maximize the HPS reward, and the last row shows samples from models fine-tuned on Sneaky prompts to minimize the NSFW classifier score. Since Sneaky prompts contain adversarial prompt and cannot be disclosed here, we refer readers t… view at source ↗
read the original abstract

Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few-step Generative Models Alignment via Sample-based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV on two domains: robotics manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including GAN, drifting model, consistency models, and flow maps, scaling from ImageNet-$256$ to 1024$^2$ text-to-image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FAV, a general framework for aligning few-step generative models that requires only sample access. Alignment is cast as sampling from a reward-tilted distribution p(x) ∝ p_ref(x) exp(r(x)); SVGD is used to generate particle updates on this target, which are then amortized into the generator parameters via fixed-point regression. Empirical claims include outperformance over policy extraction baselines on 56 offline and 30 offline-to-online robotic manipulation tasks, plus successful fine-tuning of diverse few-step backbones (GAN, drifting models, consistency models, flow maps) scaling from ImageNet-256 to 1024² text-to-image synthesis. Code is released.

Significance. If the amortization step is shown to be unbiased and stable, the method would provide a broadly applicable alignment procedure that avoids assumptions on likelihoods, ODE/SDE solvers, or model families, with demonstrated scaling across robotics and high-resolution image domains. Explicit code release is a positive contribution to reproducibility.

major comments (2)
  1. [Abstract and method overview] The central claim that fixed-point regression of generator parameters recovers the SVGD stationary distribution on the reward-tilted target (and thereby supports the alignment guarantee) is load-bearing, yet the abstract and method description provide no derivation establishing that the regression objective is unbiased with respect to the Stein operator or that the few-step generator class is sufficiently expressive to represent the required transport map without introducing bias or instability.
  2. [Experimental evaluation sections] Empirical performance claims (outperformance on 86 RL tasks and scaling to 1024² synthesis) are presented without reported error bars, ablation studies on the amortization step, or verification that the learned generator's empirical distribution matches the SVGD particles on the tilted distribution; these omissions prevent assessment of whether the reported gains are attributable to the proposed amortization or to other factors.
minor comments (2)
  1. [Method] Notation for the fixed-point regression objective and the precise form of the Stein operator used in the amortization step should be introduced with explicit equations rather than descriptive prose.
  2. [Abstract] The abstract states results across 'diverse few-step backbones' but does not list the exact model families or training hyperparameters used in the image-alignment experiments; a table summarizing these would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity and rigor would strengthen the presentation of FAV. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and method overview] The central claim that fixed-point regression of generator parameters recovers the SVGD stationary distribution on the reward-tilted target (and thereby supports the alignment guarantee) is load-bearing, yet the abstract and method description provide no derivation establishing that the regression objective is unbiased with respect to the Stein operator or that the few-step generator class is sufficiently expressive to represent the required transport map without introducing bias or instability.

    Authors: We agree that the abstract and method overview are concise and do not contain an explicit derivation. The full method section motivates the fixed-point regression as directly amortizing the SVGD particle updates on the reward-tilted distribution, with the regression objective designed to enforce the fixed-point condition. To make the connection to the Stein operator explicit and address potential bias, we will add a dedicated subsection deriving that the regression loss corresponds to an empirical estimate of the Stein discrepancy, establishing unbiasedness in the large-sample limit. We will also expand the discussion of expressiveness to acknowledge that the few-step generator approximates the transport map and may introduce some bias, while noting that the empirical results across diverse backbones provide supporting evidence; we will include a brief analysis of approximation quality. revision: yes

  2. Referee: [Experimental evaluation sections] Empirical performance claims (outperformance on 86 RL tasks and scaling to 1024² synthesis) are presented without reported error bars, ablation studies on the amortization step, or verification that the learned generator's empirical distribution matches the SVGD particles on the tilted distribution; these omissions prevent assessment of whether the reported gains are attributable to the proposed amortization or to other factors.

    Authors: We agree that these elements are necessary for a complete evaluation. In the revised version we will report error bars (standard deviations across multiple random seeds) for all quantitative results on the robotics and image tasks. We will add ablation studies that isolate the amortization step, including variants with and without the fixed-point regression. We will also include a verification experiment that directly compares the empirical distribution of the learned generator to the SVGD particles (e.g., via MMD or sliced Wasserstein distance on representative tasks) to confirm that the reported gains arise from the proposed amortization procedure. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper describes casting alignment as sampling from a reward-tilted distribution, applying SVGD for sample-based VI, and amortizing particle updates via fixed-point regression into generator parameters. No quoted equations or steps in the abstract reduce any claimed prediction or result to a fitted input, self-definition, or self-citation chain by construction. Performance is reported on external robotics RL tasks and image synthesis benchmarks, with no indication that results are forced by the method's own fitted quantities. The derivation remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities are named in the provided text. The approach invokes standard SVGD and fixed-point regression as background techniques.

axioms (1)
  • domain assumption Stein Variational Gradient Descent produces useful particle updates for sampling from reward-tilted distributions anchored to a reference.
    Invoked when casting alignment as sampling and applying SVGD.

pith-pipeline@v0.9.1-grok · 5782 in / 1385 out tokens · 45415 ms · 2026-06-29T19:14:28.486045+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

116 extracted references · 27 canonical work pages · 16 internal anchors

  1. [1]

    Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

    Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, and Max Simchowitz. Joint distillation for fast likelihood evaluation and sampling in flow-based models.arXiv preprint arXiv:2512.02636, 2025

  2. [2]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

  3. [3]

    Demystifying MMD GANs

    Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans.arXiv preprint arXiv:1801.01401, 2018

  4. [4]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InThe Twelfth International Conference on Learning Representations

  5. [5]

    Variational inference: A review for statisticians.Journal of the American statistical Association, 112(518):859–877, 2017

    David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians.Journal of the American statistical Association, 112(518):859–877, 2017

  6. [6]

    Flow map matching with stochastic interpolants: A mathematical framework for consistency models

    Nicholas Matthew Boffi, Michael Samuel Albergo, and Eric Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models. Transactions on Machine Learning Research

  7. [7]

    How to build a consistency model: Learning flow maps via self-distillation

    Nicholas Matthew Boffi, Michael Samuel Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  8. [8]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Yash Katariya, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman- Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018

  9. [9]

    Score regularized policy optimization through diffusion behavior

    Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score regularized policy optimization through diffusion behavior. InThe Twelfth International Conference on Learning Representations

  10. [10]

    Sana-sprint: One-step diffusion with continuous-time consistency distillation

    Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time consistency distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16185–16195, 2025

  11. [11]

    Mean shift, mode seeking, and clustering.IEEE transactions on pattern analysis and machine intelligence, 17(8):790–799, 1995

    Yizong Cheng. Mean shift, mode seeking, and clustering.IEEE transactions on pattern analysis and machine intelligence, 17(8):790–799, 1995

  12. [12]

    Diffusion posterior sampling for general noisy inverse problems

    Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. InThe Eleventh International Conference on Learning Representations

  13. [13]

    A kernel test of goodness of fit

    Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. InInternational conference on machine learning, pages 2606–2615. PMLR, 2016

  14. [14]

    Directly fine-tuning diffusion models on differentiable rewards

    Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. InThe Twelfth International Conference on Learning Representations

  15. [15]

    Relative trajectory balance is equivalent to trust-pcl.arXiv preprint arXiv:2509.01632, 2025

    Tristan Deleu, Padideh Nouri, Yoshua Bengio, and Doina Precup. Relative trajectory balance is equivalent to trust-pcl.arXiv preprint arXiv:2509.01632, 2025

  16. [16]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  17. [17]

    Generative Modeling via Drifting

    Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026. 11

  18. [18]

    John Wiley & Sons, 1985

    Luc Devroye and László Györfi.Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985

  19. [19]

    Diffusion-based reinforcement learning via q-weighted variational policy optimization

    Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. Advances in Neural Information Processing Systems, 37:53945–53968, 2024

  20. [20]

    Consistency models as a rich and efficient policy class for reinforce- ment learning

    Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforce- ment learning. InThe Twelfth International Conference on Learning Representations

  21. [21]

    Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control

    Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. InThe Thirteenth International Conference on Learning Representations

  22. [22]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.Neural networks, 107:3–11, 2018

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.Neural networks, 107:3–11, 2018

  23. [23]

    Reno: Enhancing one-step text-to-image models through reward-based noise optimization

    Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems, 37:125487–125519, 2024

  24. [24]

    Nsfw image detection model

    Falcons.ai. Nsfw image detection model. https://huggingface.co/Falconsai/nsfw_ image_detection, 2024. Accessed: 2025-10-09

  25. [25]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

  26. [26]

    One step diffusion via shortcut models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. InThe Thirteenth International Conference on Learning Representations

  27. [27]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning.arXiv preprint arXiv:2004.07219, 2020

  28. [28]

    Dreamsim: learning new dimensions of human visual similarity using synthetic data

    Stephanie Fu, Netanel Y Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: learning new dimensions of human visual similarity using synthetic data. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 50742–50768, 2023

  29. [29]

    Mean flows for one-step generative modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  30. [30]

    Improved Mean Flows: On the Challenges of Fastforward Generative Models

    Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models.arXiv preprint arXiv:2512.02012, 2025

  31. [31]

    Consistency models made easy

    Zhengyang Geng, Ashwini Pokle, Weijian Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. InThe Thirteenth International Conference on Learning Representations

  32. [32]

    Generative adversarial nets.Advances in neural information processing systems, 27, 2014

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.Advances in neural information processing systems, 27, 2014

  33. [33]

    Dimensionality reduction by learning an invariant mapping

    Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006

  34. [34]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies.arXiv preprint arXiv:2304.10573, 2023

  35. [35]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 12

  36. [36]

    Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

    Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, and Max Simchowitz. Diamond maps: Efficient reward alignment via stochastic flow maps.arXiv preprint arXiv:2602.05993, 2026

  37. [37]

    Glass flows: Transition sampling for alignment of flow and diffusion models.arXiv preprint arXiv:2509.25170, 2025

    Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky TQ Chen, Yaron Lipman, and Brian Karrer. Glass flows: Transition sampling for alignment of flow and diffusion models.arXiv preprint arXiv:2509.25170, 2025

  38. [38]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  39. [39]

    Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

    Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog.arXiv preprint arXiv:1907.00456, 2019

  40. [40]

    Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

    Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

  41. [41]

    Diffu- sion fine-tuning via reparameterized policy gradient of the soft q-function.arXiv preprint arXiv:2512.04559, 2025

    Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, and Jinkyoo Park. Diffu- sion fine-tuning via reparameterized policy gradient of the soft q-function.arXiv preprint arXiv:2512.04559, 2025

  42. [42]

    If at first you don’t succeed, try, try again: Faithful diffusion-based text-to-image generation by selection

    Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. If at first you don’t succeed, try, try again: Faithful diffusion-based text-to-image generation by selection. arXiv preprint arXiv:2305.13308, 2023

  43. [43]

    Consistency trajectory models: Learning probability flow ode trajectory of diffusion

    Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. InThe Twelfth International Conference on Learning Representations

  44. [44]

    Test-time alignment of diffusion models without reward over-optimization

    Sunwoo Kim, Minkyu Kim, and Dongmin Park. Test-time alignment of diffusion models without reward over-optimization. InThe Thirteenth International Conference on Learning Representations

  45. [45]

    Auto-encoding variational bayes.stat, 1050:1, 2014

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes.stat, 1050:1, 2014

  46. [46]

    Rl with kl penalties is better viewed as bayesian inference

    Tomasz Korbak, Ethan Perez, and Christopher Buckley. Rl with kl penalties is better viewed as bayesian inference. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, 2022

  47. [47]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. InInternational Conference on Learning Representations

  48. [48]

    A Unified View of Score-Based and Drifting Models

    Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, and Molei Tao. A unified view of drifting and score-based models.arXiv preprint arXiv:2603.07514, 2026

  49. [49]

    Diffusion alignment as variational expectation-maximization.arXiv preprint arXiv:2510.00502, 2025

    Jaewoo Lee, Minsu Kim, Sanghyeok Choi, Inhyuck Song, Sujin Yun, Hyeongyu Kang, Woocheol Shin, Taeyoung Yun, Kiyoung Om, and Jinkyoo Park. Diffusion alignment as variational expectation-maximization.arXiv preprint arXiv:2510.00502, 2025

  50. [50]

    Q-learning with Adjoint Matching

    Qiyang Li and Sergey Levine. Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234, 2026

  51. [51]

    Decoupled q-chunking.arXiv preprint arXiv:2512.10926, 2025

    Qiyang Li, Seohong Park, and Sergey Levine. Decoupled q-chunking.arXiv preprint arXiv:2512.10926, 2025

  52. [52]

    Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024. 13

  53. [53]

    Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding

    Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gökcen Eraslan, Surag Nair, Tommaso Biancalani, Shuiwang Ji, Aviv Regev, Sergey Levine, et al. Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  54. [54]

    Gradient estimators for implicit models

    Yingzhen Li and Richard E Turner. Gradient estimators for implicit models. InInternational Conference on Learning Representations, 2018

  55. [55]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations

  56. [56]

    Flow-grpo: Training flow matching models via online rl

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di ZHANG, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  57. [57]

    Stein variational gradient descent as gradient flow.Advances in neural information processing systems, 30, 2017

    Qiang Liu. Stein variational gradient descent as gradient flow.Advances in neural information processing systems, 30, 2017

  58. [58]

    A kernelized stein discrepancy for goodness-of-fit tests

    Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-fit tests. InInternational conference on machine learning, pages 276–284. PMLR, 2016

  59. [59]

    Stein variational gradient descent: A general purpose bayesian inference algorithm.Advances in neural information processing systems, 29, 2016

    Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm.Advances in neural information processing systems, 29, 2016

  60. [60]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

  61. [61]

    Value gradient guidance for flow matching alignment

    Zhen Liu, Tim Z Xiao, Carles Domingo-Enrich, Weiyang Liu, and Dinghuai Zhang. Value gradient guidance for flow matching alignment. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  62. [62]

    Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning

    Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. InInternational Conference on Machine Learning, pages 22825–22855. PMLR, 2023

  63. [63]

    Simplifying, stabilizing and scaling continuous-time consistency models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. InThe Thirteenth International Conference on Learning Representations

  64. [64]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

  65. [65]

    Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models.Advances in Neural Information Processing Systems, 36:76525–76546, 2023

    Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models.Advances in Neural Information Processing Systems, 36:76525–76546, 2023

  66. [66]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets.arXiv preprint arXiv:2006.09359, 2020

  67. [67]

    Springer Science & Business Media, 2013

    Yurii Nesterov.Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013

  68. [68]

    Random gradient-free minimization of convex func- tions.F oundations of Computational Mathematics, 17(2):527–566, 2017

    Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex func- tions.F oundations of Computational Mathematics, 17(2):527–566, 2017

  69. [69]

    Control functionals for monte carlo integra- tion.Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):695–718, 2017

    Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for monte carlo integra- tion.Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):695–718, 2017

  70. [70]

    Rl for consis- tency models: Faster reward guided text-to-image generation.arXiv preprint arXiv:2404.03673, 2024

    Owen Oertell, Jonathan D Chang, Yiyi Zhang, Kianté Brantley, and Wen Sun. Rl for consis- tency models: Faster reward guided text-to-image generation.arXiv preprint arXiv:2404.03673, 2024. 14

  71. [71]

    Ogbench: Bench- marking offline goal-conditioned rl

    Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Bench- marking offline goal-conditioned rl. InThe Thirteenth International Conference on Learning Representations

  72. [72]

    Is value learning really the main bottleneck in offline rl?Advances in Neural Information Processing Systems, 37:79029–79056, 2024

    Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline rl?Advances in Neural Information Processing Systems, 37:79029–79056, 2024

  73. [73]

    Flow q-learning

    Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. InInternational Conference on Machine Learning (ICML), 2025

  74. [74]

    On estimation of a probability density function and mode.The annals of mathematical statistics, 33(3):1065–1076, 1962

    Emanuel Parzen. On estimation of a probability density function and mode.The annals of mathematical statistics, 33(3):1065–1076, 1962

  75. [75]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regres- sion: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019

  76. [76]

    Aligning text-to-image diffusion models with reward backpropagation

    Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. 2023

  77. [77]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  78. [78]

    Variational inference with normalizing flows

    Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015

  79. [79]

    Vargrad: a low-variance gradient estimator for variational inference.Advances in Neural Information Processing Systems, 33:13481–13492, 2020

    Lorenz Richter, Ayman Boustati, Nikolas Nüsken, Francisco Ruiz, and Omer Deniz Akyildiz. Vargrad: a low-variance gradient estimator for variational inference.Advances in Neural Information Processing Systems, 33:13481–13492, 2020

  80. [80]

    Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Showing first 80 references.