
arxiv: 2605.19804 · v1 · pith:2AFDZRIVnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.LG

Stitched Value Model for Diffusion Alignment

Pith reviewed 2026-05-20 06:25 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords diffusion alignment · value model · model stitching · latent space · reward models · generative models · steering methods

The pith

StitchVM stitches pixel-space reward models with frozen diffusion backbones to estimate values directly at noisy latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes stitching a truncated pixel-space reward model to a frozen diffusion backbone so that the hybrid can predict rewards at noisy intermediate latents rather than only at clean outputs. Existing alignment methods rely on either biased Tweedie approximations or expensive Monte Carlo rollouts; the stitched model avoids both by constructing the correct value function once and amortizing it across samples and steps. This matters for practical diffusion alignment because reward signals for prompt fidelity or aesthetics are defined only on final images, yet the alignment procedure must evaluate noisy states. A sympathetic reader would see the result as shifting from per-sample estimation to a reusable latent-space value model.
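
For reference, the three quantities contrasted above can be written in one common form as follows. The notation is assumed here for illustration and is not taken from the paper: $x_t$ is the noisy latent at step $t$, $x_0$ the clean output, $r$ the reward, and $K$ the number of rollouts.

```latex
% Value function: expected clean-output reward given the noisy latent
V_t(x_t) = \mathbb{E}\left[ r(x_0) \mid x_t \right]

% Tweedie-style approximation: reward of the posterior mean
% (one evaluation, but biased whenever r is nonlinear)
V_t(x_t) \approx r\left( \mathbb{E}[x_0 \mid x_t] \right)

% Monte Carlo approximation: average reward over K sampled completions
V_t(x_t) \approx \frac{1}{K} \sum_{k=1}^{K} r\big(x_0^{(k)}\big),
\qquad x_0^{(k)} \sim p(x_0 \mid x_t)
```

A learned value model, as proposed here, would instead approximate the first line directly with a single network evaluation at $x_t$.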

Core claim

StitchVM starts from an existing truncated pixel-space reward model and attaches a frozen diffusion backbone as its head. The hybrid keeps the pretrained reward capability from the pixel model and gains the backbone's native handling of noisy latents. Stitching and finetuning are lightweight, taking only 10 GPU-hours for CLIP ViT-L with SD 3.5 Medium. The approach lets the correct value function for actual noisy latents be built once and then reused over many samples and iterations instead of relying on rough per-sample approximations.
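
The general shape of such a hybrid can be illustrated with a deliberately tiny, one-dimensional sketch. Everything in it is an assumption made for illustration: the function names, the scalar "stitch" parameters, the fixed noise level, and the squared-error regression onto clean rewards are hypothetical and are not the paper's architecture or loss, which this page does not specify. The sketch only shows a trainable connector between two frozen parts being fitted so that the composite predicts the clean reward from a noisy input.

```python
import random

def frozen_backbone(x_t):
    # Stand-in for frozen diffusion-backbone features of a noisy latent.
    return 0.5 * x_t

def frozen_reward_tail(h):
    # Stand-in for the retained layers of the pixel-space reward model.
    return 2.0 * h + 1.0

def clean_reward(x_0):
    # Toy reward, defined only on clean samples.
    return 2.0 * x_0 + 1.0

def train_stitch(steps=20000, lr=0.001, t=0.5, seed=0):
    """Fit only the stitch (w, b) by regressing onto clean rewards."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # the only trainable parameters (assumption)
    for _ in range(steps):
        x_0 = rng.gauss(0.0, 1.0)
        # Noisy latent as a linear mix of clean sample and noise.
        x_t = (1.0 - t) * x_0 + t * rng.gauss(0.0, 1.0)
        feat = frozen_backbone(x_t)
        pred = frozen_reward_tail(w * feat + b)
        err = pred - clean_reward(x_0)
        # Gradient of 0.5 * err**2 through the frozen tail (slope 2.0).
        w -= lr * err * 2.0 * feat
        b -= lr * err * 2.0
    return w, b
```

In this toy setting the conditional expectation of the reward at t = 0.5 is 2 * x_t + 1, which corresponds to w = 2 and b = 0, so the fitted stitch can be checked against a known answer. Squared-error regression onto clean rewards is used because its minimizer is the conditional expectation; whether the paper trains this way is not stated here.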

What carries the argument

The stitched hybrid model that combines a truncated pixel-space reward model with a frozen diffusion backbone as its head.

If this is right

  • DPS alignment runs 3.2 times faster while halving peak GPU memory.
  • DiffusionNFT runs 2.3 times faster.
  • Value functions are constructed once and then amortized across many samples and alignment iterations.
  • The same stitching method yields improvements across a broad range of downstream steering and post-training techniques.
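
The amortization argument in the list above can be made concrete with a back-of-the-envelope cost model. This is a minimal sketch under stated assumptions: cost is counted in abstract "backbone forward passes", a rollout-based estimate is assumed to cost rollouts times remaining steps, and a learned value model is assumed to cost one pass per query. The function names are hypothetical and none of the numbers are measurements from the paper.

```python
import math

def monte_carlo_passes(num_rollouts, remaining_steps):
    """Backbone passes for one value query via rollouts (toy cost model)."""
    return num_rollouts * remaining_steps

def break_even_queries(one_time_cost, per_query_baseline, per_query_amortized):
    """Smallest number of queries at which the one-time cost is recovered."""
    saving = per_query_baseline - per_query_amortized
    if saving <= 0:
        raise ValueError("amortized model must be cheaper per query")
    return math.ceil(one_time_cost / saving)
```

For example, with 4 rollouts over 25 remaining steps a query costs 100 passes under this model, and a one-time cost equivalent to 1000 passes is recovered after 11 queries if the amortized model needs 1 pass per query.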

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stitching pattern could be applied to other latent generative models that need reward signals at intermediate noise levels.
  • Multiple pretrained reward models could be stitched in parallel to create composite value functions for combined objectives.
  • The reduced per-sample cost opens the possibility of running alignment loops inside interactive or real-time generation pipelines.

Load-bearing premise

The stitching procedure preserves the original reward capability while the frozen backbone supplies accurate value estimates specifically at noisy latents without any further training of the backbone.

What would settle it

Experiments that compare the stitched model's value estimates at intermediate noisy latents against high-fidelity Monte Carlo rollouts and find large systematic discrepancies, or that show no speedup when the model is plugged into DPS or DiffusionNFT.
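
A check of this kind could be scored with two simple statistics, mean squared error and Spearman rank correlation, computed between the value model's outputs and reference Monte Carlo estimates for the same noisy latents. The sketch below uses only the Python standard library; the function names are hypothetical, and the inputs are assumed to be two equal-length lists of floats produced elsewhere.

```python
import math

def mean_squared_error(pred, ref):
    """Average squared difference between predictions and references."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))
```

A low error together with a rank correlation near 1 at each tested noise level would support the claim; large systematic gaps would count against it.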

Original abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.
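
The bias of a Tweedie-style estimate mentioned in the abstract can be seen in a toy example. The sketch below assumes a one-dimensional Gaussian posterior over the clean sample and a quadratic reward; both are illustrative choices, and the function names are hypothetical. The Monte Carlo estimator here draws directly from the assumed posterior, whereas in a real diffusion model each draw would require a rollout.

```python
import random

def reward(x):
    # Nonlinear toy reward; any nonlinearity produces a gap.
    return x * x

def tweedie_estimate(posterior_mean):
    """Reward evaluated at the posterior mean (one evaluation)."""
    return reward(posterior_mean)

def monte_carlo_estimate(posterior_mean, posterior_std, num_samples, seed=0):
    """Average reward over samples from the assumed Gaussian posterior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += reward(rng.gauss(posterior_mean, posterior_std))
    return total / num_samples

def exact_value(posterior_mean, posterior_std):
    """Closed form of E[x^2] for a Gaussian: mean^2 + std^2."""
    return posterior_mean ** 2 + posterior_std ** 2
```

With mean 1.0 and standard deviation 0.5, the exact value is 1.25 while the Tweedie-style estimate is 1.0; the Monte Carlo average approaches 1.25 as the number of samples grows, at the cost of many reward evaluations.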

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes StitchVM, a model-stitching framework that attaches a frozen diffusion backbone to a truncated pixel-space reward model to produce a value function for noisy latents. The central claim is that this hybrid retains robust reward capability from the pixel-space model while inheriting native handling of noisy states from the backbone, thereby constructing the correct value function once and amortizing it over alignment iterations. Downstream results report 3.2× speedup and halved memory for DPS, 2.3× speedup for DiffusionNFT, and a total stitching/finetuning cost of 10 GPU-hours.

Significance. If the hybrid indeed supplies accurate conditional expectations of clean-image rewards given noisy latents, the method would replace expensive per-sample approximations with a reusable, lightweight model and materially lower the barrier to reward-based diffusion alignment. The reported efficiency gains and broad applicability to both steering and post-training pipelines constitute a practical contribution.

major comments (2)
  1. [Experiments] The manuscript provides no quantitative validation (MSE, rank correlation, or similar) of StitchVM value estimates against high-sample Monte Carlo rollouts at fixed intermediate timesteps t. Without this check, the claim that the hybrid computes the conditional expectation rather than a Tweedie-style point estimate remains unsubstantiated.
  2. [§5] §5 (Downstream Alignment Experiments): the reported 3.2× and 2.3× speedups are presented without ablations, statistical significance tests, or controls for confounding factors such as implementation details or hyperparameter tuning, weakening support for the efficiency claims.
minor comments (1)
  1. [Abstract] The abstract states that the stitching procedure is 'exceptionally lightweight' but does not specify the exact layers frozen, the loss used for the 10 GPU-hour finetuning stage, or the truncation point chosen for the pixel-space reward model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and have incorporated revisions to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Experiments] The manuscript provides no quantitative validation (MSE, rank correlation, or similar) of StitchVM value estimates against high-sample Monte Carlo rollouts at fixed intermediate timesteps t. Without this check, the claim that the hybrid computes the conditional expectation rather than a Tweedie-style point estimate remains unsubstantiated.

    Authors: We agree that a direct quantitative comparison to high-sample Monte Carlo rollouts at fixed timesteps would provide stronger substantiation for the claim that StitchVM approximates the conditional expectation. The original submission emphasized downstream utility because Monte Carlo rollouts at intermediate noise levels are precisely the expensive computation we aim to amortize. In the revision we will add a targeted validation subsection reporting MSE and Spearman rank correlation between StitchVM outputs and 100-sample Monte Carlo estimates at several fixed t values (e.g., t=200, 400, 600) on a held-out prompt set, while keeping the added compute modest. revision: yes

  2. Referee: [§5] §5 (Downstream Alignment Experiments): the reported 3.2× and 2.3× speedups are presented without ablations, statistical significance tests, or controls for confounding factors such as implementation details or hyperparameter tuning, weakening support for the efficiency claims.

    Authors: The speedups were measured by executing the identical alignment pipelines (same random seeds, batch sizes, hardware, and hyper-parameters) with and without StitchVM. To strengthen the presentation we will expand §5 with (i) ablations across two additional reward models, (ii) mean and standard deviation of wall-clock time over five independent runs, and (iii) paired t-test p-values for the timing differences. We will also add a short paragraph detailing the exact implementation stack and hyper-parameter settings used for both baselines and StitchVM variants. revision: yes
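
The timing analysis promised in the second response could be computed along the following lines. This is a minimal standard-library sketch with hypothetical function names; it returns the paired t statistic and its degrees of freedom and leaves the p-value lookup to a statistics package or a t table. Inputs are assumed to be wall-clock times from matched runs (same seed and configuration) with and without the value model.

```python
import math
import statistics

def speedup(baseline_times, treated_times):
    """Ratio of mean baseline time to mean treated time."""
    return statistics.mean(baseline_times) / statistics.mean(treated_times)

def paired_t_statistic(baseline_times, treated_times):
    """Paired t statistic and degrees of freedom for matched runs."""
    diffs = [b - t for b, t in zip(baseline_times, treated_times)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(n)), n - 1
```

As a purely illustrative call with placeholder numbers (not measurements), `paired_t_statistic([10, 12, 11, 13, 14], [5, 6, 5, 7, 7])` gives a t statistic of about 18.97 with 4 degrees of freedom, and `speedup` on the same lists gives 2.0.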

Circularity Check

0 steps flagged

StitchVM constructs hybrid value model via stitching of pretrained components; no circular reduction to inputs

Full rationale

The paper proposes StitchVM as a stitching framework that attaches a frozen diffusion backbone to a truncated pixel-space reward model, followed by lightweight finetuning (e.g., 10 GPU-hours for CLIP ViT-L with SD 3.5 Medium). This transfers existing reward capability to noisy latents without deriving the value function from fitted parameters, self-referential equations, or load-bearing self-citations. The central claim, that the hybrid yields the correct value function for noisy latents, is presented as an engineering construction amortized over samples, not a mathematical identity or prediction forced by prior fits. No steps reduce by construction to the paper's own inputs; downstream speedups (DPS 3.2×, DiffusionNFT 2.3×) are empirical outcomes rather than tautological results. The derivation remains self-contained against external pretrained models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on the domain assumption that pretrained clean-image reward models can be hybridized with diffusion backbones while retaining capability; no new free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption: A truncated pixel-space reward model retains its core reward judgment capability when a frozen diffusion backbone is attached as its head.
    This premise is required for the hybrid to function as a value model for noisy latents.

pith-pipeline@v0.9.0 · 5891 in / 1337 out tokens · 54570 ms · 2026-05-20T06:25:51.094586+00:00 · methodology


