Stitched Value Model for Diffusion Alignment

Dominik Narnhofer; Federico Tombari; Goutam Bhat; Hyojun Go; Hyungjin Chung; Konrad Schindler; Li Mi; Prune Truong; Serge Belongie; Zhaochong An

arXiv: 2605.19804 · v1 · pith:2AFDZRIVnew · submitted 2026-05-19 · cs.CV · cs.AI · cs.LG

Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler

Pith reviewed 2026-05-20 06:25 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG

keywords: diffusion alignment, value model, model stitching, latent space, reward models, generative models, steering methods

The pith

StitchVM stitches pixel-space reward models with frozen diffusion backbones to estimate values directly at noisy latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes stitching a truncated pixel-space reward model to a frozen diffusion backbone so that the hybrid can predict rewards at noisy intermediate latents rather than only at clean outputs. Existing alignment methods rely on either biased Tweedie approximations or expensive Monte Carlo rollouts; the stitched model aims to avoid both by learning the value function for noisy latents once and amortizing it across samples and steps. This matters for practical diffusion alignment because reward signals for prompt fidelity or aesthetics are defined only on final images, yet the alignment procedure needs value estimates at noisy intermediate latents. A sympathetic reader would see the result as shifting from per-sample estimation to a reusable latent-space value model.

Core claim

StitchVM starts from an existing truncated pixel-space reward model and attaches a frozen diffusion backbone as its head. The hybrid keeps the pretrained reward capability from the pixel model and gains the backbone's native handling of noisy latents. Stitching and finetuning are lightweight, taking only 10 GPU-hours for CLIP ViT-L with SD 3.5 Medium. The approach lets the correct value function for actual noisy latents be built once and then reused over many samples and iterations instead of relying on rough per-sample approximations.
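The composition described here can be pictured as three functions applied in sequence, following the formula this page quotes from the paper, V(z_t) = r(s(u(z_t))): the noisy latent first passes through the early layers of the diffusion backbone, then through a stitching layer, then through the remaining layers of the reward model. The sketch below is a minimal, framework-free illustration of that wiring only. The class name, the field names, and the three toy functions are hypothetical stand-ins, not the paper's architecture, and which parts are finetuned is not stated on this page.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class StitchedValueModel:
    """Wiring only: V(z_t) = reward_tail(stitch(backbone_front(z_t)))."""

    backbone_front: Callable[[Vector], Vector]  # early diffusion layers (kept frozen)
    stitch: Callable[[Vector], Vector]          # adapter between the two feature spaces
    reward_tail: Callable[[Vector], float]      # later layers of the truncated reward model

    def __call__(self, z_t: Vector) -> float:
        features = self.backbone_front(z_t)
        adapted = self.stitch(features)
        return self.reward_tail(adapted)


# Hypothetical toy stand-ins, chosen only so the example runs.
def toy_backbone_front(z_t: Vector) -> Vector:
    return [2.0 * v for v in z_t]


def toy_stitch(features: Vector) -> Vector:
    return [v + 1.0 for v in features]


def toy_reward_tail(adapted: Vector) -> float:
    return sum(adapted) / len(adapted)


value_model = StitchedValueModel(toy_backbone_front, toy_stitch, toy_reward_tail)
print(value_model([0.5, -0.5, 1.0]))
```

Note that although the abstract calls the backbone the "head" of the hybrid, in the quoted formula the backbone layers are the ones that see the noisy latent first.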

What carries the argument

The stitched hybrid model that combines a truncated pixel-space reward model with a frozen diffusion backbone as its head.

If this is right

DPS alignment runs 3.2 times faster while halving peak GPU memory.
DiffusionNFT runs 2.3 times faster.
Value functions are constructed once and then amortized across many samples and alignment iterations.
The same stitching method yields improvements across a broad range of downstream steering and post-training techniques.
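The amortization argument reduces to simple break-even arithmetic: a one-time stitching cost is recovered once the per-run savings add up to it. The sketch below works through that arithmetic. The 10 GPU-hour cost and the 3.2× factor are the figures reported above; the baseline cost per run is a made-up input, and treating the reported speedup as a uniform reduction in GPU-hours per run is an assumption of this example, not a claim from the paper.

```python
import math


def break_even_runs(one_time_cost: float, baseline_cost_per_run: float, speedup: float) -> int:
    """Number of alignment runs after which the one-time cost is recovered."""
    if speedup <= 1.0:
        raise ValueError("no saving per run, so the one-time cost is never recovered")
    # A run that used to cost T now costs T / speedup, saving T * (1 - 1/speedup).
    saving_per_run = baseline_cost_per_run * (1.0 - 1.0 / speedup)
    return math.ceil(one_time_cost / saving_per_run)


# Hypothetical baseline: each run costs 2 GPU-hours without the stitched model.
runs = break_even_runs(one_time_cost=10.0, baseline_cost_per_run=2.0, speedup=3.2)
print(runs)  # 8 under these made-up inputs
```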

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stitching pattern could be applied to other latent generative models that need reward signals at intermediate noise levels.
Multiple pretrained reward models could be stitched in parallel to create composite value functions for combined objectives.
The reduced per-sample cost opens the possibility of running alignment loops inside interactive or real-time generation pipelines.
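The composite-objective idea above rests on linearity of conditional expectation. As a short worked equation, assuming the value function is defined as the conditional expectation of the clean-image reward, and writing $w_k$ for hypothetical mixing weights and $r_k$ for the individual rewards (both notations are introduced here for illustration):

```latex
\begin{align*}
r(z_0) &= \sum_{k} w_k \, r_k(z_0), \\
V_r(z_t) &= \mathbb{E}\left[ r(z_0) \mid z_t \right]
          = \sum_{k} w_k \, \mathbb{E}\left[ r_k(z_0) \mid z_t \right]
          = \sum_{k} w_k \, V_{r_k}(z_t).
\end{align*}
```

The identity holds for exact value functions. Separately stitched models are only approximations, so a weighted sum of them would carry the weighted sum of their individual errors.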

Load-bearing premise

The stitching procedure preserves the original reward capability while the frozen backbone supplies accurate value estimates specifically at noisy latents without any further training of the backbone.

What would settle it

Experiments that compare the stitched model's value estimates at intermediate noisy latents against high-fidelity Monte Carlo rollouts and find large systematic discrepancies, or that show no speedup when the model is plugged into DPS or DiffusionNFT.

Original abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.
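The bias-versus-cost trade-off in the abstract can be seen in a one-dimensional toy problem. The sketch below assumes a standard normal prior on the clean sample, additive Gaussian noise, and a quadratic reward, so that the posterior and the exact value are available in closed form. None of this setup comes from the paper. The Tweedie-style estimate evaluates the reward at the posterior mean, while the Monte Carlo estimate averages the reward over posterior samples, which stand in here for full rollouts.

```python
import random


def posterior_mean_var(z_t: float, sigma: float):
    """Posterior of x0 given z_t = x0 + sigma * eps, with x0 ~ N(0, 1)."""
    shrink = 1.0 / (1.0 + sigma ** 2)
    return shrink * z_t, sigma ** 2 * shrink


def reward(x0: float) -> float:
    return x0 * x0  # toy reward, nonlinear on purpose


def exact_value(z_t: float, sigma: float) -> float:
    mean, var = posterior_mean_var(z_t, sigma)
    return mean ** 2 + var  # E[x0^2 | z_t]


def tweedie_value(z_t: float, sigma: float) -> float:
    mean, _ = posterior_mean_var(z_t, sigma)
    return reward(mean)  # reward at the posterior mean: cheap, but misses the variance term


def monte_carlo_value(z_t: float, sigma: float, num_samples: int, rng: random.Random) -> float:
    mean, var = posterior_mean_var(z_t, sigma)
    std = var ** 0.5
    return sum(reward(rng.gauss(mean, std)) for _ in range(num_samples)) / num_samples


rng = random.Random(0)
z_t, sigma = 1.0, 1.0
print("exact      ", exact_value(z_t, sigma))    # 0.75
print("tweedie    ", tweedie_value(z_t, sigma))  # 0.25
print("monte carlo", monte_carlo_value(z_t, sigma, 20000, rng))
```

In this toy case the Tweedie-style estimate is off by exactly the posterior variance, while the Monte Carlo estimate converges to the exact value at the cost of many reward evaluations per noisy state.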

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

StitchVM stitches a clean-image reward model to a frozen diffusion backbone for value estimates on noisy latents and reports downstream speedups, but skips direct checks against Monte Carlo ground truth at intermediate steps.

The main point is that StitchVM builds a hybrid value model by attaching a frozen diffusion backbone to a truncated pixel-space reward model, then uses the result for alignment tasks on noisy latents. The stitching is lightweight and they show it cuts compute and memory in existing steering methods.

What is new is the specific construction that transfers the reward capability without retraining the backbone or relying on per-sample approximations like Tweedie or Monte Carlo rollouts. The hybrid keeps the pretrained reward strength from the pixel model while inheriting the backbone's native handling of noise. They report that stitching CLIP ViT-L to SD 3.5 Medium takes only 10 GPU-hours, which is a practical detail.

The paper does well on the application results. It gives concrete numbers: DPS becomes 3.2 times faster with halved peak GPU memory, and DiffusionNFT becomes 2.3 times faster. These gains matter for anyone running reward-based steering or post-training on diffusion models.

The soft spot is the missing direct validation. The abstract claims the stitched model gives the correct value function for noisy latents, yet there are no reported MSE or correlation numbers comparing its outputs to high-sample Monte Carlo estimates at fixed timesteps. Downstream task improvements are shown, but those do not confirm how closely the estimates match the conditional expectation of the clean reward. If the backbone is mostly supplying a point estimate, the advantage over prior approximations could be narrower than stated.

This paper is for people working on efficient diffusion alignment and reward-guided generation. Readers who need lower-cost ways to steer models at inference time will find the speed and memory figures useful. It deserves a serious referee because the stitching idea is concrete, the training cost is low, and the downstream evidence is there, even if targeted checks on value accuracy would make the central claim tighter. I would send it for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes StitchVM, a model-stitching framework that attaches a frozen diffusion backbone to a truncated pixel-space reward model to produce a value function for noisy latents. The central claim is that this hybrid retains robust reward capability from the pixel-space model while inheriting native handling of noisy states from the backbone, thereby constructing the correct value function once and amortizing it over alignment iterations. Downstream results report 3.2× speedup and halved memory for DPS, 2.3× speedup for DiffusionNFT, and a total stitching/finetuning cost of 10 GPU-hours.

Significance. If the hybrid indeed supplies accurate conditional expectations of clean-image rewards given noisy latents, the method would replace expensive per-sample approximations with a reusable, lightweight model and materially lower the barrier to reward-based diffusion alignment. The reported efficiency gains and broad applicability to both steering and post-training pipelines constitute a practical contribution.

major comments (2)

[Experiments] The manuscript provides no quantitative validation (MSE, rank correlation, or similar) of StitchVM value estimates against high-sample Monte Carlo rollouts at fixed intermediate timesteps t. Without this check, the claim that the hybrid computes the conditional expectation rather than a Tweedie-style point estimate remains unsubstantiated.
[§5] §5 (Downstream Alignment Experiments): the reported 3.2× and 2.3× speedups are presented without ablations, statistical significance tests, or controls for confounding factors such as implementation details or hyperparameter tuning, weakening support for the efficiency claims.
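The validation requested in the first major comment amounts to comparing two lists of numbers at a fixed timestep: the value model's outputs and reference Monte Carlo estimates. A minimal standard-library sketch of the two metrics named there follows. The function names are this example's own, and the four value pairs are made-up placeholders rather than results from the paper.

```python
import math
from typing import List, Sequence


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


# Made-up placeholder values for four noisy latents at one timestep.
stitched_values = [0.10, 0.40, 0.35, 0.80]
monte_carlo_values = [0.12, 0.38, 0.41, 0.75]
print(mean_squared_error(stitched_values, monte_carlo_values))
print(spearman(stitched_values, monte_carlo_values))
```

MSE measures calibration against the reference, while rank correlation only checks that the ordering of latents is preserved, which is often what guidance and search methods depend on.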

minor comments (1)

[Abstract] The abstract states that the stitching procedure is 'exceptionally lightweight' but does not specify the exact layers frozen, the loss used for the 10 GPU-hour finetuning stage, or the truncation point chosen for the pixel-space reward model.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and have incorporated revisions to strengthen the empirical support for our claims.

Referee: [Experiments] The manuscript provides no quantitative validation (MSE, rank correlation, or similar) of StitchVM value estimates against high-sample Monte Carlo rollouts at fixed intermediate timesteps t. Without this check, the claim that the hybrid computes the conditional expectation rather than a Tweedie-style point estimate remains unsubstantiated.

Authors: We agree that a direct quantitative comparison to high-sample Monte Carlo rollouts at fixed timesteps would provide stronger substantiation for the claim that StitchVM approximates the conditional expectation. The original submission emphasized downstream utility because Monte Carlo rollouts at intermediate noise levels are precisely the expensive computation we aim to amortize. In the revision we will add a targeted validation subsection reporting MSE and Spearman rank correlation between StitchVM outputs and 100-sample Monte Carlo estimates at several fixed t values (e.g., t=200, 400, 600) on a held-out prompt set, while keeping the added compute modest.

Revision: yes

Referee: [§5] §5 (Downstream Alignment Experiments): the reported 3.2× and 2.3× speedups are presented without ablations, statistical significance tests, or controls for confounding factors such as implementation details or hyperparameter tuning, weakening support for the efficiency claims.

Authors: The speedups were measured by executing the identical alignment pipelines (same random seeds, batch sizes, hardware, and hyper-parameters) with and without StitchVM. To strengthen the presentation we will expand §5 with (i) ablations across two additional reward models, (ii) mean and standard deviation of wall-clock time over five independent runs, and (iii) paired t-test p-values for the timing differences. We will also add a short paragraph detailing the exact implementation stack and hyper-parameter settings used for both baselines and StitchVM variants.

Revision: yes
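The timing analysis promised in this response (mean, standard deviation, and a paired test over five runs) can be sketched with the Python standard library. The wall-clock numbers below are invented placeholders, not measurements from the paper, and the helper name is this example's own. The sketch stops at the t statistic: turning it into a p-value needs the Student t distribution with n − 1 degrees of freedom, which the standard library does not provide (a statistics package such as SciPy offers `scipy.stats.ttest_rel` for that).

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], treated: Sequence[float]) -> float:
    """t statistic for paired samples, with n - 1 degrees of freedom."""
    diffs = [b - t for b, t in zip(baseline, treated)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Invented wall-clock times in minutes; run i of each list shares seed and hardware.
baseline_minutes = [20.0, 19.5, 20.5, 20.2, 19.8]
stitched_minutes = [10.0, 9.8, 10.3, 10.1, 9.9]

speedup = statistics.mean(baseline_minutes) / statistics.mean(stitched_minutes)
t_stat = paired_t_statistic(baseline_minutes, stitched_minutes)

print("baseline mean/std:", statistics.mean(baseline_minutes), statistics.stdev(baseline_minutes))
print("stitched mean/std:", statistics.mean(stitched_minutes), statistics.stdev(stitched_minutes))
print("speedup:", speedup)
print("paired t statistic:", t_stat)
```

Pairing matters here because runs that share a seed and machine also share most of their run-to-run noise, so the test is run on the per-pair differences rather than on the two groups separately.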

Circularity Check

0 steps flagged

StitchVM constructs hybrid value model via stitching of pretrained components; no circular reduction to inputs

The paper proposes StitchVM as a stitching framework that attaches a frozen diffusion backbone to a truncated pixel-space reward model, followed by lightweight finetuning (e.g., 10 GPU-hours for CLIP ViT-L + SD 3.5 Medium). This transfers existing reward capability to noisy latents without deriving the value function from fitted parameters, self-referential equations, or load-bearing self-citations. The central claim, that the hybrid yields the correct value function for noisy latents, is presented as an engineering construction amortized over samples, not a mathematical identity or prediction forced by prior fits. No steps reduce by construction to the paper's own inputs; downstream speedups (DPS 3.2×, DiffusionNFT 2.3×) are empirical outcomes rather than tautological results. The derivation remains self-contained against external pretrained models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on the domain assumption that pretrained clean-image reward models can be hybridized with diffusion backbones while retaining capability; no new free parameters or invented entities are introduced in the abstract description.

axioms (1)

Domain assumption: A truncated pixel-space reward model retains its core reward judgment capability when a frozen diffusion backbone is attached as its head.
This premise is required for the hybrid to function as a value model for noisy latents.

pith-pipeline@v0.9.0 · 5891 in / 1337 out tokens · 54570 ms · 2026-05-20T06:25:51.094586+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head... $V^{(i^*,j^*)}_\omega(z_t) = r^{\geq j}_\phi\big(s_\psi(u^{\leq i}_\theta(z_t))\big)$
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

$\mathcal{L}_{\text{value}}(\omega) = \mathbb{E}\left[ \left| V^{(i^*,j^*)}_\omega(z_t) - r_\phi(z_0) \right|^2 \right]$
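A short worked derivation shows why a squared-error regression of this form targets a conditional expectation rather than a point estimate. It assumes the expectation is taken over jointly sampled pairs $(z_0, z_t)$. The page does not say how those pairs are drawn, and $V^{*}$ is notation introduced here for illustration.

```latex
\begin{align*}
V^{*}(z_t) &:= \mathbb{E}\left[ r_\phi(z_0) \mid z_t \right], \\
\mathbb{E}\left[ \left| V(z_t) - r_\phi(z_0) \right|^2 \right]
  &= \mathbb{E}\left[ \left| V(z_t) - V^{*}(z_t) \right|^2 \right]
   + \mathbb{E}\left[ \operatorname{Var}\left( r_\phi(z_0) \mid z_t \right) \right].
\end{align*}
```

The cross term vanishes by the tower property, and the second term does not depend on $V$, so the loss is minimized exactly when $V = V^{*}$. Whether a finite, partly frozen network reaches that minimizer is an empirical question, which is the point the referee report raises.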

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages

[1]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

work page 2020
[2]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InICML, 2015

work page 2015
[3]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InICLR, 2021

work page 2021
[4]

Addressing negative transfer in diffusion models

Hyojun Go, Kim Kim, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, and Seungtaek Choi. Addressing negative transfer in diffusion models. InNeurIPS, volume 36, pages 27199–27222, 2023

work page 2023
[5]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023

work page 2023
[6]

Building normalizing flows with stochastic inter- polants

Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic inter- polants. InICLR, 2023

work page 2023
[7]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023

work page 2023
[8]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX. 1 Kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint, 2025. 11 Stitched Value Model for Diffusion Alignment

work page 2025
[9]

Photorealistic text-to- image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to- image diffusion models with deep language understanding. InNeurIPS, volume 35, 2022

work page 2022
[10]

Qwen-image technical report.arXiv preprint, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint, 2025

work page 2025
[11]

Wan: Open and advanced large-scale video generative models.arXiv preprint, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint, 2025

work page 2025
[12]

Video models are zero-shot learners and reasoners.arXiv preprint, 2025

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners.arXiv preprint, 2025

work page 2025
[13]

Video understanding: From geometry and semantics to unified models

Zhaochong An, Zirui Li, Mingqiao Ye, Feng Qiao, Jiaang Li, Zongwei Wu, Vishal Thengane, Chengzu Li, Lei Li, Luc Van Gool, et al. Video understanding: From geometry and semantics to unified models. Machine Intelligence Research, 2026

work page 2026
[14]

OneStory: Coherent multi-shot video generation with adaptive memory.CVPR, 2026

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. OneStory: Coherent multi-shot video generation with adaptive memory.CVPR, 2026

work page 2026
[15]

Text-to-3D by stitching a multi-view reconstruction network to a video generator

Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, and Konrad Schindler. Text-to-3D by stitching a multi-view reconstruction network to a video generator. InICLR, 2026

work page 2026
[16]

Splatflow: Multi-view rectified flow model for 3d gaussian splatting synthesis

Hyojun Go, Byeongjun Park, Jiho Jang, Jin-Young Kim, Soonwoo Kwon, and Changick Kim. Splatflow: Multi-view rectified flow model for 3d gaussian splatting synthesis. InCVPR, 2025

work page 2025
[17]

Videorfsplat: Direct scene-level text-to-3d gaussian splatting generation with flexible pose and multi- view joint modeling

Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, and Changick Kim. Videorfsplat: Direct scene-level text-to-3d gaussian splatting generation with flexible pose and multi- view joint modeling. InICCV, 2025

work page 2025
[18]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023

work page 2023
[19]

Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. InCVPR, 2025

work page 2025
[20]

Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint, 2023

work page 2023
[21]

Aligning text-to-image diffusion models with reward backpropagation.arXiv preprint, 2023

Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation.arXiv preprint, 2023

work page 2023
[22]

Kevin Clark, Paul Vicol, Kevin Swersky, and David J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. InICLR, 2024

work page 2024
[23]

Aligning text-to-image models using human feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mo- hammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint, 2023

work page 2023
[24]

RAFT: Reward ranked finetuning for generative foundation model alignment.Transactions on Machine Learning Research, 2023

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment.Transactions on Machine Learning Research, 2023

work page 2023
[25]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InCVPR, 2024. 12 Stitched Value Model for Diffusion Alignment

work page 2024
[26]

Using human feedback to fine-tune diffusion models without any reward model

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InCVPR, 2024

work page 2024
[27]

Improving video generation with human feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di ZHANG, Kun Gai, Yujiu Yang, and Wanli Ouyang. Improving video generation with human feedback. InNeurIPS, 2026

work page 2026
[28]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InICLR, 2024

work page 2024
[29]

Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Moham- mad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. InNeurIPS, 2023

work page 2023
[30]

DiffusionNFT: Online diffusion reinforcement with forward process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. InICLR, 2026

work page 2026
[31]

Diffusion posterior sampling for general noisy inverse problems

Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. InICLR, 2023

work page 2023
[32]

Loss-guided diffusion models for plug-and-play controllable generation

Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation. InICML, 2023

work page 2023
[33]

TFG: Unified training-free guidance for diffusion models

Haotian Ye, Haowei Lin, Jiaqi Han, Minkai Xu, Sheng Liu, Yitao Liang, Jianzhu Ma, James Zou, and Stefano Ermon. TFG: Unified training-free guidance for diffusion models. InNeurIPS, 2024

work page 2024
[34]

FreeDoM: Training-free energy-guided conditional diffusion model

Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, and Jian Zhang. FreeDoM: Training-free energy-guided conditional diffusion model. InICCV, 2023

work page 2023
[35]

Manifold preserving guided diffusion

Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J Zico Kolter, Ruslan Salakhutdinov, and Stefano Ermon. Manifold preserving guided diffusion. InICLR, 2024

work page 2024
[36]

FlowDPS: Flow-driven posterior sampling for inverse problems

Jeongsol Kim, Bryan Sangwoo Kim, and Jong Chul Ye. FlowDPS: Flow-driven posterior sampling for inverse problems. InICCV, 2025

work page 2025
[37]

Pseudoinverse-guided diffusion models for inverse problems

Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. InICLR, 2023

work page 2023
[38]

A general framework for inference-time scaling and steering of diffusion models

Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. InICML, 2025

work page 2025
[39]

Inference-time scaling for flow models via stochastic generation and rollover budget forcing

Jaihoon Kim, TaeHoon Yoon, Jisung Hwang, and Minhyuk Sung. Inference-time scaling for flow models via stochastic generation and rollover budget forcing. InNeurIPS, 2026

work page 2026
[40]

Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding.arXiv preprint, 2024

Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Aviv Regev, Sergey Levine, and Masatoshi Uehara. Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding.arXiv preprint, 2024

work page 2024
[41]

Trippe, Christian A Naesseth, John Patrick Cunningham, and David Blei

Luhuan Wu, Brian L. Trippe, Christian A Naesseth, John Patrick Cunningham, and David Blei. Practical and asymptotically exact conditional sampling in diffusion models. InNeurIPS, 2023

work page 2023
[42]

Test-time alignment of diffusion models without reward over-optimization

Sunwoo Kim, Minkyu Kim, and Dongmin Park. Test-time alignment of diffusion models without reward over-optimization. InICLR, 2025

work page 2025
[43]

Dynamic search for inference-time alignment in diffusion models.arXiv preprint, 2025

Xiner Li, Masatoshi Uehara, Xingyu Su, Gabriele Scalia, Tommaso Biancalani, Aviv Regev, Sergey Levine, and Shuiwang Ji. Dynamic search for inference-time alignment in diffusion models.arXiv preprint, 2025. 13 Stitched Value Model for Diffusion Alignment

work page 2025
[44]

Inference-time scaling of diffusion models through classical search

XiangCheng Zhang, Haowei Lin, Haotian Ye, James Zou, Jianzhu Ma, Yitao Liang, and Yilun Du. Inference-time scaling of diffusion models through classical search. InNeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling, 2025

work page 2025
[45]

Feynman-Kac correctors in diffusion: Annealing, guidance, and product of experts

Marta Skreta, Tara Akhound-Sadegh, Viktor Ohanesian, Roberto Bondesan, Alan Aspuru-Guzik, Arnaud Doucet, Rob Brekelmans, Alexander Tong, and Kirill Neklyudov. Feynman-Kac correctors in diffusion: Annealing, guidance, and product of experts. InICML, 2025

work page 2025
[46]

Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review.arXiv preprint, 2025

Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, and Tommaso Biancalani. Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review.arXiv preprint, 2025

work page 2025
[47]

ImageReward: Learning and evaluating human preferences for text-to-image generation.arXiv preprint, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation.arXiv preprint, 2023

work page 2023
[48]

Unified reward model for multimodal understanding and generation.arXiv preprint, 2025

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint, 2025

work page 2025
[49]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, 2021

work page 2021
[50]

HPSv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. InICCV, 2025

work page 2025
[51]

Tweedie’s formula and selection bias.Journal of the American Statistical Association, 106 (496):1602–1614, 2011

Bradley Efron. Tweedie’s formula and selection bias.Journal of the American Statistical Association, 106 (496):1602–1614, 2011

work page 2011
[52]

Think twice before you act: Improving inverse problem solving with MCMC.arXiv preprint, 2024

Yaxuan Zhu, Zehao Dou, Haoxin Zheng, Yasi Zhang, Ying Nian Wu, and Ruiqi Gao. Think twice before you act: Improving inverse problem solving with MCMC.arXiv preprint, 2024

work page 2024
[53]

VARD: Efficient and dense fine-tuning for diffusion models with value-based RL.arXiv preprint, 2025

Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, and Fajie Yuan. VARD: Efficient and dense fine-tuning for diffusion models with value-based RL.arXiv preprint, 2025

work page 2025
[54]

Beyond VLM-based rewards: Diffusion-native latent reward modeling.arXiv preprint, 2026

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, et al. Beyond VLM-based rewards: Diffusion-native latent reward modeling.arXiv preprint, 2026

work page 2026
[55]

Video generation models are good latent reward models.arXiv preprint, 2025

Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, et al. Video generation models are good latent reward models.arXiv preprint, 2025

work page 2025
[56]

Critic-guided reinforcement unlearning in text-to-image diffusion.arXiv preprint, 2026

Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, Taras Rumezhak, and Volodymyr Karpiv. Critic-guided reinforcement unlearning in text-to-image diffusion.arXiv preprint, 2026

work page 2026
[57]

Xiaole Xian, Xilin He, Wenting Chen, Wenshuang Liu, Wenqi Mu, Yancheng He, Liang Li, Yi Zhang, and Xiangyu Yue. Consistent noisy latent rewards for trajectory preference optimization in diffusion models. In ICLR, 2026
[58]

Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, and Chunhong Pan. Diffusion model as a noise-aware latent reward model for step-level preference optimization. In NeurIPS, 2026
[59]

Ziyi Zhang, Sen Zhang, Yibing Zhan, Yong Luo, Yonggang Wen, and Dacheng Tao. Confronting reward overoptimization for diffusion models: A perspective of inductive and primacy biases. In ICML, 2024
[60]

Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, and Ioannis Patras. LatSearch: Latent reward-guided search for faster inference-time scaling in video diffusion. arXiv preprint, 2026
[61]

Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015
[62]

Adrián Csiszárik, Péter Kőrösi-Szabó, Akos Matszangosz, Gergely Papp, and Dániel Varga. Similarity and matching of neural network representations. In NeurIPS, 2021
[63]

Xingyi Yang, Daquan Zhou, Songhua Liu, Jingwen Ye, and Xinchao Wang. Deep model reassembly. In NeurIPS, 2022
[64]

Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. In NeurIPS, 2021
[65]

Zizheng Pan, Jianfei Cai, and Bohan Zhuang. Stitchable neural networks. In CVPR, 2023
[66]

Kyungmin Lee, Sihyun Yu, and Jinwoo Shin. Decoupled meanflow: Turning flow models into flow maps for accelerated sampling. arXiv preprint, 2025
[67]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024
[68]

Stability AI. Stable Diffusion 3.5. https://github.com/Stability-AI/sd3.5, 2024. Official inference repository for Stable Diffusion 3.5 Large, Large Turbo, and Medium. Accessed: 2026-04-27
[69]

Black Forest Labs. FLUX.1 [dev]. https://github.com/black-forest-labs/flux, 2024. Official inference repository for FLUX.1 open-weight models. Accessed: 2026-04-27
[70]

Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander T Toshev, and Vaishaal Shankar. Data filtering networks. In ICLR, 2024
[71]

Christoph Schuhmann. CLIP+MLP aesthetic score predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor, 2022. Train, use, and visualize an aesthetic score predictor based on a neural network taking CLIP embeddings as input. Accessed: 2026-04-27
[72]

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In CVPR, 2023
[73]

Xu Han, Caihua Shan, Yifei Shen, Can Xu, Han Yang, Xiang Li, and Dongsheng Li. Training-free multi-objective diffusion model for 3D molecule generation. In ICLR, 2024
[74]

Xiaoshi Wu, Yiming Hao, Manyuan Zhang, Keqiang Sun, Zhaoyang Huang, Guanglu Song, Yu Liu, and Hongsheng Li. Deep reward supervisions for tuning text-to-image diffusion models. In ECCV, 2024
[75]

Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, and Zicheng Liu. Training diffusion models towards diverse image generation with reinforcement learning. In CVPR, 2024
[76]

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. In NeurIPS, 2026
[77]

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint, 2025
[78]

Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. Tiny inference-time scaling with latent verifiers. arXiv preprint, 2026
[79]

Vasco Ramos, Regev Cohen, Idan Szpektor, and Joao Magalhaes. Beyond the Noise: Aligning prompts with latent representations in diffusion models. arXiv preprint, 2025
[80]

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In ICML, 2019

