Stitched Value Model for Diffusion Alignment
Pith reviewed 2026-05-20 06:25 UTC · model grok-4.3
The pith
StitchVM stitches pixel-space reward models with frozen diffusion backbones to estimate values directly at noisy latents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StitchVM starts from an existing truncated pixel-space reward model and attaches a frozen diffusion backbone as its head. The hybrid keeps the pretrained reward capability from the pixel model and gains the backbone's native handling of noisy latents. Stitching and finetuning are lightweight, taking only 10 GPU-hours for CLIP ViT-L with SD 3.5 Medium. The approach lets the correct value function for actual noisy latents be built once and then reused over many samples and iterations instead of relying on rough per-sample approximations.
What carries the argument
The stitched hybrid model that combines a truncated pixel-space reward model with a frozen diffusion backbone as its head.
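The composition can be pictured with a minimal sketch. Everything here is a hypothetical stand-in: the class name `StitchedValueModel`, the three toy functions, and the choice to pass the timestep to the backbone are assumptions for illustration, not the paper's implementation. The only thing the sketch mirrors is the order of composition: backbone features of the noisy latent, then an adapter, then the remaining layers of the reward model.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class StitchedValueModel:
    """Toy composition: backbone features -> stitch layer -> truncated reward model."""
    backbone_features: Callable[[Vector, float], Vector]  # frozen; sees the noisy latent
    stitch: Callable[[Vector], Vector]                     # small adapter between feature spaces
    reward_tail: Callable[[Vector], float]                 # later layers of the reward model

    def __call__(self, z_t: Vector, t: float) -> float:
        features = self.backbone_features(z_t, t)
        adapted = self.stitch(features)
        return self.reward_tail(adapted)


# Hypothetical stand-ins so the sketch runs end to end.
def toy_backbone(z_t: Vector, t: float) -> Vector:
    # Pretend the backbone rescales the latent according to the noise level.
    return [x * (1.0 - t) for x in z_t]


def toy_stitch(features: Vector) -> Vector:
    # Affine map from backbone features to reward-model features.
    return [2.0 * f + 0.5 for f in features]


def toy_reward_tail(features: Vector) -> float:
    # Pretend reward: the mean of the adapted features.
    return sum(features) / len(features)


value_model = StitchedValueModel(toy_backbone, toy_stitch, toy_reward_tail)
value = value_model([1.0, 3.0], t=0.5)
```

In this toy, a single forward call returns a value for the noisy latent, which is the property the amortization argument depends on. Which components are trainable in the real system is not specified on this page.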
If this is right
- DPS alignment runs 3.2 times faster while halving peak GPU memory.
- DiffusionNFT runs 2.3 times faster.
- Value functions are constructed once and then amortized across many samples and alignment iterations.
- The same stitching method yields improvements across a broad range of downstream steering and post-training techniques.
Where Pith is reading between the lines
- The same stitching pattern could be applied to other latent generative models that need reward signals at intermediate noise levels.
- Multiple pretrained reward models could be stitched in parallel to create composite value functions for combined objectives.
- The reduced per-sample cost opens the possibility of running alignment loops inside interactive or real-time generation pipelines.
Load-bearing premise
The stitching procedure preserves the original reward capability while the frozen backbone supplies accurate value estimates specifically at noisy latents without any further training of the backbone.
What would settle it
Experiments that compare the stitched model's value estimates at intermediate noisy latents against high-fidelity Monte Carlo rollouts and find large systematic discrepancies, or that show no speedup when the model is plugged into DPS or DiffusionNFT.
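A check of this kind could be scored as in the sketch below, which computes mean squared error and Spearman rank correlation between a value model's outputs and Monte Carlo reference estimates. The function names and the four sample numbers are hypothetical and are not results from the paper; only the Python standard library is used.

```python
from typing import List, Sequence


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(pred: Sequence[float], target: Sequence[float]) -> float:
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(pred), average_ranks(target))


# Hypothetical numbers for four noisy latents at one fixed timestep.
value_model_outputs = [0.10, 0.40, 0.35, 0.80]
monte_carlo_estimates = [0.12, 0.38, 0.41, 0.75]

mse = mean_squared_error(value_model_outputs, monte_carlo_estimates)
rho = spearman(value_model_outputs, monte_carlo_estimates)
```

A low MSE together with a high rank correlation at each tested timestep would support the premise; a systematic gap would count against it.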
Original abstract
For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.
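The bias-versus-cost trade-off described in the abstract can be seen in a one-dimensional toy problem where the exact answer is known. The setup is an assumption made for illustration and does not come from the paper: a clean sample z0 drawn from N(0, 1), a noisy observation z_t = z0 + sigma * eps, and a reward r(z0) = z0^2. Under these assumptions the posterior of z0 given z_t is Gaussian, so the value E[r(z0) | z_t] has a closed form. A Tweedie-style estimate applies the reward to the posterior mean, while a Monte Carlo estimate averages the reward over posterior samples.

```python
import random
from typing import Tuple


def posterior_mean_and_variance(z_t: float, sigma: float) -> Tuple[float, float]:
    # For z0 ~ N(0, 1) and z_t = z0 + sigma * eps, the posterior of z0 is Gaussian.
    shrink = 1.0 / (1.0 + sigma ** 2)
    return shrink * z_t, sigma ** 2 * shrink


def reward(z0: float) -> float:
    # Toy reward, chosen to be nonlinear so the two estimators differ.
    return z0 ** 2


def exact_value(z_t: float, sigma: float) -> float:
    # E[z0^2 | z_t] = (posterior mean)^2 + posterior variance.
    mean, var = posterior_mean_and_variance(z_t, sigma)
    return mean ** 2 + var


def tweedie_value(z_t: float, sigma: float) -> float:
    # Reward evaluated at the posterior mean: one evaluation, but biased.
    mean, _ = posterior_mean_and_variance(z_t, sigma)
    return reward(mean)


def monte_carlo_value(z_t: float, sigma: float, num_samples: int,
                      rng: random.Random) -> float:
    # Average reward over posterior samples: unbiased, but needs many evaluations.
    mean, var = posterior_mean_and_variance(z_t, sigma)
    std = var ** 0.5
    return sum(reward(rng.gauss(mean, std)) for _ in range(num_samples)) / num_samples


z_t, sigma = 1.0, 1.0
exact = exact_value(z_t, sigma)      # 0.75
tweedie = tweedie_value(z_t, sigma)  # 0.25
mc = monte_carlo_value(z_t, sigma, 20000, random.Random(0))
```

In this toy the Tweedie-style estimate misses the posterior variance term entirely, while the Monte Carlo estimate approaches the exact value only as the sample count grows. In a real diffusion model each posterior sample would require a rollout, which is the cost a learned value function is meant to avoid.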
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes StitchVM, a model-stitching framework that attaches a frozen diffusion backbone to a truncated pixel-space reward model to produce a value function for noisy latents. The central claim is that this hybrid retains robust reward capability from the pixel-space model while inheriting native handling of noisy states from the backbone, thereby constructing the correct value function once and amortizing it over alignment iterations. Downstream results report 3.2× speedup and halved memory for DPS, 2.3× speedup for DiffusionNFT, and a total stitching/finetuning cost of 10 GPU-hours.
Significance. If the hybrid indeed supplies accurate conditional expectations of clean-image rewards given noisy latents, the method would replace expensive per-sample approximations with a reusable, lightweight model and materially lower the barrier to reward-based diffusion alignment. The reported efficiency gains and broad applicability to both steering and post-training pipelines constitute a practical contribution.
major comments (2)
- [Experiments] The manuscript provides no quantitative validation (MSE, rank correlation, or similar) of StitchVM value estimates against high-sample Monte Carlo rollouts at fixed intermediate timesteps t. Without this check, the claim that the hybrid computes the conditional expectation rather than a Tweedie-style point estimate remains unsubstantiated.
- [§5] §5 (Downstream Alignment Experiments): the reported 3.2× and 2.3× speedups are presented without ablations, statistical significance tests, or controls for confounding factors such as implementation details or hyperparameter tuning, weakening support for the efficiency claims.
minor comments (1)
- [Abstract] The abstract states that the stitching procedure is 'exceptionally lightweight' but does not specify the exact layers frozen, the loss used for the 10 GPU-hour finetuning stage, or the truncation point chosen for the pixel-space reward model.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and have incorporated revisions to strengthen the empirical support for our claims.
- Referee: [Experiments] The manuscript provides no quantitative validation (MSE, rank correlation, or similar) of StitchVM value estimates against high-sample Monte Carlo rollouts at fixed intermediate timesteps t. Without this check, the claim that the hybrid computes the conditional expectation rather than a Tweedie-style point estimate remains unsubstantiated.
  Authors: We agree that a direct quantitative comparison to high-sample Monte Carlo rollouts at fixed timesteps would provide stronger substantiation for the claim that StitchVM approximates the conditional expectation. The original submission emphasized downstream utility because Monte Carlo rollouts at intermediate noise levels are precisely the expensive computation we aim to amortize. In the revision we will add a targeted validation subsection reporting MSE and Spearman rank correlation between StitchVM outputs and 100-sample Monte Carlo estimates at several fixed t values (e.g., t=200, 400, 600) on a held-out prompt set, while keeping the added compute modest.
  Revision: yes
- Referee: [§5] §5 (Downstream Alignment Experiments): the reported 3.2× and 2.3× speedups are presented without ablations, statistical significance tests, or controls for confounding factors such as implementation details or hyperparameter tuning, weakening support for the efficiency claims.
  Authors: The speedups were measured by executing the identical alignment pipelines (same random seeds, batch sizes, hardware, and hyperparameters) with and without StitchVM. To strengthen the presentation we will expand §5 with (i) ablations across two additional reward models, (ii) mean and standard deviation of wall-clock time over five independent runs, and (iii) paired t-test p-values for the timing differences. We will also add a short paragraph detailing the exact implementation stack and hyperparameter settings used for both baselines and StitchVM variants.
  Revision: yes
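The timing analysis promised in the rebuttal (mean, standard deviation, and a paired test over five runs) could be computed along the following lines. The wall-clock numbers are hypothetical placeholders, not measurements from the paper, and the helper names are illustrative. The sketch stops at the t statistic; turning it into a p-value needs a Student t distribution from a statistics library, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def speedup(baseline: Sequence[float], treatment: Sequence[float]) -> float:
    # Ratio of mean wall-clock times; values above 1 mean the treatment is faster.
    return mean(baseline) / mean(treatment)


def paired_t_statistic(baseline: Sequence[float], treatment: Sequence[float]) -> float:
    # Paired test: each run index uses the same seed and hardware in both arms.
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [b - t for b, t in zip(baseline, treatment)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Hypothetical wall-clock seconds for five paired runs.
baseline_seconds = [100.0, 104.0, 98.0, 102.0, 96.0]
stitched_seconds = [50.0, 51.0, 50.0, 49.0, 50.0]

ratio = speedup(baseline_seconds, stitched_seconds)
t_stat = paired_t_statistic(baseline_seconds, stitched_seconds)
summary = {
    "baseline_mean": mean(baseline_seconds),
    "baseline_std": stdev(baseline_seconds),
    "stitched_mean": mean(stitched_seconds),
    "stitched_std": stdev(stitched_seconds),
}
```

With five pairs the test has four degrees of freedom, so the t statistic would be compared against a Student t distribution with four degrees of freedom to obtain the p-value.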
Circularity Check
StitchVM constructs hybrid value model via stitching of pretrained components; no circular reduction to inputs
The paper proposes StitchVM as a stitching framework that attaches a frozen diffusion backbone to a truncated pixel-space reward model, followed by lightweight finetuning (e.g., 10 GPU-hours for CLIP ViT-L with SD 3.5 Medium). This transfers existing reward capability to noisy latents without deriving the value function from fitted parameters, self-referential equations, or load-bearing self-citations. The central claim, that the hybrid yields the correct value function for noisy latents, is presented as an engineering construction amortized over samples, not as a mathematical identity or a prediction forced by prior fits. No steps reduce by construction to the paper's own inputs; the downstream speedups (DPS 3.2×, DiffusionNFT 2.3×) are empirical outcomes rather than tautological results. The derivation remains self-contained against external pretrained models.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A truncated pixel-space reward model retains its core reward judgment capability when a frozen diffusion backbone is attached as its head.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head... $V^{(i^*,j^*)}_\omega(z_t) = r^{\geq j}_\phi\big(s_\psi(u^{\leq i}_\theta(z_t))\big)$
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $\mathcal{L}_{\mathrm{value}}(\omega) = \mathbb{E}\big[\,\lvert V^{(i^*,j^*)}_\omega(z_t) - r_\phi(z_0)\rvert^2\,\big]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.