Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3
The pith
DiNa-LRM learns preference rewards directly on noisy latent diffusion states, matching VLM results at lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiNa-LRM formulates preference learning directly on noisy diffusion states in latent space. It combines a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, a timestep-conditioned reward head on a pretrained latent diffusion backbone, and inference-time noise ensembling. Together these are meant to provide efficient and robust rewards for aligning diffusion models without VLM guidance or pixel-space supervision.
What carries the argument
Noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, applied through a timestep-conditioned reward head on noisy latent states.
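As a rough illustration of this ingredient, the sketch below evaluates a Thurstone-style pairwise likelihood in which each reward carries Gaussian uncertainty that grows with the noise level. The function names, the linear form of `noise_sigma`, and the constants `base` and `scale` are assumptions made for illustration only; the text reviewed here does not give the paper's actual calibration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def noise_sigma(t: float, base: float = 0.5, scale: float = 1.0) -> float:
    """Hypothetical uncertainty schedule: std grows linearly with noise level t in [0, 1]."""
    return base + scale * t


def thurstone_prob(r_a: float, r_b: float, t_a: float, t_b: float) -> float:
    """P(a preferred over b) when each score has independent Gaussian noise."""
    var = noise_sigma(t_a) ** 2 + noise_sigma(t_b) ** 2
    return normal_cdf((r_a - r_b) / math.sqrt(var))


def thurstone_nll(r_win: float, r_lose: float, t_win: float, t_lose: float) -> float:
    """Negative log-likelihood of the observed preference (winner over loser)."""
    p = thurstone_prob(r_win, r_lose, t_win, t_lose)
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)
```

Under this assumed schedule, the same reward gap yields a less confident preference probability at higher noise levels, which is the qualitative behaviour a noise-dependent uncertainty term is expected to have.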
If this is right
- DiNa-LRM outperforms existing diffusion-based reward baselines across image alignment benchmarks.
- It reaches performance competitive with state-of-the-art VLMs at a fraction of the computational cost.
- Preference optimization using DiNa-LRM improves alignment dynamics and reduces overall resource use.
- Inference-time noise ensembling supplies a built-in mechanism for test-time scaling of the reward signal.
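The noise-ensembling idea in the last bullet can be sketched as averaging a reward over several independent noise draws at a fixed noise level. This is a minimal sketch, not the paper's procedure: the interpolation `x_t = (1 - t) * x0 + t * eps` is an assumed rectified-flow style schedule, and `toy_head` is a hypothetical stand-in for a timestep-conditioned reward head.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Latent = List[float]


def add_noise(x0: Sequence[float], t: float, rng: random.Random) -> Latent:
    """Assumed schedule: x_t = (1 - t) * x0 + t * eps with eps ~ N(0, 1)."""
    return [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in x0]


def ensembled_reward(
    reward_head: Callable[[Latent, float], float],
    x0: Sequence[float],
    t: float,
    num_samples: int = 8,
    seed: int = 0,
) -> float:
    """Average the reward over num_samples independent noise draws at level t."""
    rng = random.Random(seed)
    return mean(reward_head(add_noise(x0, t, rng), t) for _ in range(num_samples))


def toy_head(x_t: Latent, t: float) -> float:
    """Hypothetical head: mean latent value, rescaled by 1 / (1 - t). Requires t < 1."""
    return sum(x_t) / len(x_t) / (1.0 - t)
```

Raising `num_samples` trades extra forward passes for a lower-variance reward estimate, which is the sense in which ensembling acts as a test-time scaling knob.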
Where Pith is reading between the lines
- The latent formulation may allow reward signals to be computed inside the same forward pass used for generation, removing an external reward call entirely.
- Timestep conditioning opens the possibility that different denoising stages could receive differently calibrated preferences without retraining.
- If the noise-dependent uncertainty model transfers, similar heads could be attached to flow-matching backbones with minimal change.
Load-bearing premise
Preference learning directly on noisy diffusion states with a noise-calibrated Thurstone likelihood and diffusion-noise-dependent uncertainty is sufficient to eliminate domain mismatch and produce reliable preference signals without additional pixel-space supervision or external VLM guidance.
What would settle it
An image alignment benchmark result in which DiNa-LRM requires comparable or greater compute than a VLM to reach equivalent preference accuracy would falsify the efficiency claim.
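One way to make this test concrete is to compute pairwise preference accuracy for each reward model and compare it against measured cost. The sketch below is illustrative only: the helper names, the tie-handling rule, and the `acc_tolerance` threshold are assumptions, not the paper's evaluation protocol.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(scored_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred item (first score) gets the
    strictly higher reward. Ties count as misses (an assumed convention)."""
    if not scored_pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for preferred, rejected in scored_pairs if preferred > rejected)
    return hits / len(scored_pairs)


def efficiency_claim_survives(
    acc_model: float,
    acc_vlm: float,
    cost_model: float,
    cost_vlm: float,
    acc_tolerance: float = 0.01,  # hypothetical margin for "equivalent" accuracy
) -> bool:
    """True if the model is within tolerance of the VLM's accuracy at strictly lower cost."""
    return acc_model >= acc_vlm - acc_tolerance and cost_model < cost_vlm
```

Any consistent cost unit works here (FLOPs, GPU-seconds, peak memory), as long as both models are measured the same way on the same benchmark.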
Original abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DiNa-LRM, a diffusion-native latent reward model for preference optimization of diffusion and flow-matching models. It formulates preference learning directly on noisy latent diffusion states via a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, using a pretrained latent diffusion backbone plus a lightweight timestep-conditioned reward head, and supports inference-time noise ensembling. The central claims are that DiNa-LRM substantially outperforms existing diffusion-based reward baselines, achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost across image alignment benchmarks, and improves preference optimization dynamics.
Significance. If the empirical claims hold with rigorous validation, the work could be significant for enabling more efficient and domain-consistent reward modeling in diffusion alignment, reducing reliance on expensive pixel-space VLMs while introducing a native mechanism for test-time scaling via noise ensembling.
major comments (1)
- [Abstract] The central claims of substantial outperformance over diffusion-based reward baselines and competitiveness with VLMs at a fraction of the cost are stated without any quantitative results, baseline details, statistical tests, ablation studies, or experimental setup. This absence makes the primary performance assertions impossible to verify, and those assertions are load-bearing for the contribution.
minor comments (2)
- The definition and derivation of the noise-calibrated Thurstone likelihood and its diffusion-noise-dependent uncertainty term should be presented with explicit equations and variable definitions to allow reproducibility.
- Clarify whether the timestep-conditioned reward head introduces additional free parameters beyond the listed ones and how they are trained.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater specificity in the abstract. We address this point directly below.
Point-by-point responses
- Referee: [Abstract] The central claims of substantial outperformance over diffusion-based reward baselines and competitiveness with VLMs at a fraction of the cost are stated without any quantitative results, baseline details, statistical tests, ablation studies, or experimental setup. This absence makes the primary performance assertions impossible to verify, and those assertions are load-bearing for the contribution.
  Authors: We agree that the abstract would be stronger with concrete quantitative anchors. In the revised manuscript we will insert the key reported metrics (e.g., win-rate deltas versus diffusion baselines and VLM cost ratios on the primary alignment benchmarks), name the main baselines, and briefly note the evaluation protocol, all within standard abstract length limits. The full experimental details, including statistical tests, ablations, and setup, remain in Sections 4 and 5 as before.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces DiNa-LRM as an independent architectural proposal: preference learning formulated directly on noisy latent diffusion states via a noise-calibrated Thurstone likelihood with timestep-dependent uncertainty, implemented with a pretrained diffusion backbone plus lightweight reward head. No equations, performance claims, or central premises in the provided text reduce to quantities fitted from the evaluation data itself, nor do they rest on self-citation chains or imported uniqueness results. The efficiency and performance arguments follow directly from the stated construction without circular reduction. This is the common case of a self-contained method proposal.
Axiom & Free-Parameter Ledger
free parameters (1)
- timestep-conditioned reward head weights
axioms (1)
- standard math: Thurstone model for modeling pairwise preferences
invented entities (1)
- noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty: no independent evidence
Forward citations
Cited by 1 Pith paper
- KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
  KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.