
arxiv: 2602.11146 · v2 · pith:TXFCXWZ4new · submitted 2026-02-11 · 💻 cs.CV · cs.AI

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords diffusion reward model · latent preference learning · noise-calibrated Thurstone · image alignment · preference optimization · VLM alternative · diffusion models

The pith

DiNa-LRM learns preference rewards directly on noisy latent diffusion states, matching VLM results at lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models supply rewards for aligning diffusion generators but incur high compute costs and introduce a mismatch between pixel rewards and latent generation. DiNa-LRM instead trains a reward head on a pretrained latent diffusion backbone, operating directly on noisy states at varying timesteps. It replaces standard likelihoods with a noise-calibrated Thurstone model whose uncertainty scales with diffusion noise level. The same head supports noise ensembling at inference time. On image alignment benchmarks the resulting signals outperform prior diffusion reward methods and reach parity with leading VLMs while using far less computation during both reward evaluation and subsequent preference optimization.

Core claim

DiNa-LRM formulates preference learning directly on noisy diffusion states in latent space, employing a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, supported by a timestep-conditioned reward head on a pretrained latent diffusion backbone and inference-time noise ensembling, to provide efficient and robust rewards for aligning diffusion models without VLM guidance or pixel-space supervision.

What carries the argument

Noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, applied through a timestep-conditioned reward head on noisy latent states.
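
A minimal sketch of how such a likelihood could be computed follows. The text above does not give the paper's exact form, so everything here is an assumption for illustration: the classical Thurstone preference probability is a normal CDF of the reward gap divided by the combined standard deviation, and the hypothetical `reward_std` schedule simply lets that standard deviation grow with the diffusion noise level. The function names, `base_std`, and `scale` are invented for this sketch and are not taken from the paper.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reward_std(noise_level: float, base_std: float = 1.0, scale: float = 1.0) -> float:
    """Hypothetical uncertainty schedule (an assumption, not the paper's):
    the reward standard deviation grows with the diffusion noise level."""
    return math.sqrt(base_std ** 2 + (scale * noise_level) ** 2)


def thurstone_preference_prob(
    r_win: float, r_lose: float, noise_win: float, noise_lose: float
) -> float:
    """P(winner preferred) = Phi((r_win - r_lose) / sqrt(var_win + var_lose))."""
    var = reward_std(noise_win) ** 2 + reward_std(noise_lose) ** 2
    return normal_cdf((r_win - r_lose) / math.sqrt(var))


def thurstone_nll(
    r_win: float, r_lose: float, noise_win: float, noise_lose: float, eps: float = 1e-12
) -> float:
    """Negative log-likelihood of one labeled preference pair."""
    p = thurstone_preference_prob(r_win, r_lose, noise_win, noise_lose)
    return -math.log(max(p, eps))


if __name__ == "__main__":
    # The same reward gap yields a less confident preference at higher noise.
    print(thurstone_preference_prob(1.0, 0.0, 0.0, 0.0))
    print(thurstone_preference_prob(1.0, 0.0, 1.0, 1.0))
```

Under this assumed schedule, a fixed reward gap is pulled toward a 50/50 preference as the noise level rises, which is one way a likelihood can discount comparisons made on heavily noised states.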

If this is right

  • DiNa-LRM outperforms existing diffusion-based reward baselines across image alignment benchmarks.
  • It reaches performance competitive with state-of-the-art VLMs at a fraction of the computational cost.
  • Preference optimization using DiNa-LRM improves alignment dynamics and reduces overall resource use.
  • Inference-time noise ensembling supplies a built-in mechanism for test-time scaling of the reward signal.
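
The noise-ensembling idea in the last bullet can be sketched as follows, under stated assumptions. The page does not say how the paper perturbs latents or combines scores, so the linear interpolation in `perturb`, the plain average in `ensembled_reward`, and the `reward_fn(noisy_latent, noise_level)` signature are all hypothetical choices made for illustration.

```python
import random
from statistics import fmean
from typing import Callable, List, Sequence

# Hypothetical reward signature: (noisy latent, noise level) -> scalar score.
RewardFn = Callable[[List[float], float], float]


def perturb(latent: Sequence[float], noise_level: float, rng: random.Random) -> List[float]:
    """Assumed forward process: interpolate between the latent and Gaussian noise."""
    return [(1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0) for x in latent]


def ensembled_reward(
    reward_fn: RewardFn,
    latent: Sequence[float],
    noise_levels: Sequence[float],
    samples_per_level: int = 4,
    seed: int = 0,
) -> float:
    """Average the reward over several noise draws at several noise levels."""
    if not noise_levels or samples_per_level < 1:
        raise ValueError("need at least one noise level and one sample per level")
    rng = random.Random(seed)
    scores = []
    for level in noise_levels:
        for _ in range(samples_per_level):
            scores.append(reward_fn(perturb(latent, level, rng), level))
    return fmean(scores)


if __name__ == "__main__":
    # Toy stand-in for a reward head: the mean of the latent entries.
    def toy_reward(z: List[float], level: float) -> float:
        return sum(z) / len(z)

    print(ensembled_reward(toy_reward, [1.0, 2.0, 3.0], [0.1, 0.3, 0.5]))
```

Raising `samples_per_level` or adding noise levels spends more reward-head evaluations per image in exchange for a lower-variance score, which is the sense in which ensembling acts as a test-time scaling knob.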

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The latent formulation may allow reward signals to be computed inside the same forward pass used for generation, removing an external reward call entirely.
  • Timestep conditioning opens the possibility that different denoising stages could receive differently calibrated preferences without retraining.
  • If the noise-dependent uncertainty model transfers, similar heads could be attached to flow-matching backbones with minimal change.

Load-bearing premise

Preference learning directly on noisy diffusion states with a noise-calibrated Thurstone likelihood and diffusion-noise-dependent uncertainty is sufficient to eliminate domain mismatch and produce reliable preference signals without additional pixel-space supervision or external VLM guidance.

What would settle it

An image alignment benchmark result in which DiNa-LRM requires comparable or greater compute than a VLM to reach equivalent preference accuracy would falsify the efficiency claim.

Original abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes DiNa-LRM, a diffusion-native latent reward model for preference optimization of diffusion and flow-matching models. It formulates preference learning directly on noisy latent diffusion states via a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, using a pretrained latent diffusion backbone plus a lightweight timestep-conditioned reward head, and supports inference-time noise ensembling. The central claims are that DiNa-LRM substantially outperforms existing diffusion-based reward baselines, achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost across image alignment benchmarks, and improves preference optimization dynamics.

Significance. If the empirical claims hold with rigorous validation, the work could be significant for enabling more efficient and domain-consistent reward modeling in diffusion alignment, reducing reliance on expensive pixel-space VLMs while introducing a native mechanism for test-time scaling via noise ensembling.

major comments (1)
  1. [Abstract] Abstract: the central claims of substantial outperformance over diffusion-based reward baselines and competitiveness with VLMs at a fraction of the cost are stated without any quantitative results, baseline details, statistical tests, ablation studies, or experimental setup. This absence makes the primary performance assertions impossible to verify and is load-bearing for the contribution.
minor comments (2)
  1. The definition and derivation of the noise-calibrated Thurstone likelihood and its diffusion-noise-dependent uncertainty term should be presented with explicit equations and variable definitions to allow reproducibility.
  2. Clarify whether the timestep-conditioned reward head introduces additional free parameters beyond the listed ones and how they are trained.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater specificity in the abstract. We address this point directly below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims of substantial outperformance over diffusion-based reward baselines and competitiveness with VLMs at a fraction of the cost are stated without any quantitative results, baseline details, statistical tests, ablation studies, or experimental setup. This absence makes the primary performance assertions impossible to verify and is load-bearing for the contribution.

    Authors: We agree that the abstract would be stronger with concrete quantitative anchors. In the revised manuscript we will insert the key reported metrics (e.g., win-rate deltas versus diffusion baselines and VLM cost ratios on the primary alignment benchmarks), name the main baselines, and briefly note the evaluation protocol, all within standard abstract length limits. The full experimental details, including statistical tests, ablations, and setup, remain in Sections 4 and 5 as before.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces DiNa-LRM as an independent architectural proposal: preference learning formulated directly on noisy latent diffusion states via a noise-calibrated Thurstone likelihood with timestep-dependent uncertainty, implemented with a pretrained diffusion backbone plus lightweight reward head. No equations, performance claims, or central premises in the provided text reduce to quantities fitted from the evaluation data itself, nor do they rest on self-citation chains or imported uniqueness results. The efficiency and performance arguments follow directly from the stated construction without circular reduction. This is the common case of a self-contained method proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on a new likelihood formulation and reuse of a pretrained diffusion backbone; no external benchmarks or machine-checked proofs are referenced in the abstract.

free parameters (1)
  • timestep-conditioned reward head weights
    Learned parameters of the added reward head that are fitted during training.
axioms (1)
  • [standard math] Thurstone model for pairwise preferences
    Standard probabilistic model for preference data invoked to define the likelihood.
invented entities (1)
  • noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty [no independent evidence]
    purpose: To provide a diffusion-native preference probability that accounts for varying noise levels
    New formulation introduced to operate directly on noisy latent states.

pith-pipeline@v0.9.0 · 5750 in / 1321 out tokens · 35371 ms · 2026-05-25T06:39:12.012996+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
