Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Bo Yang; Didan Deng; Gongye Liu; Han Gao; Hongbo Fu; Kaihao Zhang; Lei Ke; Wenhan Luo; Yida Zhi; Yongxiang Huang

arxiv: 2602.11146 · v2 · pith:TXFCXWZ4new · submitted 2026-02-11 · 💻 cs.CV · cs.AI

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo

Pith reviewed 2026-05-25 06:39 UTC · model grok-4.3

keywords: diffusion reward model · latent preference learning · noise-calibrated Thurstone · image alignment · preference optimization · VLM alternative · diffusion models

The pith

DiNa-LRM learns preference rewards directly on noisy latent diffusion states, matching VLM results at lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models supply rewards for aligning diffusion generators but incur high compute costs and introduce a mismatch between pixel rewards and latent generation. DiNa-LRM instead trains a reward head on a pretrained latent diffusion backbone, operating directly on noisy states at varying timesteps. It replaces standard likelihoods with a noise-calibrated Thurstone model whose uncertainty scales with diffusion noise level. The same head supports noise ensembling at inference time. On image alignment benchmarks the resulting signals outperform prior diffusion reward methods and reach parity with leading VLMs while using far less computation during both reward evaluation and subsequent preference optimization.

Core claim

DiNa-LRM formulates preference learning directly on noisy diffusion states in latent space, employing a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, supported by a timestep-conditioned reward head on a pretrained latent diffusion backbone and inference-time noise ensembling, to provide efficient and robust rewards for aligning diffusion models without VLM guidance or pixel-space supervision.
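
To make "timestep-conditioned reward head" concrete, here is a minimal sketch of what such a head could look like. It is not the authors' architecture: the FiLM-style scale-and-shift conditioning, the sinusoidal embedding, the class name `TimestepConditionedHead`, and the toy constant weights are all assumptions chosen for illustration, and the pooled backbone features are passed in as a plain list.

```python
import math
from typing import List, Sequence


def timestep_embedding(t: float, dim: int = 4) -> List[float]:
    """Sinusoidal embedding of a timestep t; dim must be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


class TimestepConditionedHead:
    """Hypothetical reward head: modulate pooled features by t, then read out a scalar."""

    def __init__(self, feat_dim: int, emb_dim: int = 4) -> None:
        self.emb_dim = emb_dim
        # Everything below is a learnable parameter of the head (toy constant init here).
        self.w_scale = [[0.01] * emb_dim for _ in range(feat_dim)]
        self.w_shift = [[0.0] * emb_dim for _ in range(feat_dim)]
        self.w_out = [1.0 / feat_dim] * feat_dim
        self.b_out = 0.0

    def __call__(self, features: Sequence[float], t: float) -> float:
        emb = timestep_embedding(t, self.emb_dim)
        # Per-feature scale and shift computed from the timestep embedding.
        scale = [sum(w * e for w, e in zip(row, emb)) for row in self.w_scale]
        shift = [sum(w * e for w, e in zip(row, emb)) for row in self.w_shift]
        modulated = [(1.0 + s) * f + b for f, s, b in zip(features, scale, shift)]
        # Linear readout to a scalar reward.
        return sum(w * m for w, m in zip(self.w_out, modulated)) + self.b_out
```

In this sketch the same features yield different rewards at different timesteps, which is the property the conditioning is meant to provide; in a real system the features would come from the pretrained latent diffusion backbone.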

What carries the argument

Noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, applied through a timestep-conditioned reward head on noisy latent states.
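
A small sketch may help show how a noise-dependent Thurstone (probit) loss behaves. The probit form itself is standard; the uncertainty schedule `noise_std`, its `base_std` and `gain` parameters, and the choice to noise both samples at the same level are assumptions made for illustration, since the page does not reproduce the paper's equations.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def noise_std(sigma_t: float, base_std: float = 1.0, gain: float = 1.0) -> float:
    """Hypothetical schedule: per-sample uncertainty grows with the noise level sigma_t."""
    return math.sqrt(base_std ** 2 + (gain * sigma_t) ** 2)


def thurstone_nll(r_win: float, r_lose: float, sigma_t: float) -> float:
    """Negative log-likelihood that the preferred sample wins at noise level sigma_t."""
    s = noise_std(sigma_t)
    # Two independent samples, each with variance s^2, give a gap variance of 2 * s^2.
    z = (r_win - r_lose) / math.sqrt(2.0 * s ** 2)
    p = std_normal_cdf(z)
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)
```

Under this schedule, a fixed reward gap produces a probability closer to 0.5 at high noise, so heavily noised pairs contribute a softer training signal than nearly clean ones.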

If this is right

DiNa-LRM outperforms existing diffusion-based reward baselines across image alignment benchmarks.
It reaches performance competitive with state-of-the-art VLMs at a fraction of the computational cost.
Preference optimization using DiNa-LRM improves alignment dynamics and reduces overall resource use.
Inference-time noise ensembling supplies a built-in mechanism for test-time scaling of the reward signal.
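
The noise-ensembling point can be sketched as follows. This is one reading of the idea, not the paper's procedure: a clean latent is forward-noised several times at a fixed level and the reward is averaged. The function names, the `alpha_t`/`sigma_t` parameterization, and `toy_reward` (a stand-in for a trained head) are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def add_noise(x0: Sequence[float], alpha_t: float, sigma_t: float,
              rng: random.Random) -> List[float]:
    """Forward-noise a clean latent: x_t = alpha_t * x0 + sigma_t * eps, eps ~ N(0, I)."""
    return [alpha_t * v + sigma_t * rng.gauss(0.0, 1.0) for v in x0]


def ensembled_reward(reward_fn: Callable[[Sequence[float], float], float],
                     x0: Sequence[float], alpha_t: float, sigma_t: float,
                     num_samples: int = 8, seed: int = 0) -> Tuple[float, float]:
    """Average the reward over independent noise draws; also return the spread."""
    rng = random.Random(seed)
    scores = [reward_fn(add_noise(x0, alpha_t, sigma_t, rng), sigma_t)
              for _ in range(num_samples)]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, math.sqrt(var)


def toy_reward(x_t: Sequence[float], sigma_t: float) -> float:
    """Stand-in for a trained reward head: the mean of the noisy latent."""
    return sum(x_t) / len(x_t)
```

Raising `num_samples` trades extra reward evaluations for a lower-variance estimate, which is the sense in which ensembling acts as a test-time scaling knob.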

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The latent formulation may allow reward signals to be computed inside the same forward pass used for generation, removing an external reward call entirely.
Timestep conditioning opens the possibility that different denoising stages could receive differently calibrated preferences without retraining.
If the noise-dependent uncertainty model transfers, similar heads could be attached to flow-matching backbones with minimal change.

Load-bearing premise

Preference learning directly on noisy diffusion states with a noise-calibrated Thurstone likelihood and diffusion-noise-dependent uncertainty is sufficient to eliminate domain mismatch and produce reliable preference signals without additional pixel-space supervision or external VLM guidance.

What would settle it

An image alignment benchmark result in which DiNa-LRM requires comparable or greater compute than a VLM to reach equivalent preference accuracy would falsify the efficiency claim.
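
For reference, "preference accuracy" on a pairwise benchmark is commonly computed as the fraction of human-labeled pairs that the reward model ranks in the same order. The sketch below assumes that definition and counts ties as half correct; the page does not state the exact protocol the paper uses, and `pairwise_accuracy` is an illustrative name.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (preferred_score, rejected_score) pairs ranked correctly."""
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = 0.0
    for preferred, rejected in pairs:
        if preferred > rejected:
            correct += 1.0
        elif preferred == rejected:
            correct += 0.5  # assumption: ties count as half correct
    return correct / len(pairs)
```

Comparing this number at matched compute budgets for DiNa-LRM and a VLM reward is the kind of measurement the test above calls for.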

Original abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiNa-LRM puts a small timestep-conditioned reward head on a latent diffusion backbone and trains it with a noise-aware Thurstone loss, which is a clean way to stay inside the generator rather than calling VLMs.

The paper's main move is to do preference learning directly on noisy latent states instead of pixel-space VLM scores. It adds a noise-calibrated Thurstone likelihood that lets uncertainty grow with diffusion noise level, plus inference-time ensembling over noise samples. That combination is presented as new, and it targets the domain mismatch that arises when a latent generator is optimized through an external pixel-space reward model. Staying in latent space also cuts the compute cost of reward evaluation, which matters when you are aligning large diffusion or flow models repeatedly.

The abstract says the method beats prior diffusion reward baselines and matches strong VLMs while using far less compute, and that it improves the dynamics of the subsequent preference optimization step. If those numbers check out with proper controls, the efficiency claim is worth attention. The description of the architecture itself is internally consistent; nothing in the stated likelihood or head design looks circular or contradictory.

The soft spot is that the abstract supplies no tables, no baseline names, no dataset sizes, and no ablation results, so the performance claims cannot be checked from what is written here. The full manuscript presumably contains the experiments, but without them the central empirical argument remains unverified.

This is for groups already running latent diffusion training loops and looking for cheaper reward signals. It is worth sending to review because the architectural alternative is distinct and the efficiency motivation is concrete, even if the gains need scrutiny in the full results.

Referee Report

1 major / 2 minor

Summary. The paper proposes DiNa-LRM, a diffusion-native latent reward model for preference optimization of diffusion and flow-matching models. It formulates preference learning directly on noisy latent diffusion states via a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, using a pretrained latent diffusion backbone plus a lightweight timestep-conditioned reward head, and supports inference-time noise ensembling. The central claims are that DiNa-LRM substantially outperforms existing diffusion-based reward baselines, achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost across image alignment benchmarks, and improves preference optimization dynamics.

Significance. If the empirical claims hold with rigorous validation, the work could be significant for enabling more efficient and domain-consistent reward modeling in diffusion alignment, reducing reliance on expensive pixel-space VLMs while introducing a native mechanism for test-time scaling via noise ensembling.

major comments (1)

[Abstract] The central claims of substantial outperformance over diffusion-based reward baselines and competitiveness with VLMs at a fraction of the cost are stated without any quantitative results, baseline details, statistical tests, ablation studies, or experimental setup. This absence makes the primary performance assertions impossible to verify and is load-bearing for the contribution.

minor comments (2)

The definition and derivation of the noise-calibrated Thurstone likelihood and its diffusion-noise-dependent uncertainty term should be presented with explicit equations and variable definitions to allow reproducibility.
Clarify whether the timestep-conditioned reward head introduces additional free parameters beyond the listed ones and how they are trained.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater specificity in the abstract. We address this point directly below.

Referee: [Abstract] The central claims of substantial outperformance over diffusion-based reward baselines and competitiveness with VLMs at a fraction of the cost are stated without any quantitative results, baseline details, statistical tests, ablation studies, or experimental setup. This absence makes the primary performance assertions impossible to verify and is load-bearing for the contribution.

Authors: We agree that the abstract would be stronger with concrete quantitative anchors. In the revised manuscript we will insert the key reported metrics (e.g., win-rate deltas versus diffusion baselines and VLM cost ratios on the primary alignment benchmarks), name the main baselines, and briefly note the evaluation protocol, all within standard abstract length limits. The full experimental details, including statistical tests, ablations, and setup, remain in Sections 4 and 5 as before.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces DiNa-LRM as an independent architectural proposal: preference learning formulated directly on noisy latent diffusion states via a noise-calibrated Thurstone likelihood with timestep-dependent uncertainty, implemented with a pretrained diffusion backbone plus lightweight reward head. No equations, performance claims, or central premises in the provided text reduce to quantities fitted from the evaluation data itself, nor do they rest on self-citation chains or imported uniqueness results. The efficiency and performance arguments follow directly from the stated construction without circular reduction. This is the common case of a self-contained method proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on a new likelihood formulation and reuse of a pretrained diffusion backbone; no external benchmarks or machine-checked proofs are referenced in the abstract.

free parameters (1)

timestep-conditioned reward head weights
Learned parameters of the added reward head that are fitted during training.

axioms (1)

Thurstone model for modeling pairwise preferences (standard math)
Standard probabilistic model for preference data invoked to define the likelihood.
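
For readers unfamiliar with the axiom, the classical Thurstone model can be written out as below. This is the textbook form for independent Gaussian utilities, not an equation taken from the paper; the symbols $\mu_i$, $\sigma_i$, and $\Phi$ are the usual ones and how the paper ties $\sigma$ to the diffusion noise level is not specified on this page.

```latex
% Latent utilities: u_i ~ N(mu_i, sigma_i^2), u_j ~ N(mu_j, sigma_j^2), independent.
% Phi is the standard normal CDF.
\[
  P(i \succ j) \;=\; P(u_i > u_j)
  \;=\; \Phi\!\left( \frac{\mu_i - \mu_j}{\sqrt{\sigma_i^{2} + \sigma_j^{2}}} \right)
\]
% Equal-variance special case (Thurstone Case V), sigma_i = sigma_j = sigma:
\[
  P(i \succ j) \;=\; \Phi\!\left( \frac{\mu_i - \mu_j}{\sigma\sqrt{2}} \right)
\]
```

A noise-calibrated variant would, presumably, let the variances depend on the diffusion timestep, so that comparisons made on heavily noised states are treated as less certain.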

invented entities (1)

noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty (no independent evidence)
purpose: To provide a diffusion-native preference probability that accounts for varying noise levels
New formulation introduced to operate directly on noisy latent states.

pith-pipeline@v0.9.0 · 5750 in / 1321 out tokens · 35371 ms · 2026-05-25T06:39:12.012996+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
cs.CV · 2026-05 · unverdicted · novelty 7.0

KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.


[1] [1]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952. ISSN 00063444. URLhttp://www.jstor.org/stable/2334029

work page arXiv 1952

[3] [3]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

work page 2024

[4] [4]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Huang, S., Hou, Z., Jiang, D., Jin, X., Li, L., et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv:2511.22699, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Toward generalized image quality assessment: Relaxing the perfect reference quality assumption

Chen, D., Wu, T., Ma, K., and Zhang, L. Toward generalized image quality assessment: Relaxing the perfect reference quality assumption. InProceedings ofthe ComputerVisionandPattern RecognitionConference, pp. 12742–12752, 2025

work page 2025

[6] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

[7] Gong, Y., Wang, X., Wu, J., Wang, S., Wang, Y., and Wu, X. Onereward: Unified mask-guided image generation via multi-task human preference learning. arXiv:2508.21066, 2025

[8] He, X., Jiang, D., Zhang, G., Ku, M., Soni, A., Siu, S., Chen, H., Chandra, A., Jiang, Z., Arulraj, A., et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In EMNLP, pp. 2105–2123, 2024

[9] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

[10] Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

[11] Jiang, D., Ku, M., Li, T., Ni, Y., Sun, S., Fan, R., and Chen, W. Genai arena: An open evaluation platform for generative models. Advances in Neural Information Processing Systems, 37:79889–79908, 2024

[12] Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., and Levy, O. Pick-a-pic: An open dataset of user preferences for text-to-image generation. NeurIPS, 36:36652–36663, 2023

[13] Labs, B. F. Flux. https://github.com/black-forest-labs/flux, 2024

[14] Li, A. C., Prabhudesai, M., Duggal, S., Brown, E., and Pathak, D. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2206–2217, 2023

[15] Li, J., Li, D., Xiong, C., and Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900. PMLR, 2022

[16] Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., et al. Rich human feedback for text-to-image generation. In CVPR, 2024

[17] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[18] Lipman, Y., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R. T., Lopez-Paz, D., Ben-Hamu, H., and Gat, I. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024

[19] Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-GRPO: Training flow matching models via online RL. arXiv:2505.05470, 2025

[20] Liu, J., Liu, G., Liang, J., Yuan, Z., Liu, X., Zheng, M., Wu, X., Wang, Q., Xia, M., Wang, X., et al. Improving video generation with human feedback. arXiv:2501.13918, 2025

[21] Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[22] Lu, Y., Ren, Y., Xia, X., Lin, S., Wang, X., Xiao, X., Ma, A. J., Xie, X., and Lai, J.-H. Adversarial distribution matching for diffusion distillation towards efficient image and video synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16818–16829, 2025

[23] Ma, Y., Wu, X., Sun, K., and Li, H. Hpsv3: Towards wide-spectrum human preference score. In ICCV, pp. 15086–15095, 2025

[24] Mi, X., Yu, W., Lian, J., Jie, S., Zhong, R., Liu, Z., Zhang, G., Zhou, Z., Xu, Z., Zhou, Y., et al. Video generation models are good latent reward models. arXiv:2511.21541, 2025

[25] Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023

[26] Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

[27] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

[28] Prabhudesai, M., Goyal, A., Pathak, D., and Fragkiadaki, K. Aligning text-to-image diffusion models with reward backpropagation. arXiv:2310.03739, 2023

[29] Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025

[30] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021

[31] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022

[32] Sauer, A., Boesel, F., Dockhorn, T., Blattmann, A., Esser, P., and Rombach, R. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH Asia 2024 Conference Papers, pp. 1–11, 2024

[33] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv:2509.20427, 2025

[34] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

[35] Thurstone, L. L. A law of comparative judgment. In Scaling, pp. 81–92. Routledge, 2017

[36] Tsai, M.-F., Liu, T.-Y., Qin, T., Chen, H.-H., and Ma, W.-Y. Frank: a ranking method with fidelity loss. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 383–390, 2007

[37] Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., and Naik, N. Diffusion model alignment using direct preference optimization. In CVPR, pp. 8228–8238, 2024

[38] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models. arXiv:2503.20314, 2025

[39] Wang, J., Liang, J., Liu, J., Liu, H., Liu, G., Zheng, J., Pang, W., Ma, A., Xie, Z., Wang, X., et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025

[40] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[41] Wang, Q., Liu, J., Liang, J., Jiang, Y., Zhang, Y., Chen, J., Zheng, Y., Wang, X., Wan, P., Yue, X., et al. Vr-thinker: Boosting video reward models through thinking-with-image reasoning. arXiv:2510.10518, 2025

[42] Wang, Y., Li, Z., Zang, Y., Wang, C., Lu, Q., Jin, C., and Wang, J. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv:2505.03318, 2025

[43] Wang, Y., Zang, Y., Li, H., Jin, C., and Wang, J. Unified reward model for multimodal understanding and generation. arXiv:2503.05236, 2025

[44] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.-m., Bai, S., Xu, X., Chen, Y., et al. Qwen-image technical report. arXiv:2508.02324, 2025

[45] Wu, J., Gao, Y., Ye, Z., Li, M., Li, L., Guo, H., Liu, J., Xue, Z., Hou, X., Liu, W., et al. Rewarddance: Reward scaling in visual generation. arXiv:2509.08826, 2025

[46] Wu, T., Zou, J., Liang, J., Zhang, L., and Ma, K. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv:2505.14460, 2025

[47] Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., and Li, H. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv:2306.09341, 2023

[48] Wu, X., Sun, K., Zhu, F., Zhao, R., and Li, H. Human preference score: Better aligning text-to-image models with human preference. In ICCV, pp. 2096–2105, 2023

[49] Xiang, W., Yang, H., Huang, D., and Wang, Y. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15802–15812, 2023

[50] Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., and Dong, Y. Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS, 36:15903–15935, 2023

[51] Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv:2505.07818, 2025

[52] Yang, K., Tao, J., Lyu, J., Ge, C., Chen, J., Shen, W., Zhu, X., and Li, X. Using human feedback to fine-tune diffusion models without any reward model. In CVPR, pp. 8941–8951, 2024

[53] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., and Freeman, B. Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems, 37:47455–47487, 2024

[54] Zhang, R., Zhang, M., Zhou, J., Guo, Z., Liu, X., Xu, Z., Zhong, Z., Yan, P., Luo, H., and Li, X. Mind-v: Hierarchical video generation for long-horizon robotic manipulation with rl-based physical alignment. arXiv preprint arXiv:2512.06628, 2025

[55] Zhang, S., Wang, B., Wu, J., Li, Y., Gao, T., Zhang, D., and Wang, Z. Learning multi-dimensional human preference for text-to-image generation. In CVPR, pp. 8018–8027, 2024

[56] Zhang, T., Da, C., Ding, K., Yang, H., Jin, K., Li, Y., Gao, T., Zhang, D., Xiang, S., and Pan, C. Diffusion model as a noise-aware latent reward model for step-level preference optimization. arXiv:2502.01051, 2025

[57] Zhang, X., Zhang, X., Wu, Y., Cao, Y., Zhang, R., Chu, R., Yang, L., and Yang, Y. Generative universal verifier as multimodal meta-reasoner. arXiv preprint arXiv:2510.13804, 2025

Appendix A Uncertainty Analysis

Our reward model is evaluated on noise-conditioned inputs. Each evaluation samples Gaussian noise to construct a perturbed state $x_t$. Consequentl...