pith. machine review for the scientific record.

arxiv: 2604.14910 · v3 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Reward-Aware Trajectory Shaping for Few-step Visual Generation

Bingyu Li, Chi Zhang, Haibin Huang, Rui Li, Xuelong Li, Yuanzhi Liang

Pith reviewed 2026-05-10 11:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-step generation · trajectory distillation · preference alignment · reward-aware gating · visual generation · diffusion models · generative modeling · student-teacher distillation

The pith

Reward-aware trajectory shaping lets few-step generators optimize for human preferences instead of strictly imitating their multi-step teachers, potentially surpassing them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard distillation forces few-step student models to copy a stronger multi-step teacher, creating an artificial performance ceiling. By adding preference alignment, the student instead optimizes for outputs that score higher on a reward model reflecting human judgments. The proposed method aligns trajectories at selected denoising stages and uses an adaptive gate to tighten or loosen teacher guidance depending on which model currently earns the higher reward. If the approach holds, fast generators could deliver better results than slow ones without any extra cost during use. This shifts the focus from pure speed-quality trade-offs to direct optimization of preferred generation quality.

Core claim

The central claim is that introducing preference alignment awareness into trajectory distillation enables the few-step student to optimize toward reward-preferred generation quality and potentially exceed the multi-step teacher, rather than remaining bounded by rigid imitation. Teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate adaptively strengthens teacher guidance when the teacher scores higher and relaxes it when the student matches or surpasses the teacher, allowing continued reward-driven improvement. This combination transfers preference-relevant knowledge from high-step generators with no added test-time overhead.
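The abstract supplies no equations, so the following is only a minimal sketch of how such a gated training step could be wired together; every name in it (student.rollout, teacher.rollout, reward_model, the sigmoid gate, the horizon set) is an assumption for illustration, not the paper's actual formulation.

import torch

def rats_style_step(student, teacher, reward_model, prompt_emb, noise,
                    horizons=(0.75, 0.5, 0.25), tau=0.1):
    # Few-step student rollout; keep latents at the matched denoising horizons.
    student_traj = student.rollout(noise, prompt_emb, keep_at=horizons)
    with torch.no_grad():
        # Multi-step teacher rollout from the same noise and prompt.
        teacher_traj = teacher.rollout(noise, prompt_emb, keep_at=horizons)
        # A preference reward model scores the decoded outputs of both (tensor scores).
        r_student = reward_model(student.decode(student_traj[-1]), prompt_emb)
        r_teacher = reward_model(teacher.decode(teacher_traj[-1]), prompt_emb)
        # Reward-aware gate: near 1 while the teacher is ahead, near 0 once the
        # student matches or surpasses it (one smooth choice among many).
        gate = torch.sigmoid((r_teacher - r_student) / tau)
    # Horizon matching: trajectory alignment at the selected stages, rescaled by the gate.
    align = sum(((s - t) ** 2).mean() for s, t in zip(student_traj, teacher_traj))
    return gate.mean() * align

In a sketch like this the only gradient reaching the student comes from the alignment term; the reward enters only through the gate's magnitude, which is the tension the referee report below picks up.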

What carries the argument

The reward-aware gate, which adaptively regulates teacher guidance strength according to the relative reward scores of teacher and student outputs during trajectory shaping.

If this is right

  • Few-step models can achieve higher generation quality than their multi-step teachers.
  • Preference knowledge transfers from high-step generators without any increase in test-time computation.
  • The efficiency-quality trade-off improves because the student is no longer capped by teacher imitation.
  • Training remains stable enough for the student to continue improving whenever it matches or exceeds the teacher on reward.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-guided relaxation logic could apply to accelerating other iterative generative processes beyond images.
  • Combining this gate with stronger or multi-modal reward models might further widen the quality gap between few-step and multi-step outputs.
  • The approach implies that reward models can serve as dynamic training signals rather than only as post-hoc evaluators.

Load-bearing premise

A reward model can reliably score generations according to human preferences, and the adaptive gate will permit ongoing improvement without introducing artifacts or training instability.
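One way to probe this premise before committing to training is to check the reward model's agreement with held-out human preference pairs. The sketch below assumes a callable reward_model(image, prompt) and a list of human-labeled pairs; models such as ImageReward or HPS v2 are typical candidates for the role, though the abstract does not name one.

def preference_agreement(reward_model, pairs):
    # pairs: iterable of (prompt, preferred_image, rejected_image) from human labels.
    hits = 0
    for prompt, preferred, rejected in pairs:
        hits += int(reward_model(preferred, prompt) > reward_model(rejected, prompt))
    return hits / len(pairs)  # agreement near 0.5 means the gate would be driven by noise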

What would settle it

Train the model and check whether student reward scores keep rising past the teacher's level after the gate relaxes, or whether quality drops and artifacts appear once the gate begins relaxing guidance.
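A hedged sketch of that check: track exponentially smoothed rewards for student and teacher outputs across training steps and test whether the student, once it crosses the teacher, holds or extends the lead after the gate relaxes. The names and the crossover criterion below are illustrative assumptions.

def ema(values, beta=0.98):
    # Exponential moving average, one way to produce smoothed reward curves like those in Figure 6.
    out, m = [], values[0]
    for v in values:
        m = beta * m + (1 - beta) * v
        out.append(m)
    return out

def student_settles_above_teacher(student_rewards, teacher_rewards, window=500):
    s, t = ema(student_rewards), ema(teacher_rewards)
    crossed = next((i for i in range(len(s)) if s[i] >= t[i]), None)
    if crossed is None or crossed + window > len(s):
        return False  # never surpassed the teacher, or too few steps after the crossover
    # After the gate relaxes, does the student stay at or above the teacher?
    return all(s[i] >= t[i] for i in range(crossed, len(s)))

Reward curves alone will not catch reward hacking, so the artifact half of the question still needs qualitative inspection of the kind shown in Figures 5 and 6.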

Figures

Figures reproduced from arXiv: 2604.14910 by Bingyu Li, Chi Zhang, Haibin Huang, Rui Li, Xuelong Li, Yuanzhi Liang.

Figure 1. Comparison of our framework with existing few…
Figure 2. Overview of Reward-Aware Trajectory Shaping (RATS). Given the same initial noise and text prompt, a few-step…
Figure 3. Qualitative results on Flux. Few-step image generation: left—Hyper-Flux, middle—SenseFlow, …
Figure 4. Qualitative comparison between our method with 8 NFEs and Wan with 50 NFEs. Our method produces consistently…
Figure 5. Teacher–student quality comparison on the first frames of generated videos throughout training. Our method…
Figure 6. Smoothed reward comparison (top) and reward…
Figure 7. Qualitative results on Flux. 8-NFE image generation: left—Hyper-Flux, middle—SenseFlow, right—Ours.
Figure 8. Qualitative results on Flux. 5-NFE image generation: left—Hyper-Flux, middle—SenseFlow, right—Ours.
Figure 9. Qualitative results on Flux. 3-NFE image generation: left—Hyper-Flux, middle—SenseFlow, right—Ours.
Figure 10. Additional qualitative comparisons on Wan-based video generation under…
Figure 11. Additional qualitative comparisons on Wan-based video generation under…
read the original abstract

Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency–quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Reward-Aware Trajectory Shaping (RATS) for few-step visual generation. It claims that standard distillation frameworks limit students to imitating a multi-step teacher, and introduces horizon matching to align teacher and student latent trajectories at key denoising stages together with a reward-aware gate that adaptively regulates teacher guidance according to relative reward performance. When the teacher scores higher the gate strengthens shaping; when the student matches or exceeds the teacher the gate relaxes, purportedly enabling continued reward-driven improvement and preference alignment without extra test-time cost. The abstract states that experiments show substantial gains in the efficiency-quality trade-off.

Significance. If the mechanism is shown to work, the result would be significant for diffusion-based generation: it offers a lightweight way to break the conventional teacher upper bound in few-step distillation by incorporating preference signals, potentially improving quality for applications that require fast sampling.

major comments (2)
  1. [Abstract / §3 (method)] Abstract and method description: the central claim that the reward-aware gate enables the student to surpass the teacher rests on the assertion that relaxation supplies an 'alternative direction' for reward-driven improvement. No explicit reward term, RL objective, or preference loss appears in the student training objective; the framework is described as a gated distillation loss. Without such a term, relaxation alone weakens the teacher signal but does not supply a gradient toward higher-reward outputs, undermining the claim that the student can reliably exceed the teacher. (An illustrative contrast of the two objective forms is sketched after the minor comments below.)
  2. [Experiments] Experiments section: the abstract asserts 'substantial improvements' and 'significantly narrowing the gap' with multi-step generators, yet supplies no quantitative metrics, ablation results, baseline comparisons, or implementation details (e.g., reward model architecture, gate formulation, training hyperparameters). This absence makes it impossible to assess whether the data support the surpassing-teacher claim.
minor comments (2)
  1. [§3] Notation for the reward-aware gate and horizon matching should be formalized with equations rather than prose descriptions to allow reproducibility.
  2. The abstract uses boldface for 'preference alignment awareness' and 'reward-aware gate'; ensure consistent typographic treatment of new terms throughout the manuscript.
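To make major comment 1 concrete, here is one hedged guess at the two objective forms being contrasted, in notation not taken from the paper: z^(k) are latents at the matched horizons, \hat{x}_S and \hat{x}_T the decoded student and teacher outputs, r the reward model, \sigma a sigmoid with temperature \tau, and \lambda a weighting hyperparameter.

% Illustrative only; the reviewed material contains no equations.
\begin{align}
  g &= \sigma\!\left(\frac{r(\hat{x}_T) - r(\hat{x}_S)}{\tau}\right), \\
  \mathcal{L}_{\text{gated}} &= g \sum_{k \in \mathcal{H}} \left\| z^{(k)}_S - z^{(k)}_T \right\|_2^2, \\
  \mathcal{L}_{\text{reward}} &= \mathcal{L}_{\text{gated}} - \lambda\, r(\hat{x}_S).
\end{align}

If the gate g is treated as a constant during backpropagation, the first objective lets the reward rescale imitation but never pushes the student toward higher-reward outputs; only something like the explicit term in the second (or an RL or preference loss) does, which is exactly the gap the referee flags.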

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have identified key areas for improving the clarity of our claims and the completeness of our experimental reporting. We address each major comment below and outline the planned revisions.

read point-by-point responses
  1. Referee: [Abstract / §3 (method)] Abstract and method description: the central claim that the reward-aware gate enables the student to surpass the teacher rests on the assertion that relaxation supplies an 'alternative direction' for reward-driven improvement. No explicit reward term, RL objective, or preference loss appears in the student training objective; the framework is described as a gated distillation loss. Without such a term, relaxation alone weakens the teacher signal but does not supply a gradient toward higher-reward outputs, undermining the claim that the student can reliably exceed the teacher.

    Authors: We appreciate the referee highlighting this important distinction. The reward-aware gate modulates the distillation strength according to relative reward performance between teacher and student trajectories at matched horizons. Relaxation occurs when the student matches or exceeds the teacher, which is intended to prevent the student from being forced back onto the teacher's path and thereby allow it to retain any reward advantages it has discovered. However, we agree that the core objective remains a gated distillation loss without an explicit reward maximization term, RL objective, or preference loss. The potential for surpassing the teacher is therefore indirect, arising from the adaptive relaxation rather than from direct gradient signals toward higher rewards. We will revise the abstract and Section 3 to clarify this mechanism, avoid overstating the role of the gate as supplying an 'alternative direction' via reward gradients, and provide the precise mathematical formulation of the gate and loss to make the distinction transparent. revision: partial

  2. Referee: [Experiments] Experiments section: the abstract asserts 'substantial improvements' and 'significantly narrowing the gap' with multi-step generators, yet supplies no quantitative metrics, ablation results, baseline comparisons, or implementation details (e.g., reward model architecture, gate formulation, training hyperparameters). This absence makes it impossible to assess whether the data support the surpassing-teacher claim.

    Authors: We fully agree that the current manuscript lacks the quantitative evidence and implementation details necessary to substantiate the claims. In the revised version we will expand the Experiments section to report concrete metrics (FID, CLIP similarity, and human preference scores), direct comparisons against the multi-step teacher and relevant few-step baselines, ablation studies on horizon matching and the reward-aware gate, and complete implementation details including the reward model architecture, gate formulation, and all training hyperparameters. These additions will enable readers to evaluate the empirical support for the reported improvements. revision: yes

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

full rationale

The paper proposes RATS as an additive framework combining horizon matching for trajectory alignment with a reward-aware gate that strengthens or relaxes teacher guidance based on relative reward scores. The central claim that this enables the student to surpass the teacher via continued reward-driven improvement is presented as a consequence of the adaptive mechanism rather than a mathematical reduction. No equations are exhibited that render any prediction equivalent to its inputs by construction, no parameters are fitted to a subset and then renamed as predictions, and no load-bearing self-citations or imported uniqueness theorems appear in the description. The approach extends standard distillation without self-definitional or tautological steps, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; full technical details, equations, and experimental setup unavailable, so ledger entries are minimal and provisional.

axioms (1)
  • domain assumption A reward model can serve as a reliable proxy for human-preferred generation quality during training.
    Invoked when the reward-aware gate decides whether to strengthen or relax teacher guidance.
invented entities (1)
  • Reward-aware gate · no independent evidence
    purpose: Adaptively regulates the strength of teacher guidance based on relative reward scores of teacher and student trajectories.
    New component introduced to enable the student to surpass the teacher.

pith-pipeline@v0.9.0 · 5554 in / 1270 out tokens · 42593 ms · 2026-05-10T11:40:55.182029+00:00 · methodology

discussion (0)

