Reward-Aware Trajectory Shaping for Few-step Visual Generation
Pith reviewed 2026-05-10 11:40 UTC · model grok-4.3
The pith
Reward-aware trajectory shaping lets few-step generators surpass their multi-step teachers by following human preferences instead of strict imitation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that introducing preference alignment awareness into trajectory distillation enables the few-step student to optimize toward reward-preferred generation quality and potentially exceed the multi-step teacher, rather than remaining bounded by rigid imitation. Teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate adaptively strengthens teacher guidance when the teacher scores higher and relaxes it when the student matches or surpasses the teacher, allowing continued reward-driven improvement. This combination transfers preference-relevant knowledge from high-step generators with no added test-time overhead.
What carries the argument
The reward-aware gate, which adaptively regulates teacher guidance strength according to the relative reward scores of teacher and student outputs during trajectory shaping.
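The summary gives no equations, but the gate's role can be sketched in a few lines of Python. The sigmoid margin form, the squared-error horizon matching, and every name below are this review's illustrative assumptions, not the authors' formulation:

```python
import math

def reward_aware_gate(r_teacher: float, r_student: float, beta: float = 5.0) -> float:
    """Map the teacher-student reward margin to a guidance weight in (0, 1).

    Shaping strengthens (gate -> 1) when the teacher scores higher and
    relaxes (gate -> 0) as the student surpasses the teacher.
    """
    return 1.0 / (1.0 + math.exp(-beta * (r_teacher - r_student)))

def gated_horizon_loss(z_student: list[float], z_teacher: list[float],
                       r_teacher: float, r_student: float) -> float:
    """Squared-error matching of latent trajectories at one denoising
    horizon, weighted by the reward-aware gate."""
    gate = reward_aware_gate(r_teacher, r_student)
    return gate * sum((s - t) ** 2 for s, t in zip(z_student, z_teacher))
```

One simplification to note: at an exact reward tie this gate sits at 0.5 rather than fully relaxing, so the relax-at-match behavior described above would need a shifted or clipped variant.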
If this is right
- Few-step models can achieve higher generation quality than their multi-step teachers.
- Preference knowledge transfers from high-step generators without any increase in test-time computation.
- The efficiency-quality trade-off improves because the student is no longer capped by teacher imitation.
- Training remains stable enough for the student to continue improving whenever it matches or exceeds the teacher on reward.
Where Pith is reading between the lines
- The same reward-guided relaxation logic could apply to accelerating other iterative generative processes beyond images.
- Combining this gate with stronger or multi-modal reward models might further widen the quality gap between few-step and multi-step outputs.
- The approach implies that reward models can serve as dynamic training signals rather than only as post-hoc evaluators.
Load-bearing premise
A reward model can reliably score generations according to human preferences, and the adaptive gate will permit ongoing improvement without introducing artifacts or training instability.
What would settle it
Train the model and check whether student reward scores keep rising past the teacher's level after the gate relaxes, or whether quality drops and artifacts appear once the gate begins relaxing guidance.
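That check is straightforward to automate from per-checkpoint reward logs; a `patience` threshold guards against a single noisy evaluation being read as surpassing the teacher. All names here are hypothetical:

```python
def surpasses_teacher(student_rewards: list[float], teacher_reward: float,
                      patience: int = 3) -> bool:
    """True if the student's reward stays strictly above the teacher's
    for `patience` consecutive checkpoints; otherwise False."""
    streak = 0
    for r in student_rewards:
        streak = streak + 1 if r > teacher_reward else 0
        if streak >= patience:
            return True
    return False
```

A falling reward curve after the gate relaxes, or a rising reward score paired with visible artifacts, would count against the claim even when this check passes, so it complements rather than replaces human inspection.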
Original abstract
Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Reward-Aware Trajectory Shaping (RATS) for few-step visual generation. It claims that standard distillation frameworks limit students to imitating a multi-step teacher, and introduces horizon matching to align teacher and student latent trajectories at key denoising stages together with a reward-aware gate that adaptively regulates teacher guidance according to relative reward performance. When the teacher scores higher the gate strengthens shaping; when the student matches or exceeds the teacher the gate relaxes, purportedly enabling continued reward-driven improvement and preference alignment without extra test-time cost. The abstract states that experiments show substantial gains in the efficiency-quality trade-off.
Significance. If the mechanism is shown to work, the result would be significant for diffusion-based generation: it offers a lightweight way to break the conventional teacher upper bound in few-step distillation by incorporating preference signals, potentially improving quality for applications that require fast sampling.
major comments (2)
- [Abstract / §3 (method)] Abstract and method description: the central claim that the reward-aware gate enables the student to surpass the teacher rests on the assertion that relaxation supplies an 'alternative direction' for reward-driven improvement. No explicit reward term, RL objective, or preference loss appears in the student training objective; the framework is described as a gated distillation loss. Without such a term, relaxation alone weakens the teacher signal but does not supply a gradient toward higher-reward outputs, undermining the claim that the student can reliably exceed the teacher.
- [Experiments] Experiments section: the abstract asserts 'substantial improvements' and 'significantly narrowing the gap' with multi-step generators, yet supplies no quantitative metrics, ablation results, baseline comparisons, or implementation details (e.g., reward model architecture, gate formulation, training hyperparameters). This absence makes it impossible to assess whether the data support the surpassing-teacher claim.
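The first objection can be made concrete with a one-dimensional toy. In a purely gated distillation loss the gate only scales the pull toward the teacher, so once it relaxes to zero the objective is flat and supplies no gradient at all; an explicit reward term would be needed to push the student past the teacher. A sketch with hypothetical names, using a finite-difference gradient:

```python
def gated_distill_loss(x_s: float, x_t: float, gate: float) -> float:
    # Gradient w.r.t. x_s is 2 * gate * (x_s - x_t): it only ever pulls
    # the student toward the teacher, and vanishes as gate -> 0.
    return gate * (x_s - x_t) ** 2

def grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Relaxed gate: the loss is flat, so nothing drives further improvement.
g_relaxed = grad(lambda x: gated_distill_loss(x, 0.0, 0.0), 1.0)

# Adding an explicit reward term (a toy concave reward peaked at 2.0)
# restores a gradient toward higher-reward outputs even at gate = 0.
reward = lambda x: -(x - 2.0) ** 2
g_with_reward = grad(lambda x: gated_distill_loss(x, 0.0, 0.0) - reward(x), 1.0)
```

Minimizing the combined loss moves the student toward the reward peak; the gated term alone cannot, which is exactly the gap the referee identifies.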
minor comments (2)
- [§3] Notation for the reward-aware gate and horizon matching should be formalized with equations rather than prose descriptions to allow reproducibility.
- The abstract uses boldface for 'preference alignment awareness' and 'reward-aware gate'; ensure consistent typographic treatment of new terms throughout the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments have identified key areas for improving the clarity of our claims and the completeness of our experimental reporting. We address each major comment below and outline the planned revisions.
Point-by-point responses
Referee: [Abstract / §3 (method)] Abstract and method description: the central claim that the reward-aware gate enables the student to surpass the teacher rests on the assertion that relaxation supplies an 'alternative direction' for reward-driven improvement. No explicit reward term, RL objective, or preference loss appears in the student training objective; the framework is described as a gated distillation loss. Without such a term, relaxation alone weakens the teacher signal but does not supply a gradient toward higher-reward outputs, undermining the claim that the student can reliably exceed the teacher.
Authors: We appreciate the referee highlighting this important distinction. The reward-aware gate modulates the distillation strength according to relative reward performance between teacher and student trajectories at matched horizons. Relaxation occurs when the student matches or exceeds the teacher, which is intended to prevent the student from being forced back onto the teacher's path and thereby allow it to retain any reward advantages it has discovered. However, we agree that the core objective remains a gated distillation loss without an explicit reward maximization term, RL objective, or preference loss. The potential for surpassing the teacher is therefore indirect, arising from the adaptive relaxation rather than from direct gradient signals toward higher rewards. We will revise the abstract and Section 3 to clarify this mechanism, avoid overstating the role of the gate as supplying an 'alternative direction' via reward gradients, and provide the precise mathematical formulation of the gate and loss to make the distinction transparent.
Revision: partial
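One concrete formalization the revision could adopt, written here only as a candidate (the sigmoid form, the horizon set $\mathcal{K}$, and all symbols are this review's assumptions, not the paper's):

```latex
g_k = \sigma\!\left(\beta\left[R\!\left(\hat{x}^{\mathrm{tea}}_{t_k}\right) - R\!\left(\hat{x}^{\mathrm{stu}}_{t_k}\right)\right]\right),
\qquad
\mathcal{L}_{\mathrm{RATS}} = \sum_{k \in \mathcal{K}} g_k \left\lVert z^{\mathrm{stu}}_{t_k} - z^{\mathrm{tea}}_{t_k} \right\rVert_2^2
```

where $R$ is the reward model, $\beta$ a sharpness hyperparameter, and $\mathcal{K}$ the set of matched denoising horizons; shaping strengthens ($g_k \to 1$) when the teacher's reward is higher and relaxes ($g_k \to 0$) as the student pulls ahead.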
Referee: [Experiments] Experiments section: the abstract asserts 'substantial improvements' and 'significantly narrowing the gap' with multi-step generators, yet supplies no quantitative metrics, ablation results, baseline comparisons, or implementation details (e.g., reward model architecture, gate formulation, training hyperparameters). This absence makes it impossible to assess whether the data support the surpassing-teacher claim.
Authors: We fully agree that the current manuscript lacks the quantitative evidence and implementation details necessary to substantiate the claims. In the revised version we will expand the Experiments section to report concrete metrics (FID, CLIP similarity, and human preference scores), direct comparisons against the multi-step teacher and relevant few-step baselines, ablation studies on horizon matching and the reward-aware gate, and complete implementation details including the reward model architecture, gate formulation, and all training hyperparameters. These additions will enable readers to evaluate the empirical support for the reported improvements.
Revision: yes
Circularity Check
No circularity detected in the derivation chain
Full rationale
The paper proposes RATS as an additive framework combining horizon matching for trajectory alignment with a reward-aware gate that strengthens or relaxes teacher guidance based on relative reward scores. The central claim that this enables the student to surpass the teacher via continued reward-driven improvement is presented as a consequence of the adaptive mechanism rather than as a mathematical reduction. No equations are exhibited that render any prediction equivalent to its inputs by construction, no parameters are fitted to a subset of data and then renamed as predictions, and no load-bearing self-citations or imported uniqueness theorems appear in the description. The approach extends standard distillation without self-definitional or tautological steps, leaving the derivation checkable against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A reward model can serve as a reliable proxy for human-preferred generation quality during training.
invented entities (1)
- Reward-aware gate (no independent evidence)