Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Pith reviewed 2026-05-08 16:48 UTC · model grok-4.3
The pith
Diffusion models can align with human preferences by competing against themselves in a Nash game.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formulate diffusion alignment from a game-theoretic perspective and propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self-improvement, leading to better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics, where it consistently outperforms existing preference-based diffusion alignment methods.
What carries the argument
Diffusion Nash Preference Optimization (Diff.-NPO), a self-play mechanism in which the diffusion policy is trained to reach Nash equilibrium against a frozen copy of itself, generating its own preference signals.
Load-bearing premise
That a self-play Nash game between a policy and its own copy can represent the full complexity of human preferences without needing reward models or additional selection steps.
What would settle it
A head-to-head test on a preference dataset with clearly intransitive or multi-way choices: if Diff.-NPO produces no measurable gain there over standard direct preference optimization, the central claim is falsified; a toy illustration of such a case follows.
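To make that falsification target concrete, the sketch below (ours, not the paper's; the matrix values are invented) builds a three-way cyclic preference that no Bradley-Terry utility vector can reproduce, since BT win probabilities sigma(u_i - u_j) > 1/2 are necessarily transitive, and then recovers the Nash mixed strategy of the induced symmetric game with fictitious play.

```python
import numpy as np

# Cyclic (intransitive) preference matrix over three generations a, b, c:
# P[i, j] = probability that generation i is preferred over generation j.
# a beats b, b beats c, c beats a. No Bradley-Terry utilities u_i can match
# this, because sigma(u_i - u_j) > 1/2 for each pair would force transitivity.
P = np.array([
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
])

def fictitious_play(payoff: np.ndarray, steps: int = 20000) -> np.ndarray:
    """Approximate a Nash strategy of a zero-sum game by fictitious play."""
    n = payoff.shape[0]
    counts = np.ones(n)  # empirical play counts of the opponent
    for _ in range(steps):
        best = int(np.argmax(payoff @ (counts / counts.sum())))
        counts[best] += 1.0
    return counts / counts.sum()

# Centering at 1/2 turns the preference table into a symmetric zero-sum game.
pi = fictitious_play(P - 0.5)
print("Nash mixed strategy:", np.round(pi, 3))  # approx. [1/3, 1/3, 1/3]
# A BT-fitted aligner must rank one generation on top; the Nash policy mixes.
# That behavioral gap is exactly what the proposed head-to-head test probes.
```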
Original abstract
Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling and has been widely adopted in diffusion alignment. However, existing preference-based methods for diffusion alignment still rely on reward-induced preference signals and typically assume that human preferences can be adequately modeled by the Bradley-Terry (BT) model, which may fail to capture the full complexity of human preferences. In this paper, we formulate diffusion alignment from a game-theoretic perspective. We propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self-improvement, leading to better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics. Diff.-NPO consistently outperforms existing preference-based diffusion alignment methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing diffusion alignment methods rely on reward-induced signals and the Bradley-Terry model, which may not capture complex human preferences. It proposes Diffusion Nash Preference Optimization (Diff.-NPO), a self-play Nash equilibrium framework in which the current policy competes against itself to achieve self-improvement and general preference alignment without explicit reward modeling or BT assumptions. The authors report that this yields better alignment and empirically outperforms prior preference-based diffusion methods on text-to-image generation across various metrics.
Significance. If the central claim holds—that the Nash self-play objective provides a genuinely general preference framework without reintroducing latent utility functions, auxiliary models, or post-hoc selection rules—it would offer a meaningful advance over DPO-style methods by potentially handling non-transitive or context-dependent preferences. The reported empirical gains on T2I tasks add practical value, though significance hinges on whether the formulation is parameter-free and directly derived from preference data.
major comments (2)
- [§3] §3 (Nash formulation and loss derivation): the payoff function for the self-play game must be shown to be computed directly from raw preference pairs without deriving a latent utility or applying any selection rule that effectively recreates a reward signal; otherwise the generality claim reduces to a reparameterized DPO variant. This is load-bearing for the central contribution.
- [§4] §4 (experimental setup): the outperformance claims require explicit confirmation that no post-hoc hyperparameter tuning or baseline selection was performed after seeing results, and that all compared methods used identical preference data and evaluation protocols.
minor comments (2)
- [§3.2] Notation for the policy and opponent in the self-play game should be introduced with a clear table or diagram to avoid ambiguity when the same network is used for both roles.
- [Abstract] The abstract and introduction should cite the specific prior diffusion DPO papers being outperformed so readers can immediately locate the baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with clarifications on the formulation and experiments, and will incorporate explicit additions in the revised manuscript.
Point-by-point responses
-
Referee: [§3] §3 (Nash formulation and loss derivation): the payoff function for the self-play game must be shown to be computed directly from raw preference pairs without deriving a latent utility or applying any selection rule that effectively recreates a reward signal; otherwise the generality claim reduces to a reparameterized DPO variant. This is load-bearing for the central contribution.
Authors: In Diff.-NPO the payoff is defined directly on raw preference pairs: for any pair of generations sampled from the current policy and its frozen copy, the payoff is the binary preference label from the data (1 if the first is preferred, 0 otherwise). No latent utility or reward is estimated; the Nash objective is obtained by setting the expected payoff gradient to zero under the self-play distribution. This yields a loss that does not invoke the Bradley-Terry model. We will insert a step-by-step derivation in the revised §3 that starts from the preference pairs and arrives at the equilibrium condition without intermediate reward modeling (a toy sketch of this self-play loop follows these responses). revision: yes
-
Referee: [§4] §4 (experimental setup): the outperformance claims require explicit confirmation that no post-hoc hyperparameter tuning or baseline selection was performed after seeing results, and that all compared methods used identical preference data and evaluation protocols.
Authors: All hyperparameters were chosen on a held-out validation split before the final test runs; no values were altered after inspecting the reported numbers. Every baseline was re-run on the identical preference dataset splits, with the same sampling settings, number of generations, and evaluation metrics. We will add a short paragraph in the revised §4 that states these facts explicitly and lists the shared data and protocol details. revision: yes
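For readers who want the mechanics, the following is a minimal toy sketch of the loop the rebuttal describes, not the authors' implementation: the diffusion policy is collapsed to a distribution over K candidate generations, the payoff against a frozen copy is the raw preference probability with no fitted reward, and each update is a KL-regularized mirror-descent step toward the reference model, mirroring the KL regularization τ and OMD parameter η that appear among the inputs of the paper's Algorithm 1. All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the diffusion policy: a distribution over K candidate
# generations for one prompt. The general preference oracle pref[i, j] is
# P(i preferred over j); it may be intransitive, and no scalar reward or
# Bradley-Terry utility is fitted anywhere below.
K = 4
A = rng.uniform(0.0, 1.0, (K, K))
pref = np.where(np.arange(K)[:, None] < np.arange(K)[None, :], A, 1.0 - A.T)
np.fill_diagonal(pref, 0.5)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

pi_ref = np.full(K, 1.0 / K)        # reference (pretrained) policy
tau, eta, steps = 0.1, 0.5, 2000    # KL weight tau, OMD step size eta (assumed names)

logits = np.log(pi_ref).copy()
for _ in range(steps):
    opponent = softmax(logits)      # frozen copy of the current policy
    payoff = pref @ opponent        # expected raw preference vs. the copy
    # KL-regularized mirror-descent step; its fixed point is the Nash
    # equilibrium of the tau-regularized preference game.
    logits = (1 - eta * tau) * logits + eta * tau * np.log(pi_ref) + eta * payoff

pi = softmax(logits)
print("self-play policy:", np.round(pi, 3))
# At the fixed point, pi is proportional to pi_ref * exp((pref @ pi) / tau):
# no generation can gain expected preference by deviating, which is the
# equilibrium condition the rebuttal's derivation is meant to reach.
```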
Circularity Check
No significant circularity; derivation self-contained
Full rationale
The paper formulates diffusion alignment as a self-play Nash game and introduces Diff.-NPO as a general preference framework that avoids explicit reward modeling and BT assumptions. No equations or derivation steps are provided in the abstract that reduce by construction to fitted inputs, self-definitions, or self-citations. The central claim introduces independent game-theoretic structure (policy playing against itself for self-improvement) that is not shown to be equivalent to prior DPO-style losses via renaming or ansatz smuggling. The method is positioned against external benchmarks (existing preference-based diffusion methods), making the derivation self-contained rather than circular. No load-bearing self-citation chains or uniqueness theorems imported from the authors' prior work appear in the given text.
Reference graph
Works this paper leans on
-
[1]
Fine-tuning language models to find agreement among humans with diverse preferences
Michiel Bakker, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt Botvinick, et al. Fine-tuning language models to find agreement among humans with diverse preferences. Advances in Neural Information Processing Systems, 35:38176–38189, 2022.
2022
-
[2]
Improving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
2023
-
[3]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
2023
-
[4]
Rank analysis of incomplete block designs: I. The method of paired comparisons
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
1952
-
[5]
Directly fine-tuning diffusion models on differentiable rewards
Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023.
2023
-
[6]
Emu: Enhancing image generation models using photogenic needles in a haystack
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
2023
-
[7]
DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885, 2023.
2023
-
[8]
A decision-theoretic generalization of on-line learning and an application to boosting
Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
1997
-
[9]
Diffusion-RPO: Aligning diffusion models through relative preference optimization
Yi Gu, Zhendong Wang, Yueqin Yin, Yujia Xie, and Mingyuan Zhou. Diffusion-RPO: Aligning diffusion models through relative preference optimization. arXiv preprint arXiv:2406.06382, 2024.
2024
-
[10]
SAIL: Self-amplified iterative learning for diffusion model alignment with minimal human feedback
Xiaoxuan He, Siming Fu, Wanli Li, Zhiyuan Li, Dacheng Yin, Kang Rong, Fengyun Rao, and Bo Zhang. SAIL: Self-amplified iterative learning for diffusion model alignment with minimal human feedback. arXiv preprint arXiv:2602.05380, 2026.
2026
-
[11]
Improving generative ad text on Facebook using reinforcement learning
Daniel R Jiang, Alex Nikulkov, Yu-Chia Chen, Yang Bai, and Zheqing Zhu. Improving generative ad text on Facebook using reinforcement learning. arXiv preprint arXiv:2507.21983, 2025.
2025
-
[12]
Inference-time alignment control for diffusion models with reinforcement learning guidance
Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, and Xipeng Qiu. Inference-time alignment control for diffusion models with reinforcement learning guidance. arXiv preprint arXiv:2508.21016, 2025.
2025
-
[13]
Test-time alignment of diffusion models without reward over-optimization
Sunwoo Kim, Minkyu Kim, and Dongmin Park. Test-time alignment of diffusion models without reward over-optimization. In The Thirteenth International Conference on Learning Representations, 2025.
2025
-
[14]
Pick-a-Pic: An open dataset of user preferences for text-to-image generation
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation, 2023. URL https://arxiv.org/abs/2305.01569.
2023
-
[15]
Bandit Algorithms
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
2020
-
[16]
Calibrated multi-preference optimization for aligning diffusion models
Kyungmin Lee, Xiahong Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, and Yinxiao Li. Calibrated multi-preference optimization for aligning diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18465–18475, 2025.
2025
-
[17]
Aligning diffusion models by optimizing human utility
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility. Advances in Neural Information Processing Systems, 37:24897–24925, 2024.
2024
-
[18]
AEGPO: Adaptive entropy-guided policy optimization for diffusion models
Yuming Li, Qingyu Li, Chengyu Bai, Xiangyang Luo, Zeyue Xue, Wenyu Qin, Meng Wang, Yikai Wang, and Shanghang Zhang. AEGPO: Adaptive entropy-guided policy optimization for diffusion models. arXiv preprint arXiv:2602.06825, 2026.
2026
-
[19]
UniWorld-V2: Reinforce image editing with diffusion negative-aware finetuning and MLLM implicit feedback
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, et al. UniWorld-V2: Reinforce image editing with diffusion negative-aware finetuning and MLLM implicit feedback. arXiv preprint arXiv:2510.16888, 2025.
2025
-
[20]
Step-aware preference optimization: Aligning preference with denoising performance at each step
Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step. arXiv preprint arXiv:2406.04314, 2024.
2024
-
[21]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.
2025
-
[22]
Intransitivity, utility, and the aggregation of preference patterns
Kenneth O May. Intransitivity, utility, and the aggregation of preference patterns. Econometrica: Journal of the Econometric Society, pages 1–13, 1954.
1954
-
[23]
Nash learning from human feedback
Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, et al. Nash learning from human feedback. In Forty-first International Conference on Machine Learning, 2024.
2024
-
[24]
GPT Image 2
OpenAI. GPT Image 2. https://developers.openai.com/api/docs/models/gpt-image-2, 2026. Accessed: 2026-04-29.
2026
-
[25]
The analysis of permutations
Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202, 1975.
1975
-
[26]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
2023
-
[27]
Aligning text-to-image diffusion models with reward backpropagation
Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation, 2023.
2023
-
[28]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
2021
-
[29]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023. URL https://arxiv.org/abs/2305.18290.
2023
-
[31]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
2022
-
[32]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
2022
-
[33]
LAION-Aesthetics
Christoph Schuhmann. LAION-Aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022.
2022
-
[34]
Tuning-free alignment of diffusion models with direct noise optimization
Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Tuning-free alignment of diffusion models with direct noise optimization. In ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling, 2024.
2024
-
[35]
Intransitivity of preferences
Amos Tversky. Intransitivity of preferences. Psychological Review, 76(1):31–48, 1969. doi: 10.1037/h0026750.
1969
-
[36]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.
2024
-
[37]
Diffusion-NPO: Negative preference optimization for better preference aligned generation of diffusion models
Fu-Yun Wang, Yunhao Shui, Jingtan Piao, Keqiang Sun, and Hongsheng Li. Diffusion-NPO: Negative preference optimization for better preference aligned generation of diffusion models. arXiv preprint arXiv:2505.11245, 2025.
2025
-
[38]
Multiplayer Nash preference optimization
Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, and Yejin Choi. Multiplayer Nash preference optimization, 2025. URL https://arxiv.org/abs/2509.23102.
2025
-
[40]
Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023. URL https://arxiv.org/abs/2306.09341.
2023
-
[41]
Rethinking DPO-style diffusion aligning frameworks
Xun Wu, Shaohan Huang, Lingjie Jiang, and Furu Wei. Rethinking DPO-style diffusion aligning frameworks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18068–18077, 2025.
2025
-
[42]
ImageReward: Learning and evaluating human preferences for text-to-image generation
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.
2023
-
[43]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.
2025
-
[45]
Using human feedback to fine-tune diffusion models without any reward model
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024
2024
-
[46]
Training-free diffusion model alignment with sampling demons
Po-Hung Yeh, Kuang-Huei Lee, and Jun-Cheng Chen. Training-free diffusion model alignment with sampling demons. arXiv preprint arXiv:2410.05760, 2024.
2024
-
[47]
Scaling autoregressive models for content-rich text-to-image generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789.
2022
-
[49]
Self-play fine-tuning of diffusion models for text-to-image generation
Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation. Advances in Neural Information Processing Systems, 37:73366–73398, 2024.
2024
-
[50]
Mira: Towards mitigating reward hacking in inference-time alignment of T2I diffusion models
Kevin Zhai, Utsav Singh, Anirudh Thatipelli, Souradip Chakraborty, Anit Kumar Sahu, Furong Huang, Amrit Singh Bedi, and Mubarak Shah. Mira: Towards mitigating reward hacking in inference-time alignment of T2I diffusion models. arXiv preprint arXiv:2510.01549, 2025.
2025
-
[51]
SePPO: Semi-policy preference optimization for diffusion alignment
Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Dong Yu, Christopher Brinton, Jiebo Luo, et al. SePPO: Semi-policy preference optimization for diffusion alignment, 2024.
2024
-
[52]
Iterative Nash policy optimization: Aligning LLMs with general preferences via no-regret learning
Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, and Dong Yu. Iterative Nash policy optimization: Aligning LLMs with general preferences via no-regret learning. arXiv preprint arXiv:2407.00617, 2024.
2024
-
[53]
Improving LLM general preference alignment via optimistic online mirror descent
Yuheng Zhang, Dian Yu, Tao Ge, Linfeng Song, Zhichen Zeng, Haitao Mi, Nan Jiang, and Dong Yu. Improving LLM general preference alignment via optimistic online mirror descent. arXiv preprint arXiv:2502.16852, 2025.
2025
-
[54]
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025.
2025
-
[55]
Extragradient preference optimization (EGPO): Beyond last-iterate convergence for Nash learning from human feedback
Runlong Zhou, Maryam Fazel, and Simon S Du. Extragradient preference optimization (EGPO): Beyond last-iterate convergence for Nash learning from human feedback. arXiv preprint arXiv:2503.08942, 2025.
2025
-
[56]
"vase""
Huaisheng Zhu, Teng Xiao, and Vasant G Honavar. Dspo: Direct score preference optimization for diffusion model alignment. International Conference on Learning Representations (ICLR 2025), 2025. 13 A Diff.-NPO Pseudo-code The Pseudo-code of Diff.-NPO is shown in Algorithm 1. Algorithm 1Diffusion-NPO 1:Input:stepsS, KL regularizationτ, OMD parameterη, refer...
2025