pith. machine review for the scientific record.

arxiv: 2604.08048 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Guiding a Diffusion Model by Swapping Its Tokens

Chao Ma, Shanyan Guan, Weijia Zhang, Wei Li, Wu Ran, Yanhao Ge, Yuehao Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · self-swap guidance · token swapping · classifier-free guidance · image generation · unconditional generation · latent perturbation

The pith

Swapping pairs of the most dissimilar tokens in a diffusion model's latent space steers sampling toward higher-fidelity images without needing text conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self-Swap Guidance as a way to create a perturbation direction by exchanging the most semantically dissimilar token latents, either across spatial locations or channels, and then using that difference to adjust the model's prediction during sampling. This produces a CFG-like effect that raises image quality and alignment but does not require an external conditioning signal, so it works for both conditional and unconditional diffusion models. The swaps are selective rather than global, which the authors argue gives finer control and avoids the instability seen in less constrained perturbation methods. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet show the approach improves fidelity and prompt adherence over prior condition-free baselines when inserted into popular diffusion architectures. The method is presented as a simple plug-in that can be added at inference time.

Core claim

Self-Swap Guidance creates a perturbed prediction by swapping the most semantically dissimilar token latents in the spatial or channel dimension, then steers the diffusion sampling trajectory using the vector difference between this perturbed prediction and the clean prediction, thereby guiding the model toward higher-fidelity output distributions without any text condition.
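Read as an update rule, this claim describes a CFG-style extrapolation away from the swap-perturbed prediction. A minimal sketch, assuming the swap is applied to the input latents and controlled by a single guidance weight w; the paper may instead swap intermediate tokens inside the network:

```python
import torch

def ssg_guided_prediction(model, x_t, t, swap_fn, w: float = 2.0) -> torch.Tensor:
    """CFG-like extrapolation between clean and swap-perturbed predictions:
    guided = clean + w * (clean - perturbed). `swap_fn` and `w` are
    assumptions standing in for the paper's token-swap operator and scale."""
    eps_clean = model(x_t, t)           # prediction on the unmodified latent
    eps_pert = model(swap_fn(x_t), t)   # prediction after token swapping
    return eps_clean + w * (eps_clean - eps_pert)
```

No text condition enters the computation, which is what lets the same rule run unchanged in unconditional settings.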

What carries the argument

Self-Swap Guidance, which identifies and exchanges pairs of most semantically dissimilar token latents to produce a controlled perturbation direction for steering sampling.
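The pairing rule is the step the referee's second minor comment below flags as under-specified. A hedged sketch of one plausible implementation, assuming cosine similarity over token features, greedy most-dissimilar-first pairing, and a swap ratio r akin to the one swept in Figure 5; none of these choices are confirmed by the abstract:

```python
import torch
import torch.nn.functional as F

def swap_most_dissimilar(tokens: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Swap pairs of mutually most-dissimilar tokens.

    tokens: (N, C) token latents for one sample. The cosine metric, the
    greedy pairing, and the ratio are illustrative assumptions, not the
    paper's exact procedure.
    """
    n = tokens.shape[0]
    feats = F.normalize(tokens, dim=-1)
    sim = feats @ feats.T                      # (N, N) cosine similarities
    sim.fill_diagonal_(float("inf"))           # never pair a token with itself
    out = tokens.clone()
    used = torch.zeros(n, dtype=torch.bool)
    n_swaps = max(1, int(ratio * n / 2))       # number of pairs to exchange
    made = 0
    # Greedily take the globally most dissimilar unused pair, then swap it.
    for idx in sim.flatten().argsort().tolist():   # ascending similarity
        i, j = divmod(idx, n)
        if used[i] or used[j]:
            continue
        out[i], out[j] = tokens[j], tokens[i]
        used[i] = used[j] = True
        made += 1
        if made == n_swaps:
            break
    return out
```

A small ratio keeps the perturbation selective rather than global, which the paper argues is what distinguishes SSG from less constrained perturbation methods.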

Load-bearing premise

Swapping the most semantically dissimilar token pairs always produces a perturbation direction that points toward higher-fidelity distributions without introducing new artifacts or instability.

What would settle it

Applying the same token-swap procedure to a standard diffusion model on MS-COCO and measuring FID and CLIP scores; if the guided outputs show fidelity and alignment equal to or worse than unguided sampling across multiple seeds and perturbation strengths, the claim is falsified.
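A sketch of that protocol, with `sample`, `compute_fid`, and `compute_clip_score` as hypothetical stand-ins for a real sampler and metric implementations (none of these names come from the paper):

```python
# Hypothetical falsification sweep: guided vs. unguided sampling across
# seeds and guidance strengths. `prompts` and `ref_images` are placeholder
# MS-COCO captions and reference images.
seeds = [0, 1, 2, 3, 4]
strengths = [1.0, 2.0, 3.0, 4.0]

for w in strengths:
    fid_gain, clip_gain = [], []
    for seed in seeds:
        guided = sample(prompts, guidance="ssg", scale=w, seed=seed)
        plain = sample(prompts, guidance=None, seed=seed)
        # Lower FID is better, so a positive difference favours guidance.
        fid_gain.append(compute_fid(plain, ref_images) - compute_fid(guided, ref_images))
        # Higher CLIP score is better.
        clip_gain.append(compute_clip_score(guided, prompts) - compute_clip_score(plain, prompts))
    mean_fid = sum(fid_gain) / len(fid_gain)
    mean_clip = sum(clip_gain) / len(clip_gain)
    # The core claim is falsified if guidance fails to help at every strength.
    print(f"scale={w}: FID gain {mean_fid:+.2f}, CLIP gain {mean_clip:+.3f}")
```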

Figures

Figures reproduced from arXiv: 2604.08048 by Chao Ma, Shanyan Guan, Weijia Zhang, Wei Li, Wu Ran, Yanhao Ge, Yuehao Liu.

Figure 1. Self-Swap Guidance (SSG) generates higher-fidelity images over a wider range of guidance scales. In contrast, existing methods [1, 18, 19] suffer from poor details at lower guidance scales, or from noise, oversaturation, and oversimplified details at higher guidance scales.

Figure 2. Visualisations of guidance patterns and the iteratively denoised images across different timesteps. The text prompt used is "A loft bed with a dresser underneath it".

Figure 3. Qualitative comparison of unconditional image generation by SDXL.

Figure 4. Qualitative comparison of conditional image generation by SDXL.

Figure 5. Effect of varying guidance scale and swap ratio on image quality and prompt alignment. Panels sweep swap ratio r over {0, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0} and guidance scale ω over {0, 1.0, 2.0, 2.5, 3.0, 3.5, 4.0}, for the prompt "a close up of a cat on a desk near a sandwich" and the empty prompt "∅".

Figure 7. Visualising the effect of different token swap policies. Swapping dissimilar tokens further refines local details and global coherence compared to random swap. In contrast, swapping similar tokens leads to poor generation that resembles the vanilla diffusion model's output.

Figure 8. Compatibility between SSG and CFG. SSG can be applied on top of CFG to further refine image quality.
Original abstract

Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
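The abstract's compatibility claim (SSG perturbs token space, CFG perturbs condition space, per Figure 8) admits a simple additive reading. A minimal sketch, assuming both guidance terms extrapolate from the conditional prediction; the paper's exact composition may differ:

```python
import torch

def cfg_plus_ssg(model, x_t, t, cond, uncond, swap_fn,
                 w_cfg: float = 5.0, w_ssg: float = 2.0) -> torch.Tensor:
    """One plausible stacking of CFG and SSG guidance terms (an assumption,
    not the paper's formula): CFG corrects in condition space, SSG in token
    space, and the two corrections are summed."""
    eps_c = model(x_t, t, cond)             # conditional prediction
    eps_u = model(x_t, t, uncond)           # unconditional prediction
    eps_s = model(swap_fn(x_t), t, cond)    # token-swapped conditional prediction
    return eps_u + w_cfg * (eps_c - eps_u) + w_ssg * (eps_c - eps_s)
```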

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Self-Swap Guidance (SSG) for diffusion models. The core idea is to compute a guidance direction by swapping pairs of the most semantically dissimilar token latents (in spatial or channel dimensions) to produce a perturbed prediction, then steering sampling using the difference between the perturbed and clean predictions. This is presented as a plug-in that enables CFG-style benefits for both conditional and unconditional generation. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet are claimed to show that SSG outperforms prior condition-free guidance methods in image fidelity and prompt alignment, with greater robustness across perturbation strengths.

Significance. If the empirical claims hold, SSG supplies a lightweight, condition-free guidance mechanism that can be inserted into existing diffusion pipelines. The selective token-swap perturbation offers finer granularity than global noise or masking approaches, and the reported robustness to perturbation strength supplies independent practical support. The extension to unconditional models addresses a clear limitation of standard CFG.

minor comments (3)
  1. [Abstract] The claim of outperformance on MS-COCO 2014/2017 and ImageNet is stated without numerical values, error bars, or baseline identifiers, making the magnitude and reliability of the gains impossible to judge from the summary alone.
  2. [Section 3] (Method) The procedure for identifying the 'most semantically dissimilar' token pairs is described only at a high level; the precise similarity metric (e.g., cosine, and on which embeddings) and the tie-breaking rule are needed for exact reproduction.
  3. [Section 4] (Experiments) Tables and figures reporting FID, CLIP score, or other alignment metrics should include standard deviations over multiple random seeds and state the exact baselines and hyper-parameters used for each comparison.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our Self-Swap Guidance (SSG) method. The assessment correctly captures the core idea of using selective token swaps to derive a guidance direction without requiring conditions. Since the report raises no major objections, there are no individual points to rebut; we will address the three minor comments in the camera-ready version by adding numerical results with error bars for the benchmark claims, specifying the exact similarity metric and tie-breaking rule for token pairing, and reporting seed-level standard deviations alongside explicit baselines and hyper-parameters.

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

full rationale

The paper introduces Self-Swap Guidance as an operational inference-time procedure: compute a perturbed prediction by swapping pairs of semantically dissimilar token latents (in spatial or channel dimensions) and steer sampling using the direction between this perturbed prediction and the clean one. This is presented directly in the abstract without any equations, fitted parameters, or derivations that reduce the guidance direction to a quantity defined from the target result itself. No self-citations are invoked as load-bearing uniqueness theorems, and the central claims rest on empirical comparisons to prior condition-free methods on MS-COCO and ImageNet benchmarks. The approach is self-contained as a plug-in modification rather than a closed mathematical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because this is an abstract-only review, no explicit free parameters, axioms, or invented entities can be identified. The approach implicitly assumes that semantic dissimilarity in token latents can be reliably measured and that the resulting perturbation direction is beneficial.

pith-pipeline@v0.9.0 · 5539 in / 1131 out tokens · 31304 ms · 2026-05-10T17:55:56.653987+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1] Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. In ECCV, 2024.

  2. [2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  3. [3] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In ECCV, 2024.

  4. [4] Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, and Kyong Hwan Jin. TAG: Tangential amplifying guidance for hallucination-resistant diffusion sampling. arXiv preprint arXiv:2510.04533, 2025.

  5. [5] Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In ICLR, 2023.

  6. [6] Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. CFG++: Manifold-constrained classifier-free guidance for diffusion models. In ICLR, 2025.

  7. [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

  8. [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.

  9. [9] Dave Epstein, Allan Jabri, Ben Poole, Alexei Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. In NeurIPS, 2023.

  10. [10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

  11. [11] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In ECCV, 2024.

  12. [12] Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, and Tim GJ Rudner. Pre-trained text-to-image diffusion models are versatile representation learners for control. In NeurIPS, 2024.

  13. [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

  14. [14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

  15. [15] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

  16. [16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

  17. [17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In NeurIPS, 2022.

  18. [18] Susung Hong. Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention. In NeurIPS, 2024.

  19. [19] Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. In ICCV, 2023.

  20. [20] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.

  21. [21] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. In NeurIPS, 2024.

  22. [22] Dahye Kim, Xavier Thomas, and Deepti Ghadiyaram. Revelio: Interpreting and leveraging semantic information in diffusion models. In ICCV, 2025.

  23. [23] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In NeurIPS, 2023.

  24. [24] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.

  25. [25] Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024.

  26. [26] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In CVPR, 2023.

  27. [27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

  28. [28] Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, and Qingming Huang. Not all diffusion model activations have been evaluated as discriminative features. In NeurIPS, 2024.

  29. [29] Di Ming, Peng Ren, Yunlong Wang, and Xin Feng. Boosting the transferability of adversarial attack on vision transformer with adaptive token tuning. In NeurIPS, 2024.

  30. [30] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.

  31. [31] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

  32. [32] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  33. [33] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.

  34. [34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  35. [35] Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, and Babak Taati. Token perturbation guidance for diffusion models. In NeurIPS, 2025.

  36. [36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  37. [37] Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Beyond first-order Tweedie: Solving inverse problems using latent diffusion. In CVPR, 2024.

  38. [38] Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, and Romann M Weber. CADS: Unleashing the diversity of diffusion models through condition-annealed sampling. In ICLR, 2024.

  39. [39] Seyedmorteza Sadat, Otmar Hilliges, and Romann M Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. In ICLR, 2024.

  40. [40] Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models. In ICLR, 2025.

  41. [41] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.

  42. [42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.

  43. [43] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.

  44. [44] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4D dynamic scene generation. In ICML, 2023.

  45. [45] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.

  46. [46] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.

  47. [47] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.

  48. [48] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

  49. [49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

  50. [50] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

  51. [51] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.

  52. [52] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In NeurIPS, 2023.

  53. [53] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.

  54. [54] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.

  55. [55] Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, László Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4Real: Towards photorealistic 4D scene generation via video diffusion models. In NeurIPS, 2024.

  56. [56] Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, and Yang Song. Improving diffusion inverse problem solving with decoupled noise annealing. In CVPR, 2025.

  57. [57] Jiahao Zhao and Wenji Mao. Generative adversarial training with perturbed token detection for model robustness. In EMNLP, 2023.