Guiding a Diffusion Model by Swapping Its Tokens
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3
The pith
Swapping pairs of the most dissimilar tokens in a diffusion model's latent space steers sampling toward higher-fidelity images without needing text conditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-Swap Guidance creates a perturbed prediction by swapping the most semantically dissimilar token latents in the spatial or channel dimension, then steers the diffusion sampling trajectory using the vector difference between this perturbed prediction and the clean prediction, thereby guiding the model toward higher-fidelity output distributions without any text condition.
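The steering rule described here is a CFG-style extrapolation: the guided prediction moves along the direction from the perturbed prediction toward the clean one. A minimal sketch, assuming the clean and swap-perturbed predictions are available as arrays and using a hypothetical guidance scale `w` (this summary gives no symbol for it):

```python
import numpy as np

def ssg_step(eps_clean, eps_perturbed, w):
    """CFG-style extrapolation: move along the direction from the
    swap-perturbed prediction toward the clean one, scaled by w."""
    return eps_clean + w * (eps_clean - eps_perturbed)

# Toy example: four 2-dim token predictions.
eps_clean = np.arange(8.0).reshape(4, 2) / 10.0
eps_perturbed = eps_clean + 0.05   # stand-in for a token-swapped prediction
guided = ssg_step(eps_clean, eps_perturbed, w=2.0)
```

With `w = 0` the rule reduces to ordinary unguided sampling; larger `w` pushes the trajectory further away from the perturbed distribution.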
What carries the argument
Self-Swap Guidance, which identifies and exchanges pairs of most semantically dissimilar token latents to produce a controlled perturbation direction for steering sampling.
Load-bearing premise
Swapping the most semantically dissimilar token pairs always produces a perturbation direction that points toward higher-fidelity distributions without introducing new artifacts or instability.
What would settle it
Applying the same token-swap procedure to a standard diffusion model on MS-COCO and measuring FID and CLIP scores; if the guided outputs show equal or worse fidelity and alignment than unguided sampling across multiple seeds and perturbation strengths, the claim is falsified.
Original abstract
Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
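The swap operation in the abstract can be sketched as follows. This is one minimal reading, not the paper's implementation: it assumes cosine similarity between token latents in the spatial dimension and a lowest-index tie-break, neither of which the abstract specifies.

```python
import numpy as np

def swap_most_dissimilar(tokens):
    """Swap the pair of token latents with the lowest pairwise cosine
    similarity. tokens: (N, D) array; returns a copy with the pair swapped."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    sim = unit @ unit.T                 # pairwise cosine similarities
    np.fill_diagonal(sim, np.inf)       # exclude self-pairs from the argmin
    i, j = np.unravel_index(np.argmin(sim), sim.shape)  # lowest-index tie-break
    out = tokens.copy()
    out[[i, j]] = out[[j, i]]           # exchange the two latents
    return out

tokens = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])
swapped = swap_most_dissimilar(tokens)  # tokens 0 and 2 point opposite ways
```

Running the perturbed tokens through the denoiser would then yield the perturbed prediction used in the guidance step; the channel-dimension variant would apply the same selection over channel slices instead of spatial tokens.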
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Self-Swap Guidance (SSG) for diffusion models. The core idea is to compute a guidance direction by swapping pairs of the most semantically dissimilar token latents (in spatial or channel dimensions) to produce a perturbed prediction, then steering sampling using the difference between the perturbed and clean predictions. This is presented as a plug-in that enables CFG-style benefits for both conditional and unconditional generation. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet are claimed to show that SSG outperforms prior condition-free guidance methods in image fidelity and prompt alignment, with greater robustness across perturbation strengths.
Significance. If the empirical claims hold, SSG supplies a lightweight, condition-free guidance mechanism that can be inserted into existing diffusion pipelines. The selective token-swap perturbation offers finer granularity than global noise or masking approaches, and the reported robustness to perturbation strength supplies independent practical support. The extension to unconditional models addresses a clear limitation of standard CFG.
Minor comments (3)
- Abstract: the claim of outperformance on MS-COCO 2014/2017 and ImageNet is stated without numerical values, error bars, or baseline identifiers. This makes the magnitude and reliability of the gains impossible to judge from the summary alone.
- Section 3 (method): the procedure for identifying the 'most semantically dissimilar' token pairs is described at a high level but lacks the precise similarity metric (e.g., cosine similarity, and over which embeddings) and the tie-breaking rule. This detail is needed for exact reproduction.
- Section 4 (experiments): tables and figures reporting FID, CLIP score, or alignment metrics should include standard deviations over multiple random seeds and explicit statements of the exact baselines and hyper-parameters used for each comparison.
Simulated Author's Rebuttal
We thank the referee for the positive summary of our Self-Swap Guidance (SSG) method and for recommending minor revision. The assessment correctly captures the core idea of using selective token swaps to derive a guidance direction without requiring conditions. Since the report lists no specific major comments, we have no individual points to rebut or revise at this stage. We will incorporate any minor suggestions (e.g., typographical or formatting issues) in the camera-ready version.
Circularity Check
No significant circularity detected in derivation or claims
Full rationale
The paper introduces Self-Swap Guidance as an operational inference-time procedure: compute a perturbed prediction by swapping pairs of semantically dissimilar token latents (in spatial or channel dimensions) and steer sampling using the direction between this perturbed prediction and the clean one. This is presented directly in the abstract without any equations, fitted parameters, or derivations that reduce the guidance direction to a quantity defined from the target result itself. No self-citations are invoked as load-bearing uniqueness theorems, and the central claims rest on empirical comparisons to prior condition-free methods on MS-COCO and ImageNet benchmarks. The approach is self-contained as a plug-in modification rather than a closed mathematical loop.
Reference graph
Works this paper leans on
- [1] Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. In ECCV, 2024.
- [2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [3] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In ECCV, 2024.
- [4] Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, and Kyong Hwan Jin. TAG: Tangential amplifying guidance for hallucination-resistant diffusion sampling. arXiv preprint arXiv:2510.04533, 2025.
- [5] Hyungjin Chung, Jeongsol Kim, Michael T McCann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In ICLR, 2023.
- [6] Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. CFG++: Manifold-constrained classifier-free guidance for diffusion models. In ICLR, 2025.
- [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
- [9] Dave Epstein, Allan Jabri, Ben Poole, Alexei Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. In NeurIPS, 2023.
- [10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
- [11] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In ECCV, 2024.
- [12] Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, and Tim GJ Rudner. Pre-trained text-to-image diffusion models are versatile representation learners for control. In NeurIPS, 2024.
- [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
- [15] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- [16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
- [17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In NeurIPS, 2022.
- [18] Susung Hong. Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention. In NeurIPS, 2024.
- [19] Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. In ICCV, 2023.
- [20] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
- [21] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. In NeurIPS, 2024.
- [22] Dahye Kim, Xavier Thomas, and Deepti Ghadiyaram. Revelio: Interpreting and leveraging semantic information in diffusion models. In ICCV, 2025.
- [23] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In NeurIPS, 2023.
- [24] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.
- [25] Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024.
- [26] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In CVPR, 2023.
- [27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [28] Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, and Qingming Huang. Not all diffusion model activations have been evaluated as discriminative features. In NeurIPS, 2024.
- [29] Di Ming, Peng Ren, Yunlong Wang, and Xin Feng. Boosting the transferability of adversarial attack on vision transformer with adaptive token tuning. In NeurIPS, 2024.
- [30] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
- [31] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
- [32] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- [33] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
- [34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
- [35] Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, and Babak Taati. Token perturbation guidance for diffusion models. In NeurIPS, 2025.
- [36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- [37] Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Beyond first-order Tweedie: Solving inverse problems using latent diffusion. In CVPR, 2024.
- [38] Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, and Romann M Weber. CADS: Unleashing the diversity of diffusion models through condition-annealed sampling. In ICLR, 2024.
- [39] Seyedmorteza Sadat, Otmar Hilliges, and Romann M Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. In ICLR, 2024.
- [40] Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models. In ICLR, 2025.
- [41] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
- [42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
- [43] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
- [44] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4D dynamic scene generation. In ICML, 2023.
- [45] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
- [46] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
- [47] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
- [48] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2020.
- [49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [50] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- [51] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
- [52] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In NeurIPS, 2023.
- [53] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.
- [54] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In ICLR, 2025.
- [55] Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, László Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4Real: Towards photorealistic 4D scene generation via video diffusion models. In NeurIPS, 2024.
- [56] Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, and Yang Song. Improving diffusion inverse problem solving with decoupled noise annealing. In CVPR, 2025.
- [57] Jiahao Zhao and Wenji Mao. Generative adversarial training with perturbed token detection for model robustness. In EMNLP, 2023.