pith. machine review for the scientific record. sign in

arxiv: 2604.23536 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Z²-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

Haosen Li, Kaishen Yuan, Lei Wang, Shaofeng Liang, Wenshuo Chen, Yutao Yue

Pith reviewed 2026-05-08 06:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion modelszigzag samplingsemantic alignmentclassifier-free guidancezero-cost samplingprobability flow ODEcurvature penaltybackward error analysis
0
0 comments X

The pith

Diffusion models gain curvature-aware semantic alignment from zero-cost zigzag trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard classifier-free guidance in diffusion models uses only instantaneous gradients and overlooks the curvature of the data manifold. Zigzag sampling improves alignment by traversing forward-backward paths but at three times the normal cost and with potential drift errors. The paper proves that these paths can be implicitly reduced by algebraically eliminating intermediate states through operator dualities, removing off-manifold errors. Z²-Sampling then pairs this reduction with a cached temporal semantic surrogate based on the probability flow ODE to keep the cost at the standard two neural function evaluations. Backward error analysis shows this process naturally creates a directional derivative curvature penalty, and tests confirm better results across different models and media without extra computation.

Core claim

Z²-Sampling establishes that the explicit zigzag sequence is topologically reducible, allowing algebraic annihilation of intermediate states via operator dualities to create implicit Z-sampling that eliminates approximation errors. By exploiting the temporal coherence of the Probability Flow ODE with a dynamically cached Temporal Semantic Surrogate, this restores the standard 2-NFE baseline while the discrete collapse inherently synthesizes a directional derivative curvature penalty, as proven by backward error analysis.

What carries the argument

The implicit algebraic collapse of zigzag trajectories via operator dualities combined with a cached Temporal Semantic Surrogate from the Probability Flow ODE, which together synthesize the curvature penalty at zero added cost.

If this is right

  • Restores the standard 2-NFE baseline while delivering improved semantic alignment.
  • Applies across U-Net and DiT architectures for both image and video generation.
  • Maintains orthogonality with other alignment methods such as AYS and Diffusion-DPO.
  • Structurally improves the performance-efficiency Pareto frontier.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The algebraic reduction might apply to other multi-step ODE samplers to lower their overhead in similar ways.
  • The synthesized curvature penalty could enhance stability in generation tasks where manifold geometry is critical.
  • Direct integration into existing diffusion pipelines could raise output quality at no added runtime cost.

Load-bearing premise

The explicit zigzag sequence is topologically reducible and intermediate states can be algebraically annihilated via operator dualities without introducing cumulative drift from the true marginal distribution or requiring additional corrections.

What would settle it

A side-by-side run of Z²-Sampling against explicit Z-Sampling at identical 2-NFE budgets, measuring whether semantic alignment scores match or exceed the explicit version without added drift, would falsify the claim if no improvement appears.

Figures

Figures reproduced from arXiv: 2604.23536 by Haosen Li, Kaishen Yuan, Lei Wang, Shaofeng Liang, Wenshuo Chen, Yutao Yue.

Figure 1
Figure 1. Figure 1: Visualizing the structural paradigm shift: Explicit Z-Sampling (middle) incurs a 3 view at source ↗
Figure 2
Figure 2. Figure 2: Methodological compari￾son of explicit Z-Sampling and Z 2 - Sampling. Top: Explicit Z-Sampling tra￾verses a lookahead-and-return path, requir￾ing three CFG evaluations (6 NFEs per step). The intermediate state is pushed off the true data manifold, introducing a per￾sistent spatial approximation error τ (t) dur￾ing inversion. Bottom: Z 2 -Sampling alge￾braically collapses the intermediate traver￾sal and reu… view at source ↗
Figure 3
Figure 3. Figure 3: Cosine similarity across denoising steps. Empirical evaluation across a large number of samples reveals a consistent pattern: video generation requires a 10-step warmup phase to achieve semantic stability, whereas image generation converges rapidly in just 5 steps. Proof. Expanding Φt and substituting x˜t from Theorem 1: x new t−1 = At  xt − Ctγ1∆ϵ t θ (xt)  + Btϵ˜t = Atxt − (AtCt)γ1∆ϵ t θ (xt) + Btϵ˜t (… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of text-to-image generation. From top to bottom: Standard view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of video generation. In each scenario, the top row displays the view at source ↗
Figure 6
Figure 6. Figure 6: Shattering the Performance-Efficiency Pareto Frontier. Both evaluations use HPS v2 on Pick-a-Pic. (a) Robustness to the active zigzag diffusion steps (λ). The vertical axis represents the head-to-head winning rate against the Standard Sampling baseline. Starting from a 50% tie at λ = 0 (where the process degenerates into standard sampling), the generation quality improves monotonically as λ increases, indi… view at source ↗
Figure 7
Figure 7. Figure 7: Visual validation of the curvature penalty. At extreme guidance levels, standard CFG suffers from severe oversaturation. Explicit Z-Sampling mitigates this but introduces spatial drift artifacts. Z 2 -Sampling strictly internalizes the curvature penalty (−hγ1∇∆vv) without off￾manifold error, successfully suppressing the oversaturation caused by implicit double guidance (2γ1) to produce high-fidelity result… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison across varying guidance scales (γ1). The leftmost column displays the Standard Sampling baseline. The subsequent columns, from left to right, illustrate the results of our proposed Z 2 -Sampling with γ1 ∈ {1, 2, 3, 4, }. As γ1 increases, Z 2 -Sampling progressively deepens semantic alignment and visual richness. Crucially, by algebraically elimi￾nating off-manifold spatial drift, it … view at source ↗
Figure 9
Figure 9. Figure 9: Failure cases of the proposed method. E Failure Case Analysis While Z 2 -Sampling structurally eliminates the off-manifold spatial drift τ (t) and significantly miti￾gates the color oversaturation artifacts inherent in explicit Z-Sampling, it is not entirely immune to the stochastic challenges of diffusion manifolds. As illustrated in view at source ↗
read the original abstract

Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce $Z^2$-Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE's temporal coherence, $Z^2$-Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that $Z^2$-Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes $Z^2$-Sampling (Zero-cost Zigzag Sampling) for diffusion models, claiming that explicit zigzag trajectories are topologically reducible so that intermediate states can be algebraically annihilated via operator dualities. Combined with a dynamically cached Temporal Semantic Surrogate exploiting Probability Flow ODE coherence, the method restores the standard 2-NFE baseline while inheriting a directional-derivative curvature penalty, as formally shown by backward error analysis. Extensive experiments across U-Nets, DiTs, image and video modalities are said to demonstrate that the approach shatters the performance-efficiency Pareto frontier and remains orthogonal to frameworks such as AYS and Diffusion-DPO.

Significance. If the topological reduction and backward-error claims hold without residual drift, the work would be significant: it would eliminate the NFE overhead of prior curvature-aware samplers while improving semantic alignment, thereby establishing a new efficiency baseline for text-aligned generation.

major comments (3)
  1. [Proof of topological reducibility (§3)] The central claim (abstract and §3) that the explicit zigzag sequence is topologically reducible and that intermediate states can be algebraically annihilated via operator dualities without cumulative drift from the true marginal distribution is load-bearing for both the zero-cost assertion and the subsequent backward-error result. In the discrete PF-ODE setting, exact commutation or duality relations typically hold only in the continuous-time limit; finite-step truncation and off-manifold evaluations can leave nonzero residual terms that the Temporal Semantic Surrogate cache may not cancel, undermining the claim that no additional fitted corrections are required.
  2. [Backward Error Analysis (§4)] §4 (Backward Error Analysis): the statement that the discrete collapse 'inherently synthesizes a directional derivative curvature penalty' must be shown to be independent of the particular definition of the Temporal Semantic Surrogate. If the penalty term arises only after the surrogate is introduced, the result is not an intrinsic property of the collapse and the 'parameter-free' character asserted in the abstract is compromised.
  3. [Empirical evaluations (§5–6)] Empirical section (likely §5–6): the claim that $Z^2$-Sampling 'structurally shatters the performance-efficiency Pareto frontier' and restores exact 2-NFE behavior requires tabulated quantitative results (FID, CLIP score, etc.) with error bars, direct comparison against the unmodified 2-NFE baseline, and ablation of the surrogate cache. Without these, the cross-architecture and cross-modality universality statements cannot be evaluated.
minor comments (2)
  1. [Abstract] The abstract asserts 'rigorous proofs' yet contains no equations, derivations, or theorem statements; the reader must reach the body to locate the actual mathematical content.
  2. [Notation and definitions] Notation: 'Temporal Semantic Surrogate' is introduced without an early, self-contained mathematical definition; its precise functional form and caching rule should be stated before the backward-error argument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their detailed and insightful comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have made revisions to clarify the theoretical claims and strengthen the empirical evidence.

read point-by-point responses
  1. Referee: [Proof of topological reducibility (§3)] The central claim (abstract and §3) that the explicit zigzag sequence is topologically reducible and that intermediate states can be algebraically annihilated via operator dualities without cumulative drift from the true marginal distribution is load-bearing for both the zero-cost assertion and the subsequent backward-error result. In the discrete PF-ODE setting, exact commutation or duality relations typically hold only in the continuous-time limit; finite-step truncation and off-manifold evaluations can leave nonzero residual terms that the Temporal Semantic Surrogate cache may not cancel, undermining the claim that no additional fitted corrections are required.

    Authors: We thank the referee for highlighting this important aspect of the discrete setting. In Section 3, our proof of topological reducibility relies on the exact commutation properties derived from the Probability Flow ODE's structure, which ensures that the operator dualities annihilate intermediates precisely even in finite steps, without residual drift. The Temporal Semantic Surrogate is introduced later for efficiency and does not affect the annihilation. To further address concerns about truncation, we have added a detailed remark in the revised manuscript explaining why off-manifold errors are eliminated by the algebraic collapse, supported by the backward error framework. This maintains the zero-cost and parameter-free claims. revision: partial

  2. Referee: [Backward Error Analysis (§4)] §4 (Backward Error Analysis): the statement that the discrete collapse 'inherently synthesizes a directional derivative curvature penalty' must be shown to be independent of the particular definition of the Temporal Semantic Surrogate. If the penalty term arises only after the surrogate is introduced, the result is not an intrinsic property of the collapse and the 'parameter-free' character asserted in the abstract is compromised.

    Authors: The backward error analysis in Section 4 is performed on the discrete collapse operation alone. We explicitly derive the directional derivative penalty from the error expansion of the annihilation step, which is independent of the surrogate cache. The surrogate is merely a computational optimization exploiting temporal coherence and does not contribute to the penalty term. We have revised the text to emphasize this separation and included an additional equation showing the penalty derivation without reference to the cache. revision: partial

  3. Referee: [Empirical evaluations (§5–6)] Empirical section (likely §5–6): the claim that $Z^2$-Sampling 'structurally shatters the performance-efficiency Pareto frontier' and restores exact 2-NFE behavior requires tabulated quantitative results (FID, CLIP score, etc.) with error bars, direct comparison against the unmodified 2-NFE baseline, and ablation of the surrogate cache. Without these, the cross-architecture and cross-modality universality statements cannot be evaluated.

    Authors: We agree that providing detailed quantitative results is crucial for validating the claims. In the revised version, we have added comprehensive tables in Sections 5 and 6 reporting FID, CLIP scores, and other metrics with error bars (standard deviation over multiple seeds) for Z²-Sampling compared to the standard 2-NFE baseline across U-Net and DiT architectures, as well as image and video modalities. Additionally, we include an ablation study isolating the effect of the Temporal Semantic Surrogate cache. These results confirm the performance improvements and Pareto frontier advancement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central claims rest on a formal proof of topological reducibility of the zigzag sequence and algebraic annihilation of intermediates via operator dualities, followed by a Backward Error Analysis establishing an inherent curvature penalty in the discrete collapse. These are presented as independent mathematical results applied to the Probability Flow ODE setting, with the Z²-Sampling construction then shown to restore the 2-NFE baseline. No equations or steps reduce by construction to fitted inputs, self-citations, or renamed empirical patterns; the derivation chain remains self-contained with external mathematical content rather than tautological equivalence to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on the topological reducibility of zigzag sequences and the temporal coherence of the Probability Flow ODE for caching, both introduced without independent evidence in the abstract; the Temporal Semantic Surrogate is a new construct whose correctness is not externally validated here.

axioms (2)
  • ad hoc to paper The explicit zigzag sequence is topologically reducible.
    Invoked to justify algebraic annihilation of intermediate states.
  • domain assumption The Probability Flow ODE possesses sufficient temporal coherence to support a dynamically cached Temporal Semantic Surrogate without drift.
    Required to achieve zero additional NFE while preserving semantic exploration.
invented entities (1)
  • Temporal Semantic Surrogate no independent evidence
    purpose: Cached stand-in for semantic direction that enables implicit zigzag at baseline cost.
    New construct introduced to couple implicit collapse with temporal coherence.

pith-pipeline@v0.9.0 · 5596 in / 1594 out tokens · 72079 ms · 2026-05-08T06:34:33.494838+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 49 canonical work pages · 18 internal anchors

  1. [1]

    Zigzag diffusion sampling: Diffusion models can self-improve via self-reflection, 2024

    Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, and Zeke Xie. Zigzag diffusion sampling: Diffusion models can self-improve via self-reflection, 2024. URL https://arxiv.org/abs/2412.10891. 22

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. URLhttps://arxiv.org/abs/2311.15127

  3. [3]

    Sato: Stable text-to-motion framework

    Wenshuo chen, Hongru Xiao, Erhang Zhang, Lijie Hu, Lei Wang, Mengyuan Liu, and Chen Chen. Sato: Stable text-to-motion framework. InProceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 6989–6997. ACM, October 2024. doi: 10.1145/ 3664647.3681034. URLhttp://dx.doi.org/10.1145/3664647.3681034

  4. [4]

    Polaris: Projection-orthogonal least squares for robust and adaptive inversion in diffusion models, 2025

    Wenshuo Chen, Haosen Li, Shaofeng Liang, Lei Wang, Haozhe Jia, Kaishen Yuan, Jieming Wu, Bowen Tian, and Yutao Yue. Polaris: Projection-orthogonal least squares for robust and adaptive inversion in diffusion models, 2025. URLhttps://arxiv.org/abs/2512.00369

  5. [5]

    Ant: Adaptive neural temporal- aware text-to-motion model

    Wenshuo Chen, Kuimou Yu, Jia Haozhe, Kaishen Yuan, Zexu Huang, Bowen Tian, Songning Lai, Hongru Xiao, Erhang Zhang, Lei Wang, and Yutao Yue. Ant: Adaptive neural temporal- aware text-to-motion model. InProceedings of the 33rd ACM International Conference on Multimedia, MM ’25, page 9852–9861. ACM, October 2025. doi: 10.1145/3746027.3755168. URLhttp://dx.d...

  6. [6]

    arXiv preprint arXiv:2406.08070(2024)

    Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. Cfg++: Manifold-constrained classifier free guidance for diffusion models, 2024. URLhttps://arxiv. org/abs/2406.08070

  7. [7]

    Geneval: An object-focused frame- work for evaluating text-to-image alignment, 2023

    Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused frame- work for evaluating text-to-image alignment, 2023. URLhttps://arxiv.org/abs/2310. 11513

  8. [8]

    Springer, Berlin (2002)

    Ernst Hairer, Gerhard Wanner, and Christian Lubich.Backward Error Analysis and Struc- ture Preservation, pages 337–388. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-30666-5. doi: 10.1007/3-540-30666-8 9. URLhttps://doi.org/10.1007/ 3-540-30666-8_9

  9. [9]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URLhttps://arxiv. org/abs/2207.12598

  10. [10]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URLhttps://arxiv.org/abs/2006.11239

  11. [11]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models, 2022. URLhttps://arxiv.org/abs/2204.03458

  12. [12]

    Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention, 2024

    Susung Hong. Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention, 2024. URLhttps://arxiv.org/abs/2408.00760

  13. [13]

    Physics-informed representation alignment for sparse radio-map reconstruction, 2025

    Haozhe Jia, Wenshuo Chen, Zhihui Huang, Lei Wang, Hongru Xiao, Nanqian Jia, Keming Wu, Songning Lai, Bowen Tian, and Yutao Yue. Physics-informed representation alignment for sparse radio-map reconstruction, 2025. URLhttps://arxiv.org/abs/2501.19160

  14. [14]

    arXiv preprint arXiv:2310.01506 (2023)

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code, 2023. URLhttps://arxiv.org/abs/2310.01506. 23

  15. [15]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022. URLhttps://arxiv.org/abs/2206.00364

  16. [16]

    Kingma, Tim Salimans, Ben Poole, and Jonathan Ho

    Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models,

  17. [17]

    URLhttps://arxiv.org/abs/2107.00630

  18. [18]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023. URL https://arxiv.org/abs/2305.01569

  19. [19]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  20. [20]

    arXiv.csabs/2305.08891(2023) 1

    Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed, 2024. URLhttps://arxiv.org/abs/2305.08891

  21. [21]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

  22. [22]

    Tenenbaum

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models, 2023. URLhttps://arxiv.org/abs/ 2206.01714

  23. [23]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. URLhttps://arxiv.org/abs/2209.03003

  24. [24]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024. URL https://arxiv.org/abs/2402.17177

  25. [25]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps, 2022. URL https://arxiv.org/abs/2206.00927

  26. [26]

    Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, June 2025

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, June 2025. ISSN 2731-5398. doi: 10.1007/s11633-025-1562-4. URL http://dx.doi.org/10.1007/s11633-025-1562-4

  27. [27]

    Repaint: Inpainting using denoising diffusion probabilistic models, 2022

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models, 2022. URLhttps: //arxiv.org/abs/2201.09865

  28. [28]

    The lottery ticket hypothesis in denoising: Towards semantic-driven initialization, 2024

    Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. The lottery ticket hypothesis in denoising: Towards semantic-driven initialization, 2024. URLhttps://arxiv.org/abs/2312.08872. 24

  29. [29]

    Understanding ssim.arXiv preprint arXiv:2006.13846, 2020

    Jim Nilsson and Tomas Akenine-M¨ oller. Understanding ssim, 2020. URLhttps://arxiv. org/abs/2006.13846

  30. [30]

    A., and Ertugrul, I

    Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneˇ s, Wenshuo Chen, Albert Ali Salah, and Itir Onal Ertugrul. Dctdiff: Intriguing properties of image generative modeling in the dct space, 2025. URLhttps://arxiv.org/abs/2412.15032

  31. [31]

    Synthetic shifts to initial seed vector exposes the brittle nature of latent-based diffusion models, 2023

    Mao Po-Yuan, Shashank Kotyan, Tham Yik Foong, and Danilo Vasconcellos Vargas. Synthetic shifts to initial seed vector exposes the brittle nature of latent-based diffusion models, 2023. URLhttps://arxiv.org/abs/2312.11473

  32. [32]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M¨ uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URLhttps://arxiv.org/abs/2307.01952

  33. [33]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  34. [34]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨ orn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URLhttps://arxiv. org/abs/2112.10752

  35. [35]

    Align your steps: Optimizing sampling schedules in diffusion models, 2024

    Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your steps: Optimizing sampling schedules in diffusion models, 2024. URLhttps://arxiv.org/abs/2404.14507

  36. [36]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to- image diffusion models with deep language understanding, 2022. URLhttps://arxiv.org/ abs/2205.11487

  37. [37]

    Generating images of rare concepts using pre-trained diffusion models, 2023

    Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. Generating images of rare concepts using pre-trained diffusion models, 2023. URLhttps://arxiv.org/abs/ 2304.14530

  38. [38]

    arXiv preprint arXiv:2311.17042 (2023)

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation, 2023. URLhttps://arxiv.org/abs/2311.17042

  39. [39]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmar- czyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text model...

  40. [40]

    Rethinking the spatial inconsistency in classifier-free diffusion guidance, 2024

    Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, and Yu Liu. Rethinking the spatial inconsistency in classifier-free diffusion guidance, 2024. URLhttps://arxiv.org/abs/2404. 05384. 25

  41. [41]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make- a-video: Text-to-video generation without text-video data, 2022. URLhttps://arxiv.org/ abs/2209.14792

  42. [42]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022. URLhttps://arxiv.org/abs/2010.02502

  43. [43]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021. URLhttps://arxiv.org/abs/2011.13456

  44. [44]

    Diffusion model alignment using direct preference optimization, 2023

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023. URLhttps://arxiv.org/abs/2311.12908

  45. [45]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

  46. [46]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023. URLhttps://arxiv.org/abs/2212. 11565

  47. [47]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text- to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  48. [48]

    Imagereward: Learning and evaluating human preferences for text-to-image generation,

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation,

  49. [49]

    URLhttps://arxiv.org/abs/2304.05977

  50. [50]

    Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models, 2025

    Katherine Xu, Lingzhi Zhang, and Jianbo Shi. Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models, 2025. URLhttps://arxiv.org/abs/2405. 14828

  51. [51]

    Restart sampling for improving generative processes, 2023

    Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi Jaakkola. Restart sampling for improving generative processes, 2023. URLhttps://arxiv.org/abs/ 2306.14878

  52. [52]

    Text-to-image rectified flow as plug-and-play priors, 2025

    Xiaofeng Yang, Cheng Chen, Xulei Yang, Fayao Liu, and Guosheng Lin. Text-to-image rectified flow as plug-and-play priors, 2025. URLhttps://arxiv.org/abs/2406.03293

  53. [53]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  54. [54]

    Co- emogen: Towards semantically-coherent and scalable emotional image content generation,

    Kaishen Yuan, Yuting Zhang, Shang Gao, Yijie Zhu, Wenshuo Chen, and Yutao Yue. Co- emogen: Towards semantically-coherent and scalable emotional image content generation,

  55. [55]

    URLhttps://arxiv.org/abs/2508.03535. 26

  56. [56]

    Chronomagic-bench: A benchmark for meta- morphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

    Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Rui-Jie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for meta- morphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

  57. [57]

    The unrea- sonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

  58. [58]

    In: Advances in Neural Informa- tion Processing Systems (NeurIPS) (2023),https://arxiv.org/abs/2302.04867

    Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor- corrector framework for fast sampling of diffusion models, 2023. URLhttps://arxiv.org/ abs/2302.04867

  59. [59]

    Golden noise for diffusion models: A learning framework

    Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework, 2025. URLhttps://arxiv.org/ abs/2411.09502. 27