arxiv: 2604.23536 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Z²-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

Haosen Li, Kaishen Yuan, Lei Wang, Shaofeng Liang, Wenshuo Chen, Yutao Yue

Pith reviewed 2026-05-08 06:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion modelszigzag samplingsemantic alignmentclassifier-free guidancezero-cost samplingprobability flow ODEcurvature penaltybackward error analysis

0 comments

The pith

Diffusion models gain curvature-aware semantic alignment from zero-cost zigzag trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard classifier-free guidance in diffusion models uses only instantaneous gradients and overlooks the curvature of the data manifold. Zigzag sampling improves alignment by traversing forward-backward paths but at three times the normal cost and with potential drift errors. The paper proves that these paths can be implicitly reduced by algebraically eliminating intermediate states through operator dualities, removing off-manifold errors. Z²-Sampling then pairs this reduction with a cached temporal semantic surrogate based on the probability flow ODE to keep the cost at the standard two neural function evaluations. Backward error analysis shows this process naturally creates a directional derivative curvature penalty, and tests confirm better results across different models and media without extra computation.

Core claim

Z²-Sampling establishes that the explicit zigzag sequence is topologically reducible, allowing algebraic annihilation of intermediate states via operator dualities to create implicit Z-sampling that eliminates approximation errors. By exploiting the temporal coherence of the Probability Flow ODE with a dynamically cached Temporal Semantic Surrogate, this restores the standard 2-NFE baseline while the discrete collapse inherently synthesizes a directional derivative curvature penalty, as proven by backward error analysis.

What carries the argument

The implicit algebraic collapse of zigzag trajectories via operator dualities combined with a cached Temporal Semantic Surrogate from the Probability Flow ODE, which together synthesize the curvature penalty at zero added cost.

If this is right

Restores the standard 2-NFE baseline while delivering improved semantic alignment.
Applies across U-Net and DiT architectures for both image and video generation.
Maintains orthogonality with other alignment methods such as AYS and Diffusion-DPO.
Structurally improves the performance-efficiency Pareto frontier.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The algebraic reduction might apply to other multi-step ODE samplers to lower their overhead in similar ways.
The synthesized curvature penalty could enhance stability in generation tasks where manifold geometry is critical.
Direct integration into existing diffusion pipelines could raise output quality at no added runtime cost.

Load-bearing premise

The explicit zigzag sequence is topologically reducible and intermediate states can be algebraically annihilated via operator dualities without introducing cumulative drift from the true marginal distribution or requiring additional corrections.

What would settle it

A side-by-side run of Z²-Sampling against explicit Z-Sampling at identical 2-NFE budgets, measuring whether semantic alignment scores match or exceed the explicit version without added drift, would falsify the claim if no improvement appears.

Figures

Figures reproduced from arXiv: 2604.23536 by Haosen Li, Kaishen Yuan, Lei Wang, Shaofeng Liang, Wenshuo Chen, Yutao Yue.

**Figure 1.** Figure 1: Visualizing the structural paradigm shift: Explicit Z-Sampling (middle) incurs a 3 view at source ↗

**Figure 2.** Figure 2: Methodological comparison of explicit Z-Sampling and Z 2 - Sampling. Top: Explicit Z-Sampling traverses a lookahead-and-return path, requiring three CFG evaluations (6 NFEs per step). The intermediate state is pushed off the true data manifold, introducing a persistent spatial approximation error τ (t) during inversion. Bottom: Z 2 -Sampling algebraically collapses the intermediate traversal and reu… view at source ↗

**Figure 3.** Figure 3: Cosine similarity across denoising steps. Empirical evaluation across a large number of samples reveals a consistent pattern: video generation requires a 10-step warmup phase to achieve semantic stability, whereas image generation converges rapidly in just 5 steps. Proof. Expanding Φt and substituting x˜t from Theorem 1: x new t−1 = At xt − Ctγ1∆ϵ t θ (xt) + Btϵ˜t = Atxt − (AtCt)γ1∆ϵ t θ (xt) + Btϵ˜t (… view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of text-to-image generation. From top to bottom: Standard view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of video generation. In each scenario, the top row displays the view at source ↗

**Figure 6.** Figure 6: Shattering the Performance-Efficiency Pareto Frontier. Both evaluations use HPS v2 on Pick-a-Pic. (a) Robustness to the active zigzag diffusion steps (λ). The vertical axis represents the head-to-head winning rate against the Standard Sampling baseline. Starting from a 50% tie at λ = 0 (where the process degenerates into standard sampling), the generation quality improves monotonically as λ increases, indi… view at source ↗

**Figure 7.** Figure 7: Visual validation of the curvature penalty. At extreme guidance levels, standard CFG suffers from severe oversaturation. Explicit Z-Sampling mitigates this but introduces spatial drift artifacts. Z 2 -Sampling strictly internalizes the curvature penalty (−hγ1∇∆vv) without offmanifold error, successfully suppressing the oversaturation caused by implicit double guidance (2γ1) to produce high-fidelity result… view at source ↗

**Figure 8.** Figure 8: Qualitative comparison across varying guidance scales (γ1). The leftmost column displays the Standard Sampling baseline. The subsequent columns, from left to right, illustrate the results of our proposed Z 2 -Sampling with γ1 ∈ {1, 2, 3, 4, }. As γ1 increases, Z 2 -Sampling progressively deepens semantic alignment and visual richness. Crucially, by algebraically eliminating off-manifold spatial drift, it … view at source ↗

**Figure 9.** Figure 9: Failure cases of the proposed method. E Failure Case Analysis While Z 2 -Sampling structurally eliminates the off-manifold spatial drift τ (t) and significantly mitigates the color oversaturation artifacts inherent in explicit Z-Sampling, it is not entirely immune to the stochastic challenges of diffusion manifolds. As illustrated in view at source ↗

read the original abstract

Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce $Z^2$-Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE's temporal coherence, $Z^2$-Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that $Z^2$-Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Z²-Sampling collapses explicit zigzags to an implicit algebraic form with a cached surrogate to hit zero extra cost, but the no-drift claim in discrete steps is the part that needs direct verification.

read the letter

The paper's main move is to take the explicit forward-backward trajectories of Z-Sampling, which improve semantic alignment in CFG by probing manifold curvature, and reduce them to an implicit algebraic version that runs at the standard two NFE cost. They claim the zigzag sequence is topologically reducible so that intermediate states annihilate via operator dualities, removing off-manifold truncation errors. A dynamically cached temporal semantic surrogate then exploits the probability flow ODE's temporal coherence to keep the whole thing at baseline cost. The backward error analysis is used to argue that this discrete collapse itself produces a directional derivative curvature penalty. They report that the result holds across U-Nets and DiTs for images and video and combines with other alignment methods without conflict. This algebraic reduction plus the surrogate cache is the concrete new piece relative to prior zigzag work, and if the derivations are sound it cleanly removes the triple-cost drawback while keeping the alignment benefit. The evaluations are described as extensive, which is a positive if the tables show consistent gains without hidden fitting. The soft spot sits exactly on the reduction step. The assertion that annihilation occurs without cumulative drift from the true marginal depends on the operator dualities holding exactly in the finite discrete setting. In practice, non-commuting terms from step truncation can leave residuals, and it is not obvious whether the surrogate cache cancels them or merely masks them. The paper states rigorous proofs, but the strength of the zero-cost and inherent-penalty claims rests on those specific error terms and commutation relations being shown explicitly. Without seeing the derivations or ablations that isolate the curvature effect, the Pareto-frontier language remains hard to weigh. This is for people working on diffusion sampling and CFG variants who need cheap alignment improvements. A practitioner implementing new samplers would get direct value from testing the surrogate cache and the claimed orthogonality. It deserves peer review because the target problem is real, the proposed mechanism is specific, and the claims are falsifiable enough for referees to check the math and numbers directly.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes $Z^2$-Sampling (Zero-cost Zigzag Sampling) for diffusion models, claiming that explicit zigzag trajectories are topologically reducible so that intermediate states can be algebraically annihilated via operator dualities. Combined with a dynamically cached Temporal Semantic Surrogate exploiting Probability Flow ODE coherence, the method restores the standard 2-NFE baseline while inheriting a directional-derivative curvature penalty, as formally shown by backward error analysis. Extensive experiments across U-Nets, DiTs, image and video modalities are said to demonstrate that the approach shatters the performance-efficiency Pareto frontier and remains orthogonal to frameworks such as AYS and Diffusion-DPO.

Significance. If the topological reduction and backward-error claims hold without residual drift, the work would be significant: it would eliminate the NFE overhead of prior curvature-aware samplers while improving semantic alignment, thereby establishing a new efficiency baseline for text-aligned generation.

major comments (3)

[Proof of topological reducibility (§3)] The central claim (abstract and §3) that the explicit zigzag sequence is topologically reducible and that intermediate states can be algebraically annihilated via operator dualities without cumulative drift from the true marginal distribution is load-bearing for both the zero-cost assertion and the subsequent backward-error result. In the discrete PF-ODE setting, exact commutation or duality relations typically hold only in the continuous-time limit; finite-step truncation and off-manifold evaluations can leave nonzero residual terms that the Temporal Semantic Surrogate cache may not cancel, undermining the claim that no additional fitted corrections are required.
[Backward Error Analysis (§4)] §4 (Backward Error Analysis): the statement that the discrete collapse 'inherently synthesizes a directional derivative curvature penalty' must be shown to be independent of the particular definition of the Temporal Semantic Surrogate. If the penalty term arises only after the surrogate is introduced, the result is not an intrinsic property of the collapse and the 'parameter-free' character asserted in the abstract is compromised.
[Empirical evaluations (§5–6)] Empirical section (likely §5–6): the claim that $Z^2$-Sampling 'structurally shatters the performance-efficiency Pareto frontier' and restores exact 2-NFE behavior requires tabulated quantitative results (FID, CLIP score, etc.) with error bars, direct comparison against the unmodified 2-NFE baseline, and ablation of the surrogate cache. Without these, the cross-architecture and cross-modality universality statements cannot be evaluated.

minor comments (2)

[Abstract] The abstract asserts 'rigorous proofs' yet contains no equations, derivations, or theorem statements; the reader must reach the body to locate the actual mathematical content.
[Notation and definitions] Notation: 'Temporal Semantic Surrogate' is introduced without an early, self-contained mathematical definition; its precise functional form and caching rule should be stated before the backward-error argument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their detailed and insightful comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have made revisions to clarify the theoretical claims and strengthen the empirical evidence.

read point-by-point responses

Referee: [Proof of topological reducibility (§3)] The central claim (abstract and §3) that the explicit zigzag sequence is topologically reducible and that intermediate states can be algebraically annihilated via operator dualities without cumulative drift from the true marginal distribution is load-bearing for both the zero-cost assertion and the subsequent backward-error result. In the discrete PF-ODE setting, exact commutation or duality relations typically hold only in the continuous-time limit; finite-step truncation and off-manifold evaluations can leave nonzero residual terms that the Temporal Semantic Surrogate cache may not cancel, undermining the claim that no additional fitted corrections are required.

Authors: We thank the referee for highlighting this important aspect of the discrete setting. In Section 3, our proof of topological reducibility relies on the exact commutation properties derived from the Probability Flow ODE's structure, which ensures that the operator dualities annihilate intermediates precisely even in finite steps, without residual drift. The Temporal Semantic Surrogate is introduced later for efficiency and does not affect the annihilation. To further address concerns about truncation, we have added a detailed remark in the revised manuscript explaining why off-manifold errors are eliminated by the algebraic collapse, supported by the backward error framework. This maintains the zero-cost and parameter-free claims. revision: partial
Referee: [Backward Error Analysis (§4)] §4 (Backward Error Analysis): the statement that the discrete collapse 'inherently synthesizes a directional derivative curvature penalty' must be shown to be independent of the particular definition of the Temporal Semantic Surrogate. If the penalty term arises only after the surrogate is introduced, the result is not an intrinsic property of the collapse and the 'parameter-free' character asserted in the abstract is compromised.

Authors: The backward error analysis in Section 4 is performed on the discrete collapse operation alone. We explicitly derive the directional derivative penalty from the error expansion of the annihilation step, which is independent of the surrogate cache. The surrogate is merely a computational optimization exploiting temporal coherence and does not contribute to the penalty term. We have revised the text to emphasize this separation and included an additional equation showing the penalty derivation without reference to the cache. revision: partial
Referee: [Empirical evaluations (§5–6)] Empirical section (likely §5–6): the claim that $Z^2$-Sampling 'structurally shatters the performance-efficiency Pareto frontier' and restores exact 2-NFE behavior requires tabulated quantitative results (FID, CLIP score, etc.) with error bars, direct comparison against the unmodified 2-NFE baseline, and ablation of the surrogate cache. Without these, the cross-architecture and cross-modality universality statements cannot be evaluated.

Authors: We agree that providing detailed quantitative results is crucial for validating the claims. In the revised version, we have added comprehensive tables in Sections 5 and 6 reporting FID, CLIP scores, and other metrics with error bars (standard deviation over multiple seeds) for Z²-Sampling compared to the standard 2-NFE baseline across U-Net and DiT architectures, as well as image and video modalities. Additionally, we include an ablation study isolating the effect of the Temporal Semantic Surrogate cache. These results confirm the performance improvements and Pareto frontier advancement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central claims rest on a formal proof of topological reducibility of the zigzag sequence and algebraic annihilation of intermediates via operator dualities, followed by a Backward Error Analysis establishing an inherent curvature penalty in the discrete collapse. These are presented as independent mathematical results applied to the Probability Flow ODE setting, with the Z²-Sampling construction then shown to restore the 2-NFE baseline. No equations or steps reduce by construction to fitted inputs, self-citations, or renamed empirical patterns; the derivation chain remains self-contained with external mathematical content rather than tautological equivalence to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on the topological reducibility of zigzag sequences and the temporal coherence of the Probability Flow ODE for caching, both introduced without independent evidence in the abstract; the Temporal Semantic Surrogate is a new construct whose correctness is not externally validated here.

axioms (2)

ad hoc to paper The explicit zigzag sequence is topologically reducible.
Invoked to justify algebraic annihilation of intermediate states.
domain assumption The Probability Flow ODE possesses sufficient temporal coherence to support a dynamically cached Temporal Semantic Surrogate without drift.
Required to achieve zero additional NFE while preserving semantic exploration.

invented entities (1)

Temporal Semantic Surrogate no independent evidence
purpose: Cached stand-in for semantic direction that enables implicit zigzag at baseline cost.
New construct introduced to couple implicit collapse with temporal coherence.

pith-pipeline@v0.9.0 · 5596 in / 1594 out tokens · 72079 ms · 2026-05-08T06:34:33.494838+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 49 canonical work pages · 18 internal anchors

[1]

Zigzag diffusion sampling: Diffusion models can self-improve via self-reflection, 2024

Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, and Zeke Xie. Zigzag diffusion sampling: Diffusion models can self-improve via self-reflection, 2024. URL https://arxiv.org/abs/2412.10891. 22

work page arXiv 2024
[2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. URLhttps://arxiv.org/abs/2311.15127

work page internal anchor Pith review arXiv 2023
[3]

Sato: Stable text-to-motion framework

Wenshuo chen, Hongru Xiao, Erhang Zhang, Lijie Hu, Lei Wang, Mengyuan Liu, and Chen Chen. Sato: Stable text-to-motion framework. InProceedings of the 32nd ACM International Conference on Multimedia, MM ’24, page 6989–6997. ACM, October 2024. doi: 10.1145/ 3664647.3681034. URLhttp://dx.doi.org/10.1145/3664647.3681034

work page doi:10.1145/3664647.3681034 2024
[4]

Polaris: Projection-orthogonal least squares for robust and adaptive inversion in diffusion models, 2025

Wenshuo Chen, Haosen Li, Shaofeng Liang, Lei Wang, Haozhe Jia, Kaishen Yuan, Jieming Wu, Bowen Tian, and Yutao Yue. Polaris: Projection-orthogonal least squares for robust and adaptive inversion in diffusion models, 2025. URLhttps://arxiv.org/abs/2512.00369

work page arXiv 2025
[5]

Ant: Adaptive neural temporal- aware text-to-motion model

Wenshuo Chen, Kuimou Yu, Jia Haozhe, Kaishen Yuan, Zexu Huang, Bowen Tian, Songning Lai, Hongru Xiao, Erhang Zhang, Lei Wang, and Yutao Yue. Ant: Adaptive neural temporal- aware text-to-motion model. InProceedings of the 33rd ACM International Conference on Multimedia, MM ’25, page 9852–9861. ACM, October 2025. doi: 10.1145/3746027.3755168. URLhttp://dx.d...

work page doi:10.1145/3746027.3755168 2025
[6]

arXiv preprint arXiv:2406.08070(2024)

Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. Cfg++: Manifold-constrained classifier free guidance for diffusion models, 2024. URLhttps://arxiv. org/abs/2406.08070

work page arXiv 2024
[7]

Geneval: An object-focused frame- work for evaluating text-to-image alignment, 2023

Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused frame- work for evaluating text-to-image alignment, 2023. URLhttps://arxiv.org/abs/2310. 11513

2023
[8]

Springer, Berlin (2002)

Ernst Hairer, Gerhard Wanner, and Christian Lubich.Backward Error Analysis and Struc- ture Preservation, pages 337–388. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-30666-5. doi: 10.1007/3-540-30666-8 9. URLhttps://doi.org/10.1007/ 3-540-30666-8_9

work page doi:10.1007/3-540-30666-8 2006
[9]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URLhttps://arxiv. org/abs/2207.12598

work page internal anchor Pith review arXiv 2022
[10]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URLhttps://arxiv.org/abs/2006.11239

work page internal anchor Pith review arXiv 2020
[11]

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models, 2022. URLhttps://arxiv.org/abs/2204.03458

work page internal anchor Pith review arXiv 2022
[12]

Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention, 2024

Susung Hong. Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention, 2024. URLhttps://arxiv.org/abs/2408.00760

work page arXiv 2024
[13]

Physics-informed representation alignment for sparse radio-map reconstruction, 2025

Haozhe Jia, Wenshuo Chen, Zhihui Huang, Lei Wang, Hongru Xiao, Nanqian Jia, Keming Wu, Songning Lai, Bowen Tian, and Yutao Yue. Physics-informed representation alignment for sparse radio-map reconstruction, 2025. URLhttps://arxiv.org/abs/2501.19160

work page arXiv 2025
[14]

arXiv preprint arXiv:2310.01506 (2023)

Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code, 2023. URLhttps://arxiv.org/abs/2310.01506. 23

work page arXiv 2023
[15]

Elucidating the Design Space of Diffusion-Based Generative Models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022. URLhttps://arxiv.org/abs/2206.00364

work page internal anchor Pith review arXiv 2022
[16]

Kingma, Tim Salimans, Ben Poole, and Jonathan Ho

Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models,
[17]

URLhttps://arxiv.org/abs/2107.00630

work page arXiv
[18]

Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023. URL https://arxiv.org/abs/2305.01569

work page arXiv 2023
[19]

Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

2024
[20]

arXiv.csabs/2305.08891(2023) 1

Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed, 2024. URLhttps://arxiv.org/abs/2305.08891

work page arXiv 2024
[21]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URLhttps://arxiv.org/abs/2210.02747

work page internal anchor Pith review arXiv 2023
[22]

Tenenbaum

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models, 2023. URLhttps://arxiv.org/abs/ 2206.01714

work page arXiv 2023
[23]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. URLhttps://arxiv.org/abs/2209.03003

work page internal anchor Pith review arXiv 2022
[24]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024. URL https://arxiv.org/abs/2402.17177

work page internal anchor Pith review arXiv 2024
[25]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps, 2022. URL https://arxiv.org/abs/2206.00927

work page arXiv 2022
[26]

Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, June 2025

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.Machine Intelligence Re- search, 22(4):730–751, June 2025. ISSN 2731-5398. doi: 10.1007/s11633-025-1562-4. URL http://dx.doi.org/10.1007/s11633-025-1562-4

work page doi:10.1007/s11633-025-1562-4 2025
[27]

Repaint: Inpainting using denoising diffusion probabilistic models, 2022

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models, 2022. URLhttps: //arxiv.org/abs/2201.09865

work page arXiv 2022
[28]

The lottery ticket hypothesis in denoising: Towards semantic-driven initialization, 2024

Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. The lottery ticket hypothesis in denoising: Towards semantic-driven initialization, 2024. URLhttps://arxiv.org/abs/2312.08872. 24

work page arXiv 2024
[29]

Understanding ssim.arXiv preprint arXiv:2006.13846, 2020

Jim Nilsson and Tomas Akenine-M¨ oller. Understanding ssim, 2020. URLhttps://arxiv. org/abs/2006.13846

work page arXiv 2020
[30]

A., and Ertugrul, I

Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneˇ s, Wenshuo Chen, Albert Ali Salah, and Itir Onal Ertugrul. Dctdiff: Intriguing properties of image generative modeling in the dct space, 2025. URLhttps://arxiv.org/abs/2412.15032

work page arXiv 2025
[31]

Synthetic shifts to initial seed vector exposes the brittle nature of latent-based diffusion models, 2023

Mao Po-Yuan, Shashank Kotyan, Tham Yik Foong, and Danilo Vasconcellos Vargas. Synthetic shifts to initial seed vector exposes the brittle nature of latent-based diffusion models, 2023. URLhttps://arxiv.org/abs/2312.11473

work page arXiv 2023
[32]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M¨ uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URLhttps://arxiv.org/abs/2307.01952

work page internal anchor Pith review arXiv 2023
[33]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

work page internal anchor Pith review arXiv 2021
[34]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨ orn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URLhttps://arxiv. org/abs/2112.10752

work page Pith review arXiv 2022
[35]

Align your steps: Optimizing sampling schedules in diffusion models, 2024

Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your steps: Optimizing sampling schedules in diffusion models, 2024. URLhttps://arxiv.org/abs/2404.14507

work page arXiv 2024
[36]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to- image diffusion models with deep language understanding, 2022. URLhttps://arxiv.org/ abs/2205.11487

work page internal anchor Pith review arXiv 2022
[37]

Generating images of rare concepts using pre-trained diffusion models, 2023

Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. Generating images of rare concepts using pre-trained diffusion models, 2023. URLhttps://arxiv.org/abs/ 2304.14530

work page arXiv 2023
[38]

arXiv preprint arXiv:2311.17042 (2023)

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation, 2023. URLhttps://arxiv.org/abs/2311.17042

work page arXiv 2023
[39]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmar- czyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text model...

work page internal anchor Pith review arXiv 2022
[40]

Rethinking the spatial inconsistency in classifier-free diffusion guidance, 2024

Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, and Yu Liu. Rethinking the spatial inconsistency in classifier-free diffusion guidance, 2024. URLhttps://arxiv.org/abs/2404. 05384. 25

2024
[41]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make- a-video: Text-to-video generation without text-video data, 2022. URLhttps://arxiv.org/ abs/2209.14792

work page internal anchor Pith review arXiv 2022
[42]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022. URLhttps://arxiv.org/abs/2010.02502

work page internal anchor Pith review arXiv 2022
[43]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021. URLhttps://arxiv.org/abs/2011.13456

work page internal anchor Pith review arXiv 2021
[44]

Diffusion model alignment using direct preference optimization, 2023

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023. URLhttps://arxiv.org/abs/2311.12908

work page arXiv 2023
[45]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

work page internal anchor Pith review arXiv 2023
[46]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023. URLhttps://arxiv.org/abs/2212. 11565

2023
[47]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text- to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

work page internal anchor Pith review arXiv 2023
[48]

Imagereward: Learning and evaluating human preferences for text-to-image generation,

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation,
[49]

URLhttps://arxiv.org/abs/2304.05977

work page arXiv
[50]

Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models, 2025

Katherine Xu, Lingzhi Zhang, and Jianbo Shi. Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models, 2025. URLhttps://arxiv.org/abs/2405. 14828

2025
[51]

Restart sampling for improving generative processes, 2023

Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi Jaakkola. Restart sampling for improving generative processes, 2023. URLhttps://arxiv.org/abs/ 2306.14878

work page arXiv 2023
[52]

Text-to-image rectified flow as plug-and-play priors, 2025

Xiaofeng Yang, Cheng Chen, Xulei Yang, Fayao Liu, and Guosheng Lin. Text-to-image rectified flow as plug-and-play priors, 2025. URLhttps://arxiv.org/abs/2406.03293

work page arXiv 2025
[53]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review arXiv 2024
[54]

Co- emogen: Towards semantically-coherent and scalable emotional image content generation,

Kaishen Yuan, Yuting Zhang, Shang Gao, Yijie Zhu, Wenshuo Chen, and Yutao Yue. Co- emogen: Towards semantically-coherent and scalable emotional image content generation,
[55]

URLhttps://arxiv.org/abs/2508.03535. 26

work page arXiv
[56]

Chronomagic-bench: A benchmark for meta- morphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Rui-Jie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for meta- morphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

2024
[57]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018
[58]

In: Advances in Neural Informa- tion Processing Systems (NeurIPS) (2023),https://arxiv.org/abs/2302.04867

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor- corrector framework for fast sampling of diffusion models, 2023. URLhttps://arxiv.org/ abs/2302.04867

work page arXiv 2023
[59]

Golden noise for diffusion models: A learning framework

Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework, 2025. URLhttps://arxiv.org/ abs/2411.09502. 27

work page arXiv 2025