Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers

Anh Nguyen, Ngan Nguyen, Duc Vu, Trung Dao, Viet Nguyen, Quan Dao, Kien Nguyen, Chi Tran, Phong Nguyen, Khoi Nguyen, Cuong Pham, Dimitris Metaxas, Vishal M. Patel, Anh Tran

arxiv: 2606.32020 · v1 · pith:XWAHWAHBnew · submitted 2026-06-30 · 💻 cs.CV

Pith reviewed 2026-07-01 05:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords: diffusion models · knowledge distillation · latent space · one-step generation · image synthesis · VAE · model compression · cross-space transfer

The pith

A lightweight Bridge aligns mismatched latent spaces so modern diffusion teachers can train compact one-step students.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the shared-latent-space assumption in timestep distillation blocks transfer from high-capacity teachers such as SD 3.5 and Flux into compact students such as SD 1.5. It introduces the Bridge, a frozen-student-VAE-plus-projector module, to map student latents into teacher space using only reconstruction and attention-fidelity losses. Experiments show this interface raises SD 1.5 from 5.4 to 9.4 HPSv3 while leaving the student backbone, one-step speed, and ecosystem compatibility unchanged. The result demonstrates that heterogeneous teachers can be distilled into deployment-friendly backbones through a small latent-space adapter.

Core claim

Cross-space distillation becomes feasible when a lightweight Bridge maps student latents into teacher space; the Bridge freezes the student VAE decoder as a spatial prior, adds a compact learnable projector, and trains the pair on latent reconstruction plus attention fidelity so that modern teachers can supervise one-step students whose resolution and VAE differ from the teacher.

What carries the argument

The Bridge: a frozen Student VAE decoder paired with a compact learnable projector that maps between mismatched latent spaces using reconstruction and attention fidelity objectives.
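
One way to write down the objective this description implies, in notation that is ours rather than the paper's: let $D_S$ be the frozen Student VAE decoder, $P_\phi$ the learnable projector, $z_S$ a Student latent, and $z_T$ the Teacher-space latent it should align with. The composition order $P_\phi \circ D_S$, the squared-error form of the reconstruction term, the attention-map operator $A_T$, the unspecified discrepancy $d$, and the weight $\lambda$ are all assumptions made for illustration.

```latex
\begin{aligned}
B_\phi(z_S) &= P_\phi\bigl(D_S(z_S)\bigr), \qquad D_S \text{ frozen},\ \phi \text{ trainable} \\
\mathcal{L}_{\mathrm{rec}}(\phi) &= \bigl\lVert B_\phi(z_S) - z_T \bigr\rVert_2^2 \\
\mathcal{L}_{\mathrm{attn}}(\phi) &= d\bigl(A_T(B_\phi(z_S)),\ A_T(z_T)\bigr) \\
\mathcal{L}_{\mathrm{Bridge}}(\phi) &= \mathcal{L}_{\mathrm{rec}}(\phi) + \lambda\,\mathcal{L}_{\mathrm{attn}}(\phi)
\end{aligned}
```

In this sketch only $\phi$ receives gradients, which is consistent with the claim that the Student backbone is left unmodified.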

If this is right

Modern teachers can supervise older one-step students without forcing the student to adopt the teacher's VAE or resolution.
One-step inference speed and broad ecosystem compatibility of the student remain unchanged after distillation.
The same Bridge construction works across multiple modern teachers with different latent characteristics.
Alignment objectives based on latent reconstruction and attention fidelity suffice to stabilize the mapping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Bridge could be reused as a modular adapter when newer teachers appear, avoiding full student retraining.
Similar lightweight interfaces might address space mismatches in other generative tasks such as video or audio synthesis.
If the projector is made even smaller, the method could support on-device distillation pipelines.

Load-bearing premise

A small projector trained only on reconstruction and attention fidelity can produce stable, useful alignment between student and teacher latent spaces even when their resolutions and VAE parameterizations differ.

What would settle it

The claim would be undermined if training the Bridge on a new teacher-student pair yielded no measurable improvement in the student's downstream quality metrics compared with the same student trained without the teacher.

Figures

Figures reproduced from arXiv: 2606.32020 by Anh Nguyen, Anh Tran, Chi Tran, Cuong Pham, Dimitris Metaxas, Duc Vu, Khoi Nguyen, Kien Nguyen, Ngan Nguyen, Phong Nguyen, Quan Dao, Trung Dao, Viet Nguyen, Vishal M. Patel.

**Figure 1.** When Latents Don’t Match. (A) Existing distribution-based distillation methods rely on a Shared-Space constraint, assuming Teacher and Student share the same latent resolution and VAE. This prevents transferring knowledge from high-resolution teachers (e.g., 1024²) to compact students (e.g., 512²), as their latent tensors are inherently incompatible. (B) We formalize this setting as Cross-Space Distilla…

**Figure 2.** Bridge for Cross-Space Distillation. Left. Bridge Training: …

**Figure 3.** Qualitative Comparison of Cross-Space Distillation.

**Figure 4.** Qualitative comparisons. (Left) Pruning vs. Bridge distillation.
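
To make the mismatch in the Figure 1 caption concrete, here is a small sketch of the latent shapes involved. The 512² and 1024² image sizes come from the caption. The 8× spatial downsampling and the 4-channel (SD 1.5) versus 16-channel (SD 3.5, Flux) latents are commonly documented properties of those models' autoencoders that we assume here; they are not stated on this page. `LatentSpec` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentSpec:
    """Describes the latent tensor a VAE produces for a square image."""
    image_size: int   # input image side length in pixels
    downsample: int   # spatial downsampling factor of the VAE (assumed)
    channels: int     # number of latent channels (assumed)

    def shape(self):
        """Return the latent shape as (channels, height, width)."""
        side = self.image_size // self.downsample
        return (self.channels, side, side)


# Assumed configurations, for illustration only.
student = LatentSpec(image_size=512, downsample=8, channels=4)
teacher = LatentSpec(image_size=1024, downsample=8, channels=16)

print(student.shape())  # (4, 64, 64)
print(teacher.shape())  # (16, 128, 128)
```

Under these assumptions the two tensors differ in both channel count and spatial size, so a Teacher network cannot consume a Student latent directly. That is the gap the Bridge is meant to fill.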

Abstract

Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: Teacher and Student must inhabit the same latent space. This Shared-Space constraint prevents knowledge transfer from modern high-capacity Teachers (e.g., SD 3.5 and Flux) into compact, deployment-friendly Students such as SD 1.5, whose latent resolution and VAE parameterization differ from the Teacher. We formalize this overlooked regime as Cross-Space Distillation, where Teacher and Student differ in both latent resolution and VAE space. To enable distillation under this mismatch, we introduce the Bridge, a lightweight latent interface that maps Student latents into the Teacher space without modifying the Student backbone. Bridge combines a frozen Student VAE decoder as a spatial prior with a compact learnable projector, and is trained with latent reconstruction and attention fidelity objectives for stable Teacher-space alignment. Across diverse modern Teachers, Bridge enables substantial gains for compact one-step Students; for example, it improves SD 1.5 from 5.4 to 9.4 HPSv3 while preserving one-step inference, low latency, and broad ecosystem compatibility. These results show that heterogeneous large Teachers can be distilled into efficient, deployable backbones through a lightweight latent-space interface.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames cross-space distillation as a distinct problem and proposes a simple Bridge using a frozen student decoder plus projector to align mismatched latents, but the evidence for its effectiveness stays high-level.

The main point is that the authors treat the shared-latent-space requirement as a real blocker for distilling from newer teachers like SD 3.5 or Flux into older one-step students like SD 1.5, and they introduce the Bridge to work around it without changing the student backbone.

What is new is naming this regime Cross-Space Distillation and building the Bridge from a frozen student VAE decoder as spatial prior plus a compact projector, trained only on latent reconstruction and attention fidelity losses. The approach keeps one-step inference and ecosystem compatibility intact, which matches a practical need.

The paper does well at identifying a constraint that actually matters for deployment and at describing a lightweight interface that could let people reuse modern teachers on older backbones. The claimed lift from 5.4 to 9.4 HPSv3 on SD 1.5 is the kind of number that would matter if it holds.

The soft spots are that everything stays at the level of the abstract: no protocol details, no baseline tables, no ablations on the projector or the two training objectives, and no discussion of how well alignment survives large resolution or VAE differences. The central assumption—that reconstruction plus attention fidelity produces stable, useful Teacher-space alignment—looks plausible but is not yet shown to be robust. Without those checks the gains could be narrower than stated.

This is for engineers and researchers who need compact one-step models and want to borrow capacity from newer diffusion teachers. A reader working on efficient generation pipelines could test the Bridge idea quickly.

It deserves a serious referee because the problem is concrete, the proposed fix is simple, and the practical stakes are clear even if the current write-up needs more experimental grounding.

Referee Report

2 major / 1 minor

Summary. The manuscript formalizes Cross-Space Distillation as the regime in which modern high-capacity diffusion teachers (SD 3.5, Flux) and compact one-step students (SD 1.5) differ in both latent resolution and VAE parameterization, violating the shared-space assumption required by prior timestep-distillation methods. To enable transfer, the authors introduce the Bridge: a lightweight latent-space interface that freezes the student VAE decoder as a spatial prior, adds a compact learnable projector, and is trained solely on latent reconstruction plus attention-fidelity objectives. The central empirical claim is that this interface permits substantial quality gains (e.g., SD 1.5 HPSv3 rising from 5.4 to 9.4) while preserving one-step inference, low latency, and ecosystem compatibility across multiple modern teachers.

Significance. If the reported gains are reproducible and the alignment remains stable under resolution/VAE mismatch, the work would be significant: it removes a previously hard constraint that has limited distillation to matched or older models, thereby allowing state-of-the-art teachers to improve efficient, deployable students without architectural changes. The design choice of a frozen decoder prior plus explicit attention fidelity is a concrete, lightweight mechanism that could be adopted broadly.

major comments (2)

[Abstract] Abstract: the headline performance numbers (SD 1.5 HPSv3 from 5.4 to 9.4) are load-bearing for the central claim, yet the manuscript supplies no experimental protocol, baseline comparisons, ablation studies, or error analysis; without these the data-to-claim link cannot be evaluated.
[Abstract] Abstract, paragraph describing the Bridge: the claim that a frozen student VAE decoder plus compact projector, trained only on reconstruction and attention fidelity, produces stable and useful alignment when resolution and VAE parameterization differ is the key assumption enabling cross-space distillation, but no quantitative alignment metrics, failure-case analysis, or ablation on the frozen-decoder prior are provided to support stability.

minor comments (1)

[Abstract] The abstract would benefit from a short statement of how many teachers and students were tested to substantiate the phrase 'across diverse modern Teachers.'

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in the abstract. We will revise the abstract to better contextualize our claims while preserving its brevity, and we address each major comment below.

Referee: [Abstract] Abstract: the headline performance numbers (SD 1.5 HPSv3 from 5.4 to 9.4) are load-bearing for the central claim, yet the manuscript supplies no experimental protocol, baseline comparisons, ablation studies, or error analysis; without these the data-to-claim link cannot be evaluated.

Authors: We agree that the abstract would benefit from additional context on evaluation. The full manuscript details the experimental protocol, baselines (including prior timestep-distillation methods under matched-space assumptions), ablations, and error analysis in the Experiments section. In revision we will add a concise clause to the abstract summarizing the evaluation setting and key baselines to strengthen the data-to-claim linkage without exceeding length limits. revision: yes
Referee: [Abstract] Abstract, paragraph describing the Bridge: the claim that a frozen student VAE decoder plus compact projector, trained only on reconstruction and attention fidelity, produces stable and useful alignment when resolution and VAE parameterization differ is the key assumption enabling cross-space distillation, but no quantitative alignment metrics, failure-case analysis, or ablation on the frozen-decoder prior are provided to support stability.

Authors: The manuscript quantifies alignment via reconstruction MSE and attention cosine similarity in the Bridge training subsection, with ablations on the frozen-decoder prior in the supplementary tables. Failure modes under extreme resolution mismatch are noted in the discussion. We will incorporate brief references to these metrics and the ablation results into the revised abstract paragraph on the Bridge to directly support the stability claim. revision: yes
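
The two alignment metrics named in the simulated response above, reconstruction MSE and attention cosine similarity, can be sketched in a few lines. This is a minimal illustration on flat lists of floats. How the paper flattens, averages, or weights these quantities is not given on this page, and the function names are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def alignment_report(projected, target, attn_projected, attn_target):
    """Bundle both metrics for one (flattened) latent and attention map."""
    return {
        "recon_mse": mse(projected, target),
        "attn_cosine": cosine_similarity(attn_projected, attn_target),
    }
```

A lower `recon_mse` and an `attn_cosine` closer to 1 would both indicate tighter Teacher-space alignment under this reading of the metrics.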

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces an empirical engineering solution (the Bridge module) consisting of a frozen Student VAE decoder plus a learnable projector, trained explicitly on latent reconstruction and attention fidelity objectives to align mismatched latent spaces. The central claim of performance gains (e.g., SD 1.5 HPSv3 improvement) is presented as the outcome of this training and subsequent distillation experiments across heterogeneous Teachers. No equations, derivations, or self-citation chains are described that reduce the claimed results to quantities defined by the same fitted parameters or prior author work; the method is framed as an added trainable interface whose effectiveness is measured externally via standard metrics. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the newly introduced Bridge module. No explicit free parameters beyond the learnable projector weights are named. The training objectives constitute domain assumptions about what constitutes useful alignment.

invented entities (1)

The Bridge · no independent evidence
purpose: Lightweight latent interface mapping Student latents into Teacher space using frozen Student VAE decoder plus learnable projector
New component introduced to solve the cross-space mismatch; no independent evidence outside the paper is provided in the abstract.

Reference graph

Works this paper leans on

64 extracted references · 31 canonical work pages · 14 internal anchors

[1]

arXiv preprint arXiv:2506.10035 (2025)

Cai, F., Guo, Y., Li, J., Li, W., Chen, J., Fang, X.: Fastflux: Pruning flux with block-wise replacement and sandwich training. arXiv preprint arXiv:2506.10035 (2025)

work page arXiv 2025
[2]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Chen, J., Hu, D., Huang, X., Coskun, H., Sahni, A., Gupta, A., Goyal, A., Lahiri, D., Singh, R., Idelbayev, Y., et al.: Snapgen: Taming high-resolution text-to-image models for mobile devices with efficient architectures and training. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7997–8008 (2025)

2025
[3]

In: European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2403.04692

Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to- image generation. In: European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2403.04692

work page arXiv 2024
[4]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Chen, J., Xue, S., Zhao, Y., Yu, J., Paul, S., Chen, J., Cai, H., Han, S., Xie, E.: Sana-sprint: One-step diffusion with continuous-time consistency distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16185–16195 (2025)

2025
[5]

MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

Dao, Q., Metaxas, D.: Mpdit: Multi-patch global-to-local transformer architecture for efficient flow matching and diffusion model. arXiv preprint arXiv:2603.26357 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

In: European Conference on Computer Vision

Dao, T., Nguyen, T.H., Le, T., Vu, D., Nguyen, K., Pham, C., Tran, A.: Swift- brush v2: Make your one-step diffusion model better than its teacher. In: European Conference on Computer Vision. pp. 176–192. Springer (2024)

2024
[7]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Dao, T.T., Vu, D.H., Pham, C., Tran, A.: Efhq: Multi-purpose extremepose-face- hq dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22605–22615 (2024)

2024
[8]

In: Forty-first international conference on machine learning

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning
[9]

In: Proceedings of the IEEE/CVF international conference on computer vision

Gandikota, R., Materzynska, J., Fiotto-Kaufman, J., Bau, D.: Erasing concepts from diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2426–2436 (2023)

2023
[10]

In: The Twelfth International Conference on Learning Represen- tations

Gu, Y., Dong, L., Wei, F., Huang, M.: Minillm: Knowledge distillation of large language models. In: The Twelfth International Conference on Learning Represen- tations
[11]

MiniLLM: On-Policy Distillation of Large Language Models

Gu, Y., Dong, L., Wei, F., Huang, M.: Minillm: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Advances in neural information processing systems33, 6840–6851 (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

2020
[13]

arXiv preprint arXiv:2601.08303 (2026)

Hu, D., Gupta, A., Gabidolla, M., Sahni, A., Coskun, H., Li, Y., Idelbayev, Y., Mahmood, A., Lebedev, A., Lahiri, D., et al.: Snapgen++: Unleashing diffusion transformers for efficient high-fidelity image generation on edge devices. arXiv preprint arXiv:2601.08303 (2026)

work page arXiv 2026
[14]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Hu, X., Wang, R., Fang, Y., et al.: Ella: Equip diffusion models with llm for en- hanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Elucidating the Design Space of Diffusion-Based Generative Models

Karras,T.,Aittala,M.,Aila,T.,Laine,S.:Elucidatingthedesignspaceofdiffusion- based generative models. In: Advances in Neural Information Processing Systems (NeurIPS) (2022),https://arxiv.org/abs/2206.00364 16 A. Nguyen et al

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Labs, B.F.: Announcing FLUX.1.https://blackforestlabs.ai/announcing- flux-1(2024), accessed: 2026-03-04

2024
[17]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025), accessed: 2026-03-04

2025
[18]

In: International Conference on Learn- ing Representations (ICLR) (2025),https : / / openreview

Lee, S., Xu, Y., Geffner, T., Fanti, G., Kreis, K., Vahdat, A., Nie, W.: Truncated consistency models. In: International Conference on Learn- ing Representations (ICLR) (2025),https : / / openreview . net / pdf / bb8f3dceac43037618899ff56c90995c5e08e978.pdf

2025
[19]

arXiv preprint arXiv:2403.11027 (2024).https://doi.org/10.48550/ arXiv.2403.11027

Li, J., Feng, W., Chen, W., Wang, W.Y.: Reward guided latent consistency dis- tillation. arXiv preprint arXiv:2403.11027 (2024).https://doi.org/10.48550/ arXiv.2403.11027

work page arXiv 2024
[20]

Advances in Neural Information Processing Systems36, 20662–20678 (2023)

Li, Y., Wang, H., Jin, Q., Hu, J., Chemerys, P., Fu, Y., Wang, Y., Tulyakov, S., Ren, J.: Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. Advances in Neural Information Processing Systems36, 20662–20678 (2023)

2023
[21]

In: Proceedings of the IEEE/CVF international conference on computer vision

Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)

2021
[22]

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Lin, S., Wang, A., Yang, X.: Sdxl-lightning: Progressive adversarial diffusion dis- tillation. arXiv preprint arXiv:2402.13929 (2024).https://doi.org/10.48550/ arXiv.2402.13929,https://arxiv.org/abs/2402.13929

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: International Conference on Learning Representations (ICLR) (2023),https://arxiv.org/abs/2210.02747

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022),https://arxiv. org/abs/2209.03003

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

In: Advances in Neural Information Processing Systems (NeurIPS) (2022),https://arxiv.org/ abs/2206.00927

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: Advances in Neural Information Processing Systems (NeurIPS) (2022),https://arxiv.org/ abs/2206.00927

work page arXiv 2022
[26]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency mod- els: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023),https://arxiv.org/abs/2310.04378

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Ma, J., Peng, Q., Guo, X., Chen, C., Lu, H., Yang, Z.: X2i: Seamless integration of multimodal understanding into diffusion transformer via attention distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 16733–16744 (October 2025)

2025
[28]

arXiv preprint arXiv:2508.03789 (2025)

Ma, Y., Wu, X., Sun, K., Li, H.: Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789 (2025)

work page arXiv 2025
[29]

arXiv preprint arXiv:2510.21250 (2025)

Nguyen, A., Nguyen, V., Vu, D., Dao, T., Tran, C., Tran, T., Tran, A.: Improved training technique for shortcut models. arXiv preprint arXiv:2510.21250 (2025). https://doi.org/10.48550/arXiv.2510.21250,https://arxiv.org/abs/2510. 21250, accepted at NeurIPS 2025

work page doi:10.48550/arxiv.2510.21250 2025
[30]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Nguyen, K., Tran, A., Pham, C.: Suma: A subspace mapping approach for robust and effective concept erasure in text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19587–19596 (2025)

2025
[31]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Nguyen, T.H., Tran, A.: Swiftbrush: One-step text-to-image diffusion model with variational score distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7807–7816 (2024) Cross-Space Distillation 17

2024
[32]

arXiv preprint arXiv:2403.18605 (2024)

Nguyen, T.T., Nguyen, D.A., Tran, A., Pham, C.: Flexedit: Flexible and control- lable diffusion-based object-centric image editing. arXiv preprint arXiv:2403.18605 (2024)

work page arXiv 2024
[33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Nguyen, V., Nguyen, A., Dao, T., Nguyen, K., Pham, C., Tran, T., Tran, A.: Supercharged one-step text-to-image diffusion models with negative prompts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18004–18013 (2025)

2025
[34]

arXiv preprint arXiv:2511.05865 (2025)

Nguyen, V., Patel, V.M.: Cgce: Classifier-guided concept erasure in generative models. arXiv preprint arXiv:2511.05865 (2025)

work page arXiv 2025
[35]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Nguyen, V., Vu, G., Thanh, T.N., Than, K., Tran, T.: On inference stability for diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 14449–14456 (2024)

2024
[36]

In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum?id=di52zR8xgf

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum?id=di52zR8xgf

2024
[37]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684– 10695 (June 2022)

2022
[38]

Progressive Distillation for Fast Sampling of Diffusion Models

Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022),https://arxiv.org/abs/2202.00512

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

In: SIGGRAPH Asia 2024 Conference Papers

Sauer, A., Boesel, F., Dockhorn, T., Blattmann, A., Esser, P., Rombach, R.: Fast high-resolution image synthesis with latent adversarial diffusion distillation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

2024
[41]

In: European Conference on Computer Vision

Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distilla- tion. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)

2024
[42]

Advances in neural information processing systems35, 25278–25294 (2022)

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large- scale dataset for training next generation image-text models. Advances in neural information processing systems35, 25278–25294 (2022)

2022
[43]

arXiv preprint arXiv:2506.02221 (2025), https://arxiv.org/abs/2506.02221

Schusterbauer, J., Gui, M., Fundel, F., Ommer, B.: Diff2flow: Training flow match- ing models via diffusion model alignment. arXiv preprint arXiv:2506.02221 (2025), https://arxiv.org/abs/2506.02221

work page arXiv 2025
[44]

Denoising Diffusion Implicit Models

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020),https://arxiv.org/abs/2010.02502

work page internal anchor Pith review Pith/arXiv arXiv 2010
[45]

Score-Based Generative Modeling through Stochastic Differential Equations

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. In: Interna- tional Conference on Learning Representations (ICLR) (2021),https://arxiv. org/abs/2011.13456

work page internal anchor Pith review Pith/arXiv arXiv 2021
[46]

arXiv preprint (2024),https://github.com/Kwai- Kolors/ Kolors

Team, K.: Kolors: Effective training of diffusion model for photorealistic text- to-image synthesis. arXiv preprint (2024),https://github.com/Kwai- Kolors/ Kolors

2024
[47]

Improving and generalizing flow-based generative models with minibatch optimal transport

Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., Bengio, Y.: Improving and generalizing flow-based generative mod- els with minibatch optimal transport. arXiv preprint arXiv:2302.00482 (2023), https://arxiv.org/abs/2302.00482 18 A. Nguyen et al

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Vu, D., Nguyen, A., Tran, C., Tran, A.: Anti-i2v: Safeguarding your photos from malicious image-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37621–37631 (2026)

2026
[49]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Vu, D., Nguyen, K., Nguyen, T.T., Nguyen, N., Nguyen, P., Nguyen, K., Pham, C., Tran, A.: Inverfill: One-step inversion for enhanced few-step diffusion inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25677–25687 (2026)

2026
[50]

In: Advances in Neural Information Processing Systems (NeurIPS) (2023),https: //arxiv.org/abs/2305.16213

Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023),https: //arxiv.org/abs/2305.16213

work page arXiv 2023
[51]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., et al.: Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

In: International Conference on Machine Learning

Xie, E., Chen, J., Zhao, Y., Yu, J., Zhu, L., Lin, Y., Zhang, Z., Li, M., Chen, J., Cai, H., et al.: Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. In: International Conference on Machine Learning. pp. 68578–68598. PMLR (2025)

2025
[54]

arXiv preprint arXiv:2406.05768 (2024).https://doi.org/10.48550/arXiv.2406.05768

Xie, Q., Liao, Z., Deng, Z., Chen, C., Lu, H.: Tlcm: Training-efficient latent consis- tency model for image generation with 2-8 steps. arXiv preprint arXiv:2406.05768 (2024).https://doi.org/10.48550/arXiv.2406.05768

work page doi:10.48550/arxiv.2406.05768 2024
[55]

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagereward: Learning and evaluating human preferences for text-to-image generation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

[56]

Xu, Y., Nie, W., Vahdat, A.: One-step diffusion models with f-divergence distribution matching. arXiv preprint arXiv:2502.15681 (2025), https://arxiv.org/abs/2502.15681

[57]

Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T.: Improved distribution matching distillation for fast image synthesis. In: NeurIPS (2024)

[58]

Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: CVPR (2024)

[59]

Zhang, .: Learning multi-dimensional human preference for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

[60]

Zhang, Y., Hooi, B.: Hipa: Enabling one-step text-to-image diffusion models via high-frequency-promoting adaptation. arXiv preprint arXiv:2311.18158 (2023). https://doi.org/10.48550/arXiv.2311.18158, https://arxiv.org/abs/2311.18158

[61]

Zhao, Y., Xu, Y., Xiao, Z., Jia, H., Hou, T.: Mobilediffusion: Instant text-to-image generation on mobile devices. In: European Conference on Computer Vision. pp. 225–242. Springer (2024)

[62]

Zheng, J., Hu, M., Fan, Z., Wang, C., Ding, C., Tao, D., Cham, T.J.: Trajectory consistency distillation: Improved latent consistency distillation by semi-linear consistency function with trajectory mapping. arXiv preprint arXiv:2402.19159 (2024). https://doi.org/10.48550/arXiv.2402.19159

[63]

Zhou, M., Wang, Z., Zheng, H., Huang, H.: Guided score identity distillation for data-free one-step text-to-image generation (2024), https://arxiv.org/abs/2406.01561

[64]

Zhu, J., Wang, H., Su, M., Wang, Z., Wang, H.: Obs-diff: Accurate pruning for diffusion models in one-shot. arXiv preprint arXiv:2510.06751 (2025), https://arxiv.org/abs/2510.06751

Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers – Supplementary Materials –

Overview. This supplementary material expands the main paper with...

[65]

introduces a discriminator to provide stronger perceptual guidance during one-step distillation, and LADD [39] extends this idea to latent high-resolution image synthesis. Such methods are effective for improving sharpness and realism, especially when standard distillation losses alone are not sufficient. However, most of these approaches assume that Te...

[1]

Cai, F., Guo, Y., Li, J., Li, W., Chen, J., Fang, X.: Fastflux: Pruning flux with block-wise replacement and sandwich training. arXiv preprint arXiv:2506.10035 (2025)

[2]

Chen, J., Hu, D., Huang, X., Coskun, H., Sahni, A., Gupta, A., Goyal, A., Lahiri, D., Singh, R., Idelbayev, Y., et al.: Snapgen: Taming high-resolution text-to-image models for mobile devices with efficient architectures and training. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7997–8008 (2025)

[3]

Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In: European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2403.04692

[4]

Chen, J., Xue, S., Zhao, Y., Yu, J., Paul, S., Chen, J., Cai, H., Han, S., Xie, E.: Sana-sprint: One-step diffusion with continuous-time consistency distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16185–16195 (2025)

[5]

Dao, Q., Metaxas, D.: Mpdit: Multi-patch global-to-local transformer architecture for efficient flow matching and diffusion model. arXiv preprint arXiv:2603.26357 (2026)

[6]

Dao, T., Nguyen, T.H., Le, T., Vu, D., Nguyen, K., Pham, C., Tran, A.: Swiftbrush v2: Make your one-step diffusion model better than its teacher. In: European Conference on Computer Vision. pp. 176–192. Springer (2024)

[7]

Dao, T.T., Vu, D.H., Pham, C., Tran, A.: Efhq: Multi-purpose extremepose-face-hq dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22605–22615 (2024)

[8]

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning

[9]

Gandikota, R., Materzynska, J., Fiotto-Kaufman, J., Bau, D.: Erasing concepts from diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2426–2436 (2023)

[10]

Gu, Y., Dong, L., Wei, F., Huang, M.: Minillm: Knowledge distillation of large language models. In: The Twelfth International Conference on Learning Representations

[11]

Gu, Y., Dong, L., Wei, F., Huang, M.: Minillm: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543 (2023)

[12]

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

[13]

Hu, D., Gupta, A., Gabidolla, M., Sahni, A., Coskun, H., Li, Y., Idelbayev, Y., Mahmood, A., Lebedev, A., Lahiri, D., et al.: Snapgen++: Unleashing diffusion transformers for efficient high-fidelity image generation on edge devices. arXiv preprint arXiv:2601.08303 (2026)

[14]

Hu, X., Wang, R., Fang, Y., et al.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

[15]

Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. In: Advances in Neural Information Processing Systems (NeurIPS) (2022), https://arxiv.org/abs/2206.00364

[16]

Labs, B.F.: Announcing FLUX.1. https://blackforestlabs.ai/announcing-flux-1 (2024), accessed: 2026-03-04

[17]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2 (2025), accessed: 2026-03-04

[18]

Lee, S., Xu, Y., Geffner, T., Fanti, G., Kreis, K., Vahdat, A., Nie, W.: Truncated consistency models. In: International Conference on Learning Representations (ICLR) (2025), https://openreview.net/pdf/bb8f3dceac43037618899ff56c90995c5e08e978.pdf

[19]

Li, J., Feng, W., Chen, W., Wang, W.Y.: Reward guided latent consistency distillation. arXiv preprint arXiv:2403.11027 (2024). https://doi.org/10.48550/arXiv.2403.11027

[20]

Li, Y., Wang, H., Jin, Q., Hu, J., Chemerys, P., Fu, Y., Wang, Y., Tulyakov, S., Ren, J.: Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. Advances in Neural Information Processing Systems 36, 20662–20678 (2023)

[21]

Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)

[22]

Lin, S., Wang, A., Yang, X.: Sdxl-lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929 (2024). https://doi.org/10.48550/arXiv.2402.13929, https://arxiv.org/abs/2402.13929

[23]

Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: International Conference on Learning Representations (ICLR) (2023), https://arxiv.org/abs/2210.02747

[24]

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022), https://arxiv.org/abs/2209.03003

[25]

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: Advances in Neural Information Processing Systems (NeurIPS) (2022), https://arxiv.org/abs/2206.00927

[26]

Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023), https://arxiv.org/abs/2310.04378

[27]

Ma, J., Peng, Q., Guo, X., Chen, C., Lu, H., Yang, Z.: X2i: Seamless integration of multimodal understanding into diffusion transformer via attention distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 16733–16744 (October 2025)

[28]

Ma, Y., Wu, X., Sun, K., Li, H.: Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789 (2025)

[29]

Nguyen, A., Nguyen, V., Vu, D., Dao, T., Tran, C., Tran, T., Tran, A.: Improved training technique for shortcut models. arXiv preprint arXiv:2510.21250 (2025). https://doi.org/10.48550/arXiv.2510.21250, https://arxiv.org/abs/2510.21250, accepted at NeurIPS 2025

[30]

Nguyen, K., Tran, A., Pham, C.: Suma: A subspace mapping approach for robust and effective concept erasure in text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19587–19596 (2025)

[31]

Nguyen, T.H., Tran, A.: Swiftbrush: One-step text-to-image diffusion model with variational score distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7807–7816 (2024)

[32]

Nguyen, T.T., Nguyen, D.A., Tran, A., Pham, C.: Flexedit: Flexible and controllable diffusion-based object-centric image editing. arXiv preprint arXiv:2403.18605 (2024)

[33]

Nguyen, V., Nguyen, A., Dao, T., Nguyen, K., Pham, C., Tran, T., Tran, A.: Supercharged one-step text-to-image diffusion models with negative prompts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18004–18013 (2025)

[34]

Nguyen, V., Patel, V.M.: Cgce: Classifier-guided concept erasure in generative models. arXiv preprint arXiv:2511.05865 (2025)

[35]

Nguyen, V., Vu, G., Thanh, T.N., Than, K., Tran, T.: On inference stability for diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 14449–14456 (2024)

[36]

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=di52zR8xgf

[37]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022)

[38]

Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022), https://arxiv.org/abs/2202.00512

[39]

Sauer, A., Boesel, F., Dockhorn, T., Blattmann, A., Esser, P., Rombach, R.: Fast high-resolution image synthesis with latent adversarial diffusion distillation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

[41]

Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)

[42]

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems 35, 25278–25294 (2022)

[43]

Schusterbauer, J., Gui, M., Fundel, F., Ommer, B.: Diff2flow: Training flow matching models via diffusion model alignment. arXiv preprint arXiv:2506.02221 (2025), https://arxiv.org/abs/2506.02221

[44]

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020), https://arxiv.org/abs/2010.02502

[45]

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (ICLR) (2021), https://arxiv.org/abs/2011.13456

[46]

Team, K.: Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint (2024), https://github.com/Kwai-Kolors/Kolors

[47]

Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., Bengio, Y.: Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482 (2023), https://arxiv.org/abs/2302.00482

[48]

Vu, D., Nguyen, A., Tran, C., Tran, A.: Anti-i2v: Safeguarding your photos from malicious image-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37621–37631 (2026)

[49]

Vu, D., Nguyen, K., Nguyen, T.T., Nguyen, N., Nguyen, P., Nguyen, K., Pham, C., Tran, A.: Inverfill: One-step inversion for enhanced few-step diffusion inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25677–25687 (2026)

[50]

Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023), https://arxiv.org/abs/2305.16213
