Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers
Pith reviewed 2026-07-01 05:33 UTC · model grok-4.3
The pith
A lightweight Bridge aligns mismatched latent spaces so modern diffusion teachers can train compact one-step students.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-space distillation becomes feasible when a lightweight Bridge maps student latents into teacher space; the Bridge freezes the student VAE decoder as a spatial prior, adds a compact learnable projector, and trains the pair on latent reconstruction plus attention fidelity so that modern teachers can supervise one-step students whose resolution and VAE differ from the teacher.
What carries the argument
The Bridge: a frozen Student VAE decoder paired with a compact learnable projector that maps between mismatched latent spaces using reconstruction and attention fidelity objectives.
If this is right
- Modern teachers can supervise older one-step students without forcing the student to adopt the teacher's VAE or resolution.
- One-step inference speed and broad ecosystem compatibility of the student remain unchanged after distillation.
- The same Bridge construction works across multiple modern teachers with different latent characteristics.
- Alignment objectives based on latent reconstruction and attention fidelity suffice to stabilize the mapping.
Where Pith is reading between the lines
- The Bridge could be reused as a modular adapter when newer teachers appear, avoiding full student retraining.
- Similar lightweight interfaces might address space mismatches in other generative tasks such as video or audio synthesis.
- If the projector is made even smaller, the method could support on-device distillation pipelines.
Load-bearing premise
A small projector trained only on reconstruction and attention fidelity can produce stable, useful alignment between student and teacher latent spaces even when their resolutions and VAE parameterizations differ.
What would settle it
Training the Bridge on a new teacher-student pair yields no measurable improvement in the student's downstream quality metrics compared with the same student trained without the teacher.
Figures
read the original abstract
Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: Teacher and Student must inhabit the same latent space. This Shared-Space constraint prevents knowledge transfer from modern high-capacity Teachers (e.g., SD 3.5 and Flux) into compact, deployment-friendly Students such as SD 1.5, whose latent resolution and VAE parameterization differ from the Teacher. We formalize this overlooked regime as Cross-Space Distillation, where Teacher and Student differ in both latent resolution and VAE space. To enable distillation under this mismatch, we introduce the Bridge, a lightweight latent interface that maps Student latents into the Teacher space without modifying the Student backbone. Bridge combines a frozen Student VAE decoder as a spatial prior with a compact learnable projector, and is trained with latent reconstruction and attention fidelity objectives for stable Teacher-space alignment. Across diverse modern Teachers, Bridge enables substantial gains for compact one-step Students; for example, it improves SD 1.5 from 5.4 to 9.4 HPSv3 while preserving one-step inference, low latency, and broad ecosystem compatibility. These results show that heterogeneous large Teachers can be distilled into efficient, deployable backbones through a lightweight latent-space interface.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formalizes Cross-Space Distillation as the regime in which modern high-capacity diffusion teachers (SD 3.5, Flux) and compact one-step students (SD 1.5) differ in both latent resolution and VAE parameterization, violating the shared-space assumption required by prior timestep-distillation methods. To enable transfer, the authors introduce the Bridge: a lightweight latent-space interface that freezes the student VAE decoder as a spatial prior, adds a compact learnable projector, and is trained solely on latent reconstruction plus attention-fidelity objectives. The central empirical claim is that this interface permits substantial quality gains (e.g., SD 1.5 HPSv3 rising from 5.4 to 9.4) while preserving one-step inference, low latency, and ecosystem compatibility across multiple modern teachers.
Significance. If the reported gains are reproducible and the alignment remains stable under resolution/VAE mismatch, the work would be significant: it removes a previously hard constraint that has limited distillation to matched or older models, thereby allowing state-of-the-art teachers to improve efficient, deployable students without architectural changes. The design choice of a frozen decoder prior plus explicit attention fidelity is a concrete, lightweight mechanism that could be adopted broadly.
major comments (2)
- [Abstract] Abstract: the headline performance numbers (SD 1.5 HPSv3 from 5.4 to 9.4) are load-bearing for the central claim, yet the manuscript supplies no experimental protocol, baseline comparisons, ablation studies, or error analysis; without these the data-to-claim link cannot be evaluated.
- [Abstract] Abstract, paragraph describing the Bridge: the claim that a frozen student VAE decoder plus compact projector, trained only on reconstruction and attention fidelity, produces stable and useful alignment when resolution and VAE parameterization differ is the key assumption enabling cross-space distillation, but no quantitative alignment metrics, failure-case analysis, or ablation on the frozen-decoder prior are provided to support stability.
minor comments (1)
- [Abstract] The abstract would benefit from a short statement of how many teachers and students were tested to substantiate the phrase 'across diverse modern Teachers.'
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency in the abstract. We will revise the abstract to better contextualize our claims while preserving its brevity, and we address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline performance numbers (SD 1.5 HPSv3 from 5.4 to 9.4) are load-bearing for the central claim, yet the manuscript supplies no experimental protocol, baseline comparisons, ablation studies, or error analysis; without these the data-to-claim link cannot be evaluated.
Authors: We agree that the abstract would benefit from additional context on evaluation. The full manuscript details the experimental protocol, baselines (including prior timestep-distillation methods under matched-space assumptions), ablations, and error analysis in the Experiments section. In revision we will add a concise clause to the abstract summarizing the evaluation setting and key baselines to strengthen the data-to-claim linkage without exceeding length limits. revision: yes
-
Referee: [Abstract] Abstract, paragraph describing the Bridge: the claim that a frozen student VAE decoder plus compact projector, trained only on reconstruction and attention fidelity, produces stable and useful alignment when resolution and VAE parameterization differ is the key assumption enabling cross-space distillation, but no quantitative alignment metrics, failure-case analysis, or ablation on the frozen-decoder prior are provided to support stability.
Authors: The manuscript quantifies alignment via reconstruction MSE and attention cosine similarity in the Bridge training subsection, with ablations on the frozen-decoder prior in the supplementary tables. Failure modes under extreme resolution mismatch are noted in the discussion. We will incorporate brief references to these metrics and the ablation results into the revised abstract paragraph on the Bridge to directly support the stability claim. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces an empirical engineering solution (the Bridge module) consisting of a frozen Student VAE decoder plus a learnable projector, trained explicitly on latent reconstruction and attention fidelity objectives to align mismatched latent spaces. The central claim of performance gains (e.g., SD 1.5 HPSv3 improvement) is presented as the outcome of this training and subsequent distillation experiments across heterogeneous Teachers. No equations, derivations, or self-citation chains are described that reduce the claimed results to quantities defined by the same fitted parameters or prior author work; the method is framed as an added trainable interface whose effectiveness is measured externally via standard metrics. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
invented entities (1)
-
The Bridge
no independent evidence
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2506.10035 (2025)
Cai, F., Guo, Y., Li, J., Li, W., Chen, J., Fang, X.: Fastflux: Pruning flux with block-wise replacement and sandwich training. arXiv preprint arXiv:2506.10035 (2025)
-
[2]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Chen, J., Hu, D., Huang, X., Coskun, H., Sahni, A., Gupta, A., Goyal, A., Lahiri, D., Singh, R., Idelbayev, Y., et al.: Snapgen: Taming high-resolution text-to-image models for mobile devices with efficient architectures and training. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7997–8008 (2025)
2025
-
[3]
In: European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2403.04692
Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to- image generation. In: European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2403.04692
-
[4]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Chen, J., Xue, S., Zhao, Y., Yu, J., Paul, S., Chen, J., Cai, H., Han, S., Xie, E.: Sana-sprint: One-step diffusion with continuous-time consistency distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16185–16195 (2025)
2025
-
[5]
Dao, Q., Metaxas, D.: Mpdit: Multi-patch global-to-local transformer architecture for efficient flow matching and diffusion model. arXiv preprint arXiv:2603.26357 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[6]
In: European Conference on Computer Vision
Dao, T., Nguyen, T.H., Le, T., Vu, D., Nguyen, K., Pham, C., Tran, A.: Swift- brush v2: Make your one-step diffusion model better than its teacher. In: European Conference on Computer Vision. pp. 176–192. Springer (2024)
2024
-
[7]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Dao, T.T., Vu, D.H., Pham, C., Tran, A.: Efhq: Multi-purpose extremepose-face- hq dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22605–22615 (2024)
2024
-
[8]
In: Forty-first international conference on machine learning
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning
-
[9]
In: Proceedings of the IEEE/CVF international conference on computer vision
Gandikota, R., Materzynska, J., Fiotto-Kaufman, J., Bau, D.: Erasing concepts from diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2426–2436 (2023)
2023
-
[10]
In: The Twelfth International Conference on Learning Represen- tations
Gu, Y., Dong, L., Wei, F., Huang, M.: Minillm: Knowledge distillation of large language models. In: The Twelfth International Conference on Learning Represen- tations
-
[11]
MiniLLM: On-Policy Distillation of Large Language Models
Gu, Y., Dong, L., Wei, F., Huang, M.: Minillm: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Advances in neural information processing systems33, 6840–6851 (2020)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)
2020
-
[13]
arXiv preprint arXiv:2601.08303 (2026)
Hu, D., Gupta, A., Gabidolla, M., Sahni, A., Coskun, H., Li, Y., Idelbayev, Y., Mahmood, A., Lebedev, A., Lahiri, D., et al.: Snapgen++: Unleashing diffusion transformers for efficient high-fidelity image generation on edge devices. arXiv preprint arXiv:2601.08303 (2026)
-
[14]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Hu, X., Wang, R., Fang, Y., et al.: Ella: Equip diffusion models with llm for en- hanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Elucidating the Design Space of Diffusion-Based Generative Models
Karras,T.,Aittala,M.,Aila,T.,Laine,S.:Elucidatingthedesignspaceofdiffusion- based generative models. In: Advances in Neural Information Processing Systems (NeurIPS) (2022),https://arxiv.org/abs/2206.00364 16 A. Nguyen et al
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[16]
Labs, B.F.: Announcing FLUX.1.https://blackforestlabs.ai/announcing- flux-1(2024), accessed: 2026-03-04
2024
-
[17]
Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025), accessed: 2026-03-04
2025
-
[18]
In: International Conference on Learn- ing Representations (ICLR) (2025),https : / / openreview
Lee, S., Xu, Y., Geffner, T., Fanti, G., Kreis, K., Vahdat, A., Nie, W.: Truncated consistency models. In: International Conference on Learn- ing Representations (ICLR) (2025),https : / / openreview . net / pdf / bb8f3dceac43037618899ff56c90995c5e08e978.pdf
2025
-
[19]
arXiv preprint arXiv:2403.11027 (2024).https://doi.org/10.48550/ arXiv.2403.11027
Li, J., Feng, W., Chen, W., Wang, W.Y.: Reward guided latent consistency dis- tillation. arXiv preprint arXiv:2403.11027 (2024).https://doi.org/10.48550/ arXiv.2403.11027
-
[20]
Advances in Neural Information Processing Systems36, 20662–20678 (2023)
Li, Y., Wang, H., Jin, Q., Hu, J., Chemerys, P., Fu, Y., Wang, Y., Tulyakov, S., Ren, J.: Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. Advances in Neural Information Processing Systems36, 20662–20678 (2023)
2023
-
[21]
In: Proceedings of the IEEE/CVF international conference on computer vision
Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)
2021
-
[22]
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
Lin, S., Wang, A., Yang, X.: Sdxl-lightning: Progressive adversarial diffusion dis- tillation. arXiv preprint arXiv:2402.13929 (2024).https://doi.org/10.48550/ arXiv.2402.13929,https://arxiv.org/abs/2402.13929
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Flow Matching for Generative Modeling
Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: International Conference on Learning Representations (ICLR) (2023),https://arxiv.org/abs/2210.02747
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022),https://arxiv. org/abs/2209.03003
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: Advances in Neural Information Processing Systems (NeurIPS) (2022),https://arxiv.org/ abs/2206.00927
-
[26]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency mod- els: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023),https://arxiv.org/abs/2310.04378
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Ma, J., Peng, Q., Guo, X., Chen, C., Lu, H., Yang, Z.: X2i: Seamless integration of multimodal understanding into diffusion transformer via attention distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 16733–16744 (October 2025)
2025
-
[28]
arXiv preprint arXiv:2508.03789 (2025)
Ma, Y., Wu, X., Sun, K., Li, H.: Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789 (2025)
-
[29]
arXiv preprint arXiv:2510.21250 (2025)
Nguyen, A., Nguyen, V., Vu, D., Dao, T., Tran, C., Tran, T., Tran, A.: Improved training technique for shortcut models. arXiv preprint arXiv:2510.21250 (2025). https://doi.org/10.48550/arXiv.2510.21250,https://arxiv.org/abs/2510. 21250, accepted at NeurIPS 2025
-
[30]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Nguyen, K., Tran, A., Pham, C.: Suma: A subspace mapping approach for robust and effective concept erasure in text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19587–19596 (2025)
2025
-
[31]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Nguyen, T.H., Tran, A.: Swiftbrush: One-step text-to-image diffusion model with variational score distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7807–7816 (2024) Cross-Space Distillation 17
2024
-
[32]
arXiv preprint arXiv:2403.18605 (2024)
Nguyen, T.T., Nguyen, D.A., Tran, A., Pham, C.: Flexedit: Flexible and control- lable diffusion-based object-centric image editing. arXiv preprint arXiv:2403.18605 (2024)
-
[33]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Nguyen, V., Nguyen, A., Dao, T., Nguyen, K., Pham, C., Tran, T., Tran, A.: Supercharged one-step text-to-image diffusion models with negative prompts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18004–18013 (2025)
2025
-
[34]
arXiv preprint arXiv:2511.05865 (2025)
Nguyen, V., Patel, V.M.: Cgce: Classifier-guided concept erasure in generative models. arXiv preprint arXiv:2511.05865 (2025)
-
[35]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Nguyen, V., Vu, G., Thanh, T.N., Than, K., Tran, T.: On inference stability for diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 14449–14456 (2024)
2024
-
[36]
In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum?id=di52zR8xgf
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum?id=di52zR8xgf
2024
-
[37]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684– 10695 (June 2022)
2022
-
[38]
Progressive Distillation for Fast Sampling of Diffusion Models
Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022),https://arxiv.org/abs/2202.00512
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[39]
In: SIGGRAPH Asia 2024 Conference Papers
Sauer, A., Boesel, F., Dockhorn, T., Blattmann, A., Esser, P., Rombach, R.: Fast high-resolution image synthesis with latent adversarial diffusion distillation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
2024
-
[41]
In: European Conference on Computer Vision
Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distilla- tion. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)
2024
-
[42]
Advances in neural information processing systems35, 25278–25294 (2022)
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large- scale dataset for training next generation image-text models. Advances in neural information processing systems35, 25278–25294 (2022)
2022
-
[43]
arXiv preprint arXiv:2506.02221 (2025), https://arxiv.org/abs/2506.02221
Schusterbauer, J., Gui, M., Fundel, F., Ommer, B.: Diff2flow: Training flow match- ing models via diffusion model alignment. arXiv preprint arXiv:2506.02221 (2025), https://arxiv.org/abs/2506.02221
-
[44]
Denoising Diffusion Implicit Models
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020),https://arxiv.org/abs/2010.02502
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[45]
Score-Based Generative Modeling through Stochastic Differential Equations
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. In: Interna- tional Conference on Learning Representations (ICLR) (2021),https://arxiv. org/abs/2011.13456
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[46]
arXiv preprint (2024),https://github.com/Kwai- Kolors/ Kolors
Team, K.: Kolors: Effective training of diffusion model for photorealistic text- to-image synthesis. arXiv preprint (2024),https://github.com/Kwai- Kolors/ Kolors
2024
-
[47]
Improving and generalizing flow-based generative models with minibatch optimal transport
Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., Bengio, Y.: Improving and generalizing flow-based generative mod- els with minibatch optimal transport. arXiv preprint arXiv:2302.00482 (2023), https://arxiv.org/abs/2302.00482 18 A. Nguyen et al
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Vu, D., Nguyen, A., Tran, C., Tran, A.: Anti-i2v: Safeguarding your photos from malicious image-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 37621–37631 (2026)
2026
-
[49]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Vu, D., Nguyen, K., Nguyen, T.T., Nguyen, N., Nguyen, P., Nguyen, K., Pham, C., Tran, A.: Inverfill: One-step inversion for enhanced few-step diffusion inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25677–25687 (2026)
2026
-
[50]
Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023),https: //arxiv.org/abs/2305.16213
-
[51]
Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Xie, E., Chen, J., Chen, J., Cai, H., Tang, H., Lin, Y., Zhang, Z., Li, M., Zhu, L., Lu, Y., et al.: Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
In: International Conference on Machine Learning
Xie, E., Chen, J., Zhao, Y., Yu, J., Zhu, L., Lin, Y., Zhang, Z., Li, M., Chen, J., Cai, H., et al.: Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. In: International Conference on Machine Learning. pp. 68578–68598. PMLR (2025)
2025
-
[54]
arXiv preprint arXiv:2406.05768 (2024).https://doi.org/10.48550/arXiv.2406.05768
Xie, Q., Liao, Z., Deng, Z., Chen, C., Lu, H.: Tlcm: Training-efficient latent consis- tency model for image generation with 2-8 steps. arXiv preprint arXiv:2406.05768 (2024).https://doi.org/10.48550/arXiv.2406.05768
-
[55]
In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagere- ward: Learning and evaluating human preferences for text-to-image generation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
2023
-
[56]
arXiv preprint arXiv:2502.15681 (2025),https://arxiv.org/abs/ 2502.15681
Xu, Y., Nie, W., Vahdat, A.: One-step diffusion models withf-divergence distribu- tion matching. arXiv preprint arXiv:2502.15681 (2025),https://arxiv.org/abs/ 2502.15681
-
[57]
In: NeurIPS (2024)
Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T.: Improved distribution matching distillation for fast image synthesis. In: NeurIPS (2024)
2024
-
[58]
In: CVPR (2024)
Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: CVPR (2024)
2024
-
[59]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Zhang, .: Learning multi-dimensional human preference for text-to-image gener- ation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
2024
-
[60]
arXiv preprint arXiv:2311.18158 (2023)
Zhang, Y., Hooi, B.: Hipa: Enabling one-step text-to-image diffusion models via high-frequency-promoting adaptation. arXiv preprint arXiv:2311.18158 (2023). https://doi.org/10.48550/arXiv.2311.18158,https://arxiv.org/abs/2311. 18158
-
[61]
In: European Conference on Computer Vision
Zhao, Y., Xu, Y., Xiao, Z., Jia, H., Hou, T.: Mobilediffusion: Instant text-to-image generation on mobile devices. In: European Conference on Computer Vision. pp. 225–242. Springer (2024)
2024
-
[62]
https://doi.org/10.48550/arXiv.2402.19159 Cross-Space Distillation 19
Zheng, J., Hu, M., Fan, Z., Wang, C., Ding, C., Tao, D., Cham, T.J.: Trajectory consistency distillation: Improved latent consistency distillation by semi-linear con- sistencyfunctionwithtrajectorymapping.arXivpreprintarXiv:2402.19159(2024). https://doi.org/10.48550/arXiv.2402.19159 Cross-Space Distillation 19
- [63]
-
[64]
Zhu, J., Wang, H., Su, M., Wang, Z., Wang, H.: Obs-diff: Accurate pruning for diffusion models in one-shot. arXiv preprint arXiv:2510.06751 (2025),https:// arxiv.org/abs/2510.06751 Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers – Supplementary Materials – Overview.This supplementary material expands the main paper with...
-
[65]
Such methods are effective for improving sharpness and real- ism, especially when standard distillation losses alone are not sufficient
introduces a discriminator to provide stronger perceptual guidance during one-step distillation, and LADD [39] extends this idea to latent high-resolution image synthesis. Such methods are effective for improving sharpness and real- ism, especially when standard distillation losses alone are not sufficient. However, most of these approaches assume that Te...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.