pith. machine review for the scientific record.

arxiv: 2605.03317 · v1 · submitted 2026-05-05 · 💻 cs.CV · cs.AI

Recognition: unknown

AHPA: Adaptive Hierarchical Prior Alignment for Diffusion Transformers

Ruibin Min, Yexin Liu, Aimin Pan, Changsheng Lu, Jiafei Wu, Kelu Yao, Xiaogang Xu, Harry Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 01:25 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords diffusion transformers · representation alignment · adaptive hierarchical priors · VAE features · denoising trajectory · dynamic routing · training acceleration

The pith

A timestep-conditioned router selects multi-level VAE features to match the changing supervision needs of diffusion transformers during denoising.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that fixed alignment targets in diffusion transformer training create a mismatch because the right level of detail in supervision shifts as the model moves from high-noise to low-noise regimes. Coarse semantic guidance helps early on while fine spatial detail matters later, so a single static prior cannot serve the whole trajectory. AHPA extracts several layers of features from a frozen VAE encoder and lets a learned router, conditioned on the current timestep, pick and blend the appropriate prior at each step. This produces faster convergence and better final image quality while adding nothing to inference time and avoiding any external encoder during training. A reader would care because diffusion models remain expensive to train, and reusing the VAE's existing hierarchy offers a lightweight way to supply better-matched guidance throughout the process.

Core claim

The central claim is that a timestep-conditioned Dynamic Router can extract and weight complementary hierarchical features from the frozen VAE encoder to supply alignment targets whose granularity automatically tracks the model's evolving needs along the denoising trajectory, thereby removing the representational mismatch imposed by any fixed single-level supervisor.

What carries the argument

The timestep-conditioned Dynamic Router that adaptively selects and weights multi-level features from the frozen VAE encoder to keep alignment granularity in step with the current signal-to-noise ratio.
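Concretely, this mechanism admits a small sketch. The NumPy code below is a minimal, hypothetical rendering: the MLP shape, the single softmax over levels (the paper's Rϕ produces separate inter-group β and intra-group α weights), and the pre-projected feature shapes are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class DynamicRouter:
    """Hypothetical two-layer MLP mapping a timestep t in [0, 1] to
    softmax weights over K hierarchical VAE feature levels. Widths and
    initialization are illustrative only."""
    def __init__(self, num_levels, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.5, size=(1, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.5, size=(hidden, num_levels))
        self.b2 = np.zeros(num_levels)

    def __call__(self, t):
        # t: (B,) timesteps -> (B, K) weights summing to 1 per sample
        h = np.tanh(t[:, None] @ self.W1 + self.b1)
        return softmax(h @ self.W2 + self.b2)

def blended_target(vae_feats, weights):
    """Blend K frozen VAE feature maps (each assumed pre-projected to a
    shared (B, D) shape) with per-sample router weights (B, K)."""
    stacked = np.stack(vae_feats, axis=1)               # (B, K, D)
    return (weights[:, :, None] * stacked).sum(axis=1)  # (B, D)
```

The blended target would then serve as the alignment supervisor for the DiT's projected hidden states; the natural static control replaces `router(t)` with one fixed weight vector shared across all timesteps.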

If this is right

  • Training converges faster because each denoising stage receives supervision at the granularity it currently needs.
  • Final image quality improves without any added computation at inference time.
  • Training requires no external vision encoders or additional labeled supervision sources.
  • The VAE's native hierarchy supplies the full range of priors from local geometry to semantic layout.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same router principle could be tested on other progressive refinement tasks such as video or 3D diffusion models where detail requirements also shift by stage.
  • One could replace the VAE hierarchy with a different multi-scale encoder and measure whether the router still learns useful stage-specific weighting.
  • The method opens a path to fully internal, parameter-free supervision schedules that might reduce reliance on any fixed external teacher across generative training.

Load-bearing premise

The useful level of representational detail needed for effective supervision changes systematically as noise decreases along the denoising trajectory.
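This premise can be made numerically concrete under the linear interpolant used by SiT-style models (an assumption here; the review only states that granularity should track SNR): with x_t = (1 − t)·x₀ + t·ε, the signal-to-noise ratio SNR(t) = ((1 − t)/t)² climbs steeply as t → 0.

```python
import numpy as np

def snr_linear_interpolant(t):
    """SNR of x_t = (1 - t) * x0 + t * eps (SiT-style linear flow,
    assumed for illustration): signal power (1 - t)^2 over noise
    power t^2."""
    t = np.asarray(t, dtype=float)
    return ((1.0 - t) / t) ** 2

# Denoising runs from t = 1 (pure noise) toward t = 0 (clean data).
ts = np.array([0.9, 0.5, 0.1])
snrs = snr_linear_interpolant(ts)   # roughly [0.012, 1.0, 81.0]
assert np.all(np.diff(snrs) > 0)    # SNR grows monotonically as t -> 0
```

The several-orders-of-magnitude spread is the quantitative face of the premise: supervision tuned for the high-noise regime has little reason to remain optimal at the low-noise end.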

What would settle it

A controlled comparison in which a well-tuned static single-level VAE alignment or external-encoder baseline matches or exceeds AHPA on convergence speed and sample quality across several model sizes, datasets, and timestep schedules.

Figures

Figures reproduced from arXiv: 2605.03317 by Aimin Pan, Changsheng Lu, Harry Yang, Jiafei Wu, Kelu Yao, Ruibin Min, Xiaogang Xu, Yexin Liu.

Figure 1. High-fidelity image generated by SiT-XL/2 with AHPA alignment.
Figure 2. Quantifying the non-stationary alignment requirement. (a) Diagnostic probing reveals that static baselines suffer from catastrophic fidelity degradation as t → 0. (b) Hierarchical features (Gdeep and Gmid) exhibit significant phase complementarity. (c) AHPA adaptively bridges these gaps to maintain a high-level G-SNR envelope throughout the full trajectory.
Figure 3. Overview of AHPA. AHPA extracts multi-scale hierarchical priors from a frozen VAE encoder. A timestep-conditioned dynamic router Rϕ adaptively schedules these priors (α, β) to align with the DiT backbone's evolving needs, incurring zero inference overhead.
Figure 4. Group-averaged feature visualization via PCA. From top to bottom: original images, middle-level blueprints (Gmid), and deep-level blueprints (Gdeep). The visualization demonstrates the consistent structural-to-semantic abstraction across diverse categories.
Figure 5. Mechanistic analysis of AHPA's dynamic routing policy. (Left) The inter-group weights β exhibit a clear transition from semantic anchoring to structural refinement as t → 0. (Middle, right) The intra-group weights α show the microscopic selection of hierarchical layers within each functional group. Results are obtained from SiT-XL/2 after 400k training iterations on ImageNet.
read the original abstract

Representation alignment has recently emerged as an effective paradigm for accelerating Diffusion Transformer training. Despite their success, existing alignment methods typically impose a fixed supervision target or a fixed alignment granularity throughout the entire denoising trajectory, whether the guidance is provided by external vision encoders, internal self-representations, or VAE-derived features. We argue that such timestep-agnostic alignment is suboptimal because the useful granularity of representation supervision changes systematically with the signal-to-noise ratio. In high-noise regimes, diffusion models benefit more from coarse semantic and layout-level anchoring, whereas in low-noise regimes, the training signal should emphasize spatially detailed and structurally faithful refinement. This non-stationary alignment behavior creates a representational mismatch for static single-level supervisors. To address this issue, we propose Adaptive Hierarchical Prior Alignment (AHPA), a lightweight alignment framework that exploits the hierarchical representations naturally embedded in the frozen VAE encoder. Instead of using only a single compressed latent as the alignment target, AHPA extracts multi-level VAE features that provide complementary priors ranging from local geometry and spatial topology to coarse semantic layout. A timestep-conditioned Dynamic Router adaptively selects and weights these hierarchical priors along the denoising trajectory, thereby synchronizing the alignment granularity with the model's evolving training needs. Extensive experiments show that AHPA improves convergence and generation quality over baselines and incurs no additional inference cost while avoiding external encoder supervision during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Adaptive Hierarchical Prior Alignment (AHPA) for Diffusion Transformers. It argues that fixed-granularity alignment (from external encoders, self-representations, or single VAE latents) is suboptimal because the useful level of representation supervision varies with signal-to-noise ratio: coarse layout and semantics help at high noise while fine spatial details matter at low noise. AHPA extracts multi-level features from a frozen VAE encoder and introduces a timestep-conditioned Dynamic Router that adaptively selects and weights these hierarchical priors along the denoising trajectory. The abstract states that extensive experiments demonstrate improved convergence and generation quality over baselines, with no added inference cost and without external encoder supervision during training.

Significance. If the empirical gains are reproducible and attributable to the adaptive mechanism rather than richer static supervision alone, AHPA would provide a lightweight, training-only improvement for DiT models that leverages existing VAE hierarchies without external models or inference overhead. This could be useful for accelerating training in resource-constrained settings. The approach receives credit for avoiding external supervision and maintaining inference efficiency, though its impact hinges on isolating the router's adaptivity.

major comments (2)
  1. [Experiments] Experiments section: The reported results do not include an ablation comparing the full dynamic router against a static (timestep-independent) weighted combination of the same multi-level VAE features. Without this control, it remains unclear whether the claimed gains require the timestep-adaptive weighting or could be obtained from non-adaptive multi-level supervision, undermining the central claim that the router synchronizes granularity with SNR.
  2. [Method] Method section (Dynamic Router description): The paper does not report statistics on router behavior (e.g., how often it selects each hierarchical level as a function of timestep or noise level). If router outputs are nearly constant across the trajectory, the adaptivity is not load-bearing and the improvement reduces to using richer VAE features.
minor comments (2)
  1. [Abstract] Abstract and §4: Provide concrete quantitative improvements (e.g., FID deltas, convergence speed metrics) and list the exact datasets and baselines used, rather than stating 'extensive experiments show improvements.'
  2. [Method] Notation: Define the hierarchical VAE feature levels and router output formulation with explicit equations to clarify how weighting occurs.
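On the notation point, one formulation consistent with the α/β scheme named in the figure captions would be the following (a reconstruction for illustration, not the paper's verbatim equations):

```latex
% Router: timestep embedding e(t) mapped to inter-group weights \beta
% and intra-group weights \alpha, each normalized to sum to one
(\beta(t), \alpha(t)) = R_\phi\big(e(t)\big), \qquad
\sum_g \beta_g(t) = 1, \qquad \sum_{l \in g} \alpha_{g,l}(t) = 1

% Blended hierarchical prior from frozen VAE encoder features G_{g,l}(x)
\bar z(x, t) = \sum_g \beta_g(t) \sum_{l \in g} \alpha_{g,l}(t)\, G_{g,l}(x)

% Alignment objective on projected DiT hidden states h_\theta(x_t, t)
\mathcal{L}_{\text{align}}
  = -\,\mathbb{E}_{x,\,t}\Big[\cos\big(\mathrm{proj}(h_\theta(x_t, t)),\, \bar z(x, t)\big)\Big]
```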

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects for strengthening the evidence that the dynamic router, rather than static multi-level supervision alone, drives the reported gains. We address each point below and will incorporate the requested analyses in the revised manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The reported results do not include an ablation comparing the full dynamic router against a static (timestep-independent) weighted combination of the same multi-level VAE features. Without this control, it remains unclear whether the claimed gains require the timestep-adaptive weighting or could be obtained from non-adaptive multi-level supervision, undermining the central claim that the router synchronizes granularity with SNR.

    Authors: We agree that this control experiment is essential to isolate the benefit of timestep-conditioned routing. In the revised manuscript we will add an ablation that replaces the Dynamic Router with a static (timestep-independent) weighted combination of the identical multi-level VAE features. The static weights will be either uniformly fixed or learned once across the full trajectory; the resulting model will be trained and evaluated under identical settings to the original AHPA. This will directly test whether adaptivity is required or whether richer static supervision suffices. revision: yes

  2. Referee: [Method] Method section (Dynamic Router description): The paper does not report statistics on router behavior (e.g., how often it selects each hierarchical level as a function of timestep or noise level). If router outputs are nearly constant across the trajectory, the adaptivity is not load-bearing and the improvement reduces to using richer VAE features.

    Authors: We acknowledge that quantitative evidence of router variation is needed to substantiate the adaptivity claim. In the revision we will include new figures and tables reporting router statistics: average selection probabilities (or softmax weights) for each VAE hierarchical level plotted against timestep, plus per-timestep histograms or variance metrics. These will be computed on the trained model and shown for representative noise levels, confirming that the router’s output distribution changes systematically along the denoising trajectory. revision: yes
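The promised router statistics reduce to a short evaluation sweep. The sketch below assumes a trained `router` callable mapping a batch of timesteps to per-level softmax weights (a hypothetical interface, not the paper's code); it bins timesteps and summarizes how much the mean weights move across bins.

```python
import numpy as np

def router_weight_profile(router, num_bins=10, batch=256, seed=0):
    """Evaluate a router on timesteps binned across [0, 1] and report
    per-bin mean and std of each level's weight. Near-constant means
    across bins would indicate the adaptivity is not load-bearing."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    profile = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        t = rng.uniform(lo, hi, size=batch)
        w = router(t)                    # (batch, K) softmax weights
        profile.append((w.mean(axis=0), w.std(axis=0)))
    means = np.stack([m for m, _ in profile])  # (num_bins, K)
    stds = np.stack([s for _, s in profile])
    # Scalar summary: spread of each level's mean weight across bins,
    # averaged over levels -- how strongly routing depends on t.
    adaptivity = means.std(axis=0).mean()
    return means, stds, adaptivity
```

A near-zero `adaptivity` score would confirm the referee's worry that the router's outputs are effectively constant; a large score, paired with the per-bin curves, would substantiate the claimed timestep dependence.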

Circularity Check

0 steps flagged

No circularity: new adaptive alignment framework evaluated against external baselines

full rationale

The paper introduces AHPA as a lightweight framework that extracts multi-level VAE features and uses a timestep-conditioned dynamic router to adapt alignment granularity. Its central claims rest on a design choice motivated by a hypothesis about SNR-dependent supervision needs, followed by empirical comparisons to baselines showing improved convergence and quality with no inference overhead. No equations or derivations reduce by construction to fitted inputs, self-citations, or renamed known results; the router and hierarchical priors are independent additions whose value is measured externally rather than defined tautologically, and the argument is checked against the stated experimental benchmarks rather than its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that VAE encoders naturally embed useful hierarchical priors and that a learned router can reliably match them to noise levels without introducing new instabilities.

axioms (2)
  • domain assumption VAE encoder features provide complementary priors ranging from local geometry to coarse semantic layout
    Invoked when stating that multi-level VAE features supply the necessary hierarchy for adaptive alignment.
  • domain assumption Useful alignment granularity changes systematically with signal-to-noise ratio
    Core premise used to argue that fixed supervision is suboptimal.
invented entities (1)
  • Timestep-conditioned Dynamic Router no independent evidence
    purpose: To adaptively select and weight hierarchical VAE priors along the denoising trajectory
    New module introduced to synchronize alignment granularity with training needs

pith-pipeline@v0.9.0 · 5559 in / 1373 out tokens · 25715 ms · 2026-05-08T01:25:26.628616+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 2 canonical work pages
