HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

· 2026 · cs.CV · arXiv 2605.15741

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework establishing Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifold. Diverging from injecting semantics by AdaLN, HyperDiT utilizes Cross-Attention mechanisms, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual Foundation Model (VFM), effectively reducing generation hallucination and artifacts. Extensive experiments demonstrate that HyperDiT achieves state-of-the-art (SoTA) FID of $\mathbf{1.56}$ on ImageNet $256\times256$ directly within the pixel space. By combining the fine-grained stream with semantic guidance, HyperDiT offers a superior paradigm for high-fidelity pixel generation.

representative citing papers

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

cs.AI · 2026-05-28 · unverdicted · novelty 4.0

SafeDIG applies position-aware sparse feature transfer via SAEs in DiT models to reduce unsafe generations in target risk domains on FLUX.1 Dev and SD 3.5 while keeping source safety and quality.

citing papers explorer

Showing 2 of 2 citing papers.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion cs.CV · 2026-06-26 · unverdicted · none · ref 11 · internal anchor
PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.
Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers cs.AI · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
SafeDIG applies position-aware sparse feature transfer via SAEs in DiT models to reduce unsafe generations in target risk domains on FLUX.1 Dev and SD 3.5 while keeping source safety and quality.

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

fields

years

verdicts

representative citing papers

citing papers explorer