LiT: Delving into a simplified linear diffusion transformer for image generation
3 Pith papers cite this work. Polarity classification is still indexing.
citation summary
fields: cs.CV 3
years: 2026 3
roles: background 1
polarities: background 1
representative citing papers
JetViT uses post-training attention search to hybridize full-attention ViTs with linear and window attention blocks, achieving up to 1.79x throughput gains on high-res images while preserving accuracy on DINOv3 and DepthAnythingV2.
citing papers explorer
- MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
  MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first, then fine local details, plus faster-converging embeddings.
- JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
  JetViT uses post-training attention search to hybridize full-attention ViTs with linear and window attention blocks, achieving up to 1.79x throughput gains on high-res images while preserving accuracy on DINOv3 and DepthAnythingV2.
- Linearizing Vision Transformer with Test-Time Training