Converts pretrained Vision Transformers to linear-complexity TTT models via architectural and representational alignment, demonstrated by linearizing Stable Diffusion 3.5 with 1-hour fine-tuning to match quality at 1.32-1.47x faster inference.
Lit: Delving into a simpli- fied linear diffusion transformer for image generation
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CV 3years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
JetViT uses post-training attention search to hybridize full-attention ViTs with linear and window attention blocks, achieving up to 1.79x throughput gains on high-res images while preserving accuracy on DINOv3 and DepthAnythingV2.
citing papers explorer
-
MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.