Pith. Machine review for the scientific record.

FiT: Flexible Vision Transformer for Diffusion Model

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

citation-role summary: background (1)

citation-polarity summary: still indexing

fields: cs.CV (6)

years: 2026 (4), 2024 (2)

representative citing papers

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
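A minimal sketch of the general pattern described above, in which patches exchange information only through a small, learned set of core embeddings rather than attending to each other directly; the module name, dimensions, and core count are illustrative assumptions, not VECA's actual implementation.

```python
# Illustrative sketch (not VECA's code): patches never attend to each other
# directly; they exchange information through a small, fixed set of learned
# "core" embeddings, so cost grows linearly in the number of patches N.
import torch
import torch.nn as nn

class CoreAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_cores: int = 32, num_heads: int = 4):
        super().__init__()
        # Resolution-invariant core embeddings: their count does not depend on N.
        self.cores = nn.Parameter(torch.randn(num_cores, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # cores <- patches
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # patches <- cores

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) with arbitrary N (any input resolution).
        B = patches.shape[0]
        cores = self.cores.unsqueeze(0).expand(B, -1, -1)
        # Each step is cross-attention of size N x num_cores, i.e. O(N), not O(N^2).
        cores, _ = self.read(cores, patches, patches)    # gather global context into cores
        patches, _ = self.write(patches, cores, cores)   # broadcast it back to every patch
        return patches

tokens = torch.randn(2, 1024, 256)            # e.g. a 32x32 patch grid
print(CoreAttentionBlock()(tokens).shape)     # torch.Size([2, 1024, 256])
```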

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder, trained with NaFlex and a DINOv3 loss, that matches or exceeds prior tokenizers at 256p, outperforms them at 512p and above, and advances the Pareto frontier in joint scaling with generators.

LTX-Video: Realtime Video Latent Diffusion

cs.CV · 2024-12-30 · conditional · novelty 6.0

LTX-Video integrates a Video-VAE and a transformer for 1:192 latent compression and real-time video diffusion by moving patchifying into the VAE and letting the decoder finish denoising in pixel space.
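A quick back-of-the-envelope check of the 1:192 figure; the 32x32 spatial and 8x temporal downscale and the 128 latent channels are assumed factors chosen to reproduce the ratio, and only the 1:192 number itself comes from the summary above.

```python
# Back-of-the-envelope check of the 1:192 latent compression ratio.
# The 32x32x8 downscale and 128 latent channels are assumptions that
# reproduce the ratio; only "1:192" comes from the summary above.
pixels_per_latent = 32 * 32 * 8        # 32x32 spatial patch, 8-frame temporal window
rgb_values = pixels_per_latent * 3     # 3 color channels per pixel
latent_channels = 128
print(rgb_values / latent_channels)    # 192.0  ->  a 1:192 ratio
```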

MultiMedVision: Multi-Modal Medical Vision Framework

cs.CV · 2026-05-09 · unverdicted · novelty 5.0

A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.

Open-Sora: Democratizing Efficient Video Production for All

cs.CV · 2024-12-29 · unverdicted · novelty 5.0

Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
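A minimal sketch of decoupled spatial and temporal attention over a grid of video tokens, as described above; module names, tensor shapes, and dimensions are illustrative assumptions rather than Open-Sora's implementation.

```python
# Illustrative sketch (not Open-Sora's code): attention is factorized so that
# spatial attention mixes patches within each frame and temporal attention
# mixes the same patch position across frames.
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, dim) video tokens: T frames, each with S spatial patches.
        B, T, S, D = x.shape

        # Spatial attention: treat each frame as its own sequence of S patches.
        xs = x.reshape(B * T, S, D)
        xs = xs + self.spatial(xs, xs, xs)[0]

        # Temporal attention: treat each patch position as a sequence of T steps.
        xt = xs.reshape(B, T, S, D).transpose(1, 2).reshape(B * S, T, D)
        xt = xt + self.temporal(xt, xt, xt)[0]

        return xt.reshape(B, S, T, D).transpose(1, 2)   # back to (B, T, S, dim)

video = torch.randn(1, 16, 64, 256)           # 16 frames, 8x8 patches per frame
print(FactorizedSTBlock()(video).shape)       # torch.Size([1, 16, 64, 256])
```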

citing papers explorer

Showing 6 of 6 citing papers.

  • Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 129

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

  • ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters cs.CV · 2026-05-06 · unverdicted · none · ref 7

    ViTok-v2 is a 5B-parameter native-resolution image autoencoder, trained with NaFlex and a DINOv3 loss, that matches or exceeds prior tokenizers at 256p, outperforms them at 512p and above, and advances the Pareto frontier in joint scaling with generators.

  • VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations cs.CV · 2026-04-27 · unverdicted · none · ref 23

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  • LTX-Video: Realtime Video Latent Diffusion cs.CV · 2024-12-30 · conditional · none · ref 19

    LTX-Video integrates a Video-VAE and a transformer for 1:192 latent compression and real-time video diffusion by moving patchifying into the VAE and letting the decoder finish denoising in pixel space.

  • MultiMedVision: Multi-Modal Medical Vision Framework cs.CV · 2026-05-09 · unverdicted · none · ref 8

    A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.

  • Open-Sora: Democratizing Efficient Video Production for All cs.CV · 2024-12-29 · unverdicted · none · ref 19

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.