Pith. Machine review for the scientific record.

FiT: Flexible Vision Transformer for Diffusion Model

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

citation-role summary: background (1)

citation-polarity summary: still indexing

fields: cs.CV (6)

years: 2026 (4), 2024 (2)

representative citing papers

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
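A minimal sketch of the general pattern described above, in which patches exchange information only through a small, learned set of core embeddings rather than attending to each other directly; the module name, dimensions, and core count are illustrative assumptions, not VECA's actual implementation.

```python
# Illustrative sketch (not VECA's code): patches never attend to each other
# directly; they exchange information through a small, fixed set of learned
# "core" embeddings, so cost grows linearly in the number of patches N.
import torch
import torch.nn as nn

class CoreAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_cores: int = 32, num_heads: int = 4):
        super().__init__()
        # Resolution-invariant core embeddings: their count does not depend on N.
        self.cores = nn.Parameter(torch.randn(num_cores, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # cores <- patches
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # patches <- cores

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) with arbitrary N (any input resolution).
        B = patches.shape[0]
        cores = self.cores.unsqueeze(0).expand(B, -1, -1)
        # Each step is cross-attention of size N x num_cores, i.e. O(N), not O(N^2).
        cores, _ = self.read(cores, patches, patches)    # gather global context into cores
        patches, _ = self.write(patches, cores, cores)   # broadcast it back to every patch
        return patches

tokens = torch.randn(2, 1024, 256)            # e.g. a 32x32 patch grid
print(CoreAttentionBlock()(tokens).shape)     # torch.Size([2, 1024, 256])
```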

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder, trained with NaFlex and a DINOv3 loss, that matches or exceeds prior tokenizers at 256p, outperforms them at 512p and above, and advances the Pareto frontier in joint scaling with generators.

LTX-Video: Realtime Video Latent Diffusion

cs.CV · 2024-12-30 · conditional · novelty 6.0

LTX-Video integrates a Video-VAE and a transformer for 1:192 latent compression and real-time video diffusion by moving patchifying into the VAE and letting the decoder finish denoising in pixel space.
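A quick back-of-the-envelope check of the 1:192 figure; the 32x32 spatial and 8x temporal downscale and the 128 latent channels are assumed factors chosen to reproduce the ratio, and only the 1:192 number itself comes from the summary above.

```python
# Back-of-the-envelope check of the 1:192 latent compression ratio.
# The 32x32x8 downscale and 128 latent channels are assumptions that
# reproduce the ratio; only "1:192" comes from the summary above.
pixels_per_latent = 32 * 32 * 8        # 32x32 spatial patch, 8-frame temporal window
rgb_values = pixels_per_latent * 3     # 3 color channels per pixel
latent_channels = 128
print(rgb_values / latent_channels)    # 192.0  ->  a 1:192 ratio
```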

MultiMedVision: Multi-Modal Medical Vision Framework

cs.CV · 2026-05-09 · unverdicted · novelty 5.0

A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.

Open-Sora: Democratizing Efficient Video Production for All

cs.CV · 2024-12-29 · unverdicted · novelty 5.0

Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
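A minimal sketch of decoupled spatial and temporal attention over a grid of video tokens, as described above; module names, tensor shapes, and dimensions are illustrative assumptions rather than Open-Sora's implementation.

```python
# Illustrative sketch (not Open-Sora's code): attention is factorized so that
# spatial attention mixes patches within each frame and temporal attention
# mixes the same patch position across frames.
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, dim) video tokens: T frames, each with S spatial patches.
        B, T, S, D = x.shape

        # Spatial attention: treat each frame as its own sequence of S patches.
        xs = x.reshape(B * T, S, D)
        xs = xs + self.spatial(xs, xs, xs)[0]

        # Temporal attention: treat each patch position as a sequence of T steps.
        xt = xs.reshape(B, T, S, D).transpose(1, 2).reshape(B * S, T, D)
        xt = xt + self.temporal(xt, xt, xt)[0]

        return xt.reshape(B, S, T, D).transpose(1, 2)   # back to (B, T, S, dim)

video = torch.randn(1, 16, 64, 256)           # 16 frames, 8x8 patches per frame
print(FactorizedSTBlock()(video).shape)       # torch.Size([1, 16, 64, 256])
```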

citing papers explorer

Showing 6 of 6 citing papers.

  • Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 129

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

  • ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters cs.CV · 2026-05-06 · unverdicted · none · ref 7

    ViTok-v2 is a 5B-parameter native-resolution image autoencoder, trained with NaFlex and a DINOv3 loss, that matches or exceeds prior tokenizers at 256p, outperforms them at 512p and above, and advances the Pareto frontier in joint scaling with generators.

  • VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations cs.CV · 2026-04-27 · unverdicted · none · ref 23

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  • LTX-Video: Realtime Video Latent Diffusion cs.CV · 2024-12-30 · conditional · none · ref 19

    LTX-Video integrates a Video-VAE and a transformer for 1:192 latent compression and real-time video diffusion by moving patchifying into the VAE and letting the decoder finish denoising in pixel space.

  • MultiMedVision: Multi-Modal Medical Vision Framework cs.CV · 2026-05-09 · unverdicted · none · ref 8

    A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.

  • Open-Sora: Democratizing Efficient Video Production for All cs.CV · 2024-12-29 · unverdicted · none · ref 19

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.