ViTAR: Vision Transformer with Any Resolution

Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang · 2024 · arXiv 2403.18361

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Weighted Reverse Convolution for Feature Upsampling

cs.CV · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

Weighted Reverse Convolution is a spatially adaptive inverse operator for densifying high-level visual descriptors from vision foundation models, using weighted regularization and an FFT closed-form solution to improve dense prediction tasks.

On What We Can Learn from Low-Resolution Data

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Low-resolution data improves high-resolution model performance when high-resolution samples are limited, via KL-divergence bounds and experiments on vision transformers and CNNs.

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

citing papers explorer

Weighted Reverse Convolution for Feature Upsampling · cs.CV · 2026-05-17 · unverdicted · none · ref 12 · 2 links
On What We Can Learn from Low-Resolution Data · cs.LG · 2026-05-12 · unverdicted · none · ref 23
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations · cs.CV · 2026-04-27 · unverdicted · none · ref 12
