pith. machine review for the scientific record.

arxiv: 2010.11929 · v2 · submitted 2020-10-22 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, Xiaohua Zhai

classification 💻 cs.CV cs.AI cs.LG
keywords: image, convolutional, networks, transformer, vision, while, applied, recognition
0 comments

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
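The abstract's core idea — treating an image as a sequence of flattened 16x16 patches that a standard Transformer can consume — can be sketched in a few lines. This is a minimal illustration with placeholder random projection weights, not the authors' implementation; function and variable names are hypothetical.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16, dim=768, rng=np.random.default_rng(0)):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles,
    flatten each tile, and linearly project it to the model dimension.
    The projection here uses random placeholder weights; in ViT it is learned."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = (
        image.reshape(H // patch, patch, W // patch, patch, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch * patch * C)
    )
    W_proj = rng.normal(0.0, 0.02, size=(patch * patch * C, dim))
    return patches @ W_proj  # sequence of patch embeddings: (num_patches, dim)

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 224x224 image yields a 14x14 grid of 16x16 patches
```

From here, ViT prepends a class token, adds learned position embeddings, and feeds the sequence to an unmodified Transformer encoder — no convolutions anywhere in the backbone.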

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark

    cs.CV 2026-04 conditional novelty 9.0

    DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRIs featuring exercise-induced anatomical changes and pre/post-surgery scans, where existing models achieve an average Dice score of 0.82.

  2. CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

    cs.CV 2026-05 accept novelty 8.0

    CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.

  3. Dissecting Jet-Tagger Through Mechanistic Interpretability

    hep-ph 2026-05 accept novelty 8.0

    A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

  4. Gradient-Based Program Synthesis with Neurally Interpreted Languages

    cs.LG 2026-04 unverdicted novelty 8.0

    NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...

  5. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    cs.RO 2023-03 accept novelty 8.0

    Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.

  6. Efficiently Modeling Long Sequences with Structured State Spaces

    cs.LG 2021-10 unverdicted novelty 8.0

    S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...

  7. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  8. Revisiting Shadow Detection from a Vision-Language Perspective

    cs.CV 2026-05 unverdicted novelty 7.0

    SVL uses language embeddings aligned with global image representations via shadow ratio regression and global-to-local coupling to improve shadow detection robustness in ambiguous cases.

  9. Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

    cs.CV 2026-05 unverdicted novelty 7.0

    SkyPart uses learnable prototypes for patch grouping, altitude modulation only in training, graph-attention readout, and Kendall-weighted loss to set new state-of-the-art single-pass performance on SUES-200, Universit...

  10. SoK: Unlearnability and Unlearning for Model Dememorization

    cs.LG 2026-05 conditional novelty 7.0

    The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.

  11. TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

    cs.CV 2026-05 unverdicted novelty 7.0

    TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

  12. Can Graphs Help Vision SSMs See Better?

    cs.CV 2026-05 unverdicted novelty 7.0

    GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA results among such models on classification and segmentation tasks.

  13. RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

    cs.LG 2026-05 unverdicted novelty 7.0

    RelFlexformers enable flexible integrable 3D RPE in attention via NU-FFT, generalizing prior methods to heterogeneous token positions with O(L log L) complexity.

  14. Automated Detection of Abnormalities in Zebrafish Development

    cs.CV 2026-05 unverdicted novelty 7.0

    A new annotated dataset of zebrafish embryo image sequences enables a spatiotemporal transformer to classify fertility at 98% accuracy and detect compound-induced malformations at 92% accuracy.

  15. The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently

    cs.LG 2026-05 unverdicted novelty 7.0

    Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.

  16. Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval

    cs.CV 2026-05 unverdicted novelty 7.0

    GAPan uses invertible normalizing flows to learn generative appearance priors from seen categories and aligns retrieval embeddings to these priors, improving performance on unseen categories in fine-grained image retrieval.

  17. PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

    cs.CV 2026-05 unverdicted novelty 7.0

    PromptDx adds a differentiable adapter to align multimodal data with a pre-trained TabPFN-style ICL engine, achieving strong Alzheimer's diagnosis performance with only 1% context samples.

  18. EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.

  19. Neural network quantum states in the grand canonical ensemble

    quant-ph 2026-05 unverdicted novelty 7.0

    A new neural quantum state ansatz for bosons in the grand canonical ensemble achieves competitive variational energies in 1D and 2D systems and provides access to one-body reduced density matrices.

  20. SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

    cs.CV 2026-05 unverdicted novelty 7.0

    SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdo...

  21. On the Invariance and Generality of Neural Scaling Laws

    cs.LG 2026-05 unverdicted novelty 7.0

    Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.

  22. Amortized-Precision Quantization for Early-Exit Vision Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    Amortized-Precision Quantization (APQ) and the MAQEE bi-level framework jointly optimize bit-widths and exit thresholds for early-exit ViTs, cutting BOPs by up to 95% with maintained accuracy across vision tasks.

  23. Testing machine-learned distributions against Monte Carlo data for the QCD chiral phase transition

    hep-lat 2026-05 unverdicted novelty 7.0

    Conditional MAFs interpolate QCD chiral phase structure across coupling, mass, and volume, reproducing reweighting while cutting required ensembles despite bias near transitions.

  24. TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial ...

  25. How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models

    stat.ML 2026-05 conditional novelty 7.0

    Attention pooling produces a free-multiplicative-convolution bulk spectrum and two phase transitions for signal recovery; optimal weights are the top eigenvector of the positional correlation matrix R.

  26. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  27. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

    cs.LG 2026-05 conditional novelty 7.0

    Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...

  28. Empirical Evidence for Simply Connected Decision Regions in Image Classifiers

    cs.CV 2026-05 unverdicted novelty 7.0

    Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.

  29. SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition

    cs.HC 2026-05 unverdicted novelty 7.0

    SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.

  30. Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.

  31. Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

    cs.AR 2026-05 unverdicted novelty 7.0

    CAIS delivers 1.38x end-to-end LLM training speedup over NVLS and 1.61x over T3 by making in-switch computing aware of computation memory requirements instead of treating communication as an isolated phase.

  32. FIBER: A Differentially Private Optimizer with Filter-Aware Innovation Bias Correction

    cs.LG 2026-05 unverdicted novelty 7.0

    FiBeR adds a closed-form filter-aware correction A(ω)σ_w² to the second-moment term for temporally filtered DP gradients, improving adaptive optimization performance.

  33. GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.

  34. Projection-Free Transformers via Gaussian Kernel Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

  35. SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition

    cs.CV 2026-05 unverdicted novelty 7.0

    SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames a...

  36. Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

    cs.LG 2026-05 unverdicted novelty 7.0

    EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.

  37. SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

    cs.CV 2026-05 conditional novelty 7.0

    SplAttN replaces hard projection with Gaussian soft splatting to avoid cross-modal entropy collapse, achieving SOTA point cloud completion on PCN and ShapeNet while maintaining visual cue dependency on KITTI.

  38. Machine Learning-Augmented Acceleration of Iterative Ptychographic Reconstruction

    cs.LG 2026-05 conditional novelty 7.0

    A learned fast-forward operator accelerates iterative ptychographic reconstruction by over twofold in wall-clock time while maintaining comparable quality on temporally held-out experimental data.

  39. Reconstructing conformal field theoretical compositions with Transformers

    hep-th 2026-05 unverdicted novelty 7.0

    Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.

  40. Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    xMAE pretrains biosignal representations via masked cross-modal reconstruction of temporally ordered signals like ECG and PPG, outperforming baselines on 15 of 19 downstream tasks including cardiovascular prediction a...

  41. Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data

    cs.CV 2026-05 unverdicted novelty 7.0

    ViTCG, a channel-grouped Vision Transformer, retrieves AOD from PACE hyperspectral data with 62% lower MSE than prior foundation models while producing spatially coherent fields.

  42. Sampling two-dimensional spin systems with transformers

    cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

    Transformer networks sample up to 180x180 2D Ising systems and 64x64 Edwards-Anderson systems by generating spin groups with probability approximations, yielding ~20x higher effective sample size than prior neural sam...

  43. Rethink MAE with Linear Time-Invariant Dynamics

    cs.CV 2026-04 unverdicted novelty 7.0

    Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

  44. Distributed Multi-View Vision-Only RSSI Estimation

    cs.IT 2026-04 unverdicted novelty 7.0

    MulViT-TF uses distributed multi-view vision and Transformer fusion to estimate RSSI, cutting RMSE by up to 26.3% versus single-view baselines in two indoor scenes while using fewer resources.

  45. Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

  46. Learning Neural Operator Surrogates for the Black Hole Accretion Code

    astro-ph.HE 2026-04 unverdicted novelty 7.0

    Physics-informed Fourier neural operators recover plasmoid formation in sparse SRRMHD vortex data where data-only models fail, and transformer operators approximate AMR jet evolution, marking first reported uses in th...

  47. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  48. Pareto Frontier of Neural Quantum States: Scalable, Affordable, and Accurate Convolutional Backflow for Strongly Correlated Lattice Fermions

    cond-mat.str-el 2026-04 unverdicted novelty 7.0

    SCALE and ACE are new convolutional backflow architectures for Neural Quantum States that deliver O(N^3) scaling with high accuracy and over 40x speedup on Hubbard and t-J models up to 32x32 lattices.

  49. Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 conditional novelty 7.0

    Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...

  50. Attention Is Not All You Need for Diffraction

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 7.0

    Physics-informed transformer with sin^2(theta) encoding, physics-aware positional encoding, multi-task decoder, and three-stage curriculum classifies powder diffraction into 99 extinction groups, with structured error...

  51. VitaminP: cross-modal learning enables whole-cell segmentation from routine histology

    cs.CV 2026-04 unverdicted novelty 7.0

    VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.

  52. KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

    cs.CV 2026-04 unverdicted novelty 7.0

    KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

  53. Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts

    cs.CV 2026-04 unverdicted novelty 7.0

    CNN models with attention reach 99.05% top-1 accuracy on line-level splits and 78.61% on page-disjoint splits for writer identification after expanding the labeled portion of the Muharaf historical Arabic manuscript dataset.

  54. StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

    cs.GR 2026-04 unverdicted novelty 7.0

    StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

  55. GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

    cs.CV 2026-04 conditional novelty 7.0

    GraphLeap decouples per-layer graph construction from feature updates in Vision GNNs by using previous-layer features for the current graph, enabling pipelined FPGA acceleration with up to 95.7× CPU speedup after fine-tuning.

  56. Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

    cs.CV 2026-04 unverdicted novelty 7.0

    Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench f...

  57. AI models of unstable flow exhibit hallucination

    physics.flu-dyn 2026-04 unverdicted novelty 7.0

    AI models of viscous fingering exhibit hallucinations from spectral bias; DeepFingers combines FNO and DeepONet with time-contrast conditioning to predict accurate finger dynamics while preserving mixing metrics.

  58. Benign Overfitting in Adversarial Training for Vision Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.

  59. URoPE: Universal Relative Position Embedding across Geometric Spaces

    cs.CV 2026-04 unverdicted novelty 7.0

    URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...

  60. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.