How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

2021 · arXiv 2106.10270

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

cs.LG · 2026-05-14 · conditional · novelty 7.0

A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

Weierstrass Positional Encoding for Vision Transformers

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

WePE encodes 2D patch positions in Vision Transformers via Weierstrass elliptic functions on the complex plane to exploit double periodicity and derive relative positions algebraically.

Causal Attribution via Activation Patching

cs.CV · 2026-03-13 · unverdicted · novelty 6.0

CAAP produces patch attributions in ViTs by direct activation patching on intermediate layers to measure causal contribution to the target class score.

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

cs.LG · 2024-10-31 · unverdicted · novelty 6.0

π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Sigmoid Loss for Language Image Pre-Training

cs.CV · 2023-03-27 · conditional · novelty 6.0

SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.

ASAP: Attention Sink Anchored Pruning

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.

Decision-Aware Attention Propagation for Vision Transformer Explainability

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

DAP improves ViT attribution maps by injecting decision-relevant gradients into attention propagation, producing more class-sensitive and faithful explanations than standard attention rollout.

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

cs.LG · 2024-03-21 · accept · novelty 4.0

A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

citing papers explorer

Showing 9 of 9 citing papers.

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm cs.LG · 2026-05-14 · conditional · none · ref 57
A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
Weierstrass Positional Encoding for Vision Transformers cs.CV · 2026-05-20 · unverdicted · none · ref 19
WePE encodes 2D patch positions in Vision Transformers via Weierstrass elliptic functions on the complex plane to exploit double periodicity and derive relative positions algebraically.
Causal Attribution via Activation Patching cs.CV · 2026-03-13 · unverdicted · none · ref 34
CAAP produces patch attributions in ViTs by direct activation patching on intermediate layers to measure causal contribution to the target class score.
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control cs.LG · 2024-10-31 · unverdicted · none · ref 47
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 80
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Sigmoid Loss for Language Image Pre-Training cs.CV · 2023-03-27 · conditional · none · ref 42
SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.
ASAP: Attention Sink Anchored Pruning cs.LG · 2026-05-21 · unverdicted · none · ref 22
ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.
Decision-Aware Attention Propagation for Vision Transformer Explainability cs.CV · 2026-04-20 · unverdicted · none · ref 23
DAP improves ViT attribution maps by injecting decision-relevant gradients into attention propagation, producing more class-sensitive and faithful explanations than standard attention rollout.
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey cs.LG · 2024-03-21 · accept · none · ref 185
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
