Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
Are large-scale datasets necessary for self- supervised pre-training?
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
Survey benchmarks SSL instance discrimination and masked image modeling for object detection, finding instance discrimination suits CNN encoders while MIM suits ViT encoders and custom pre-training, especially for small objects.
Pith review generated a malformed one-line summary.
citing papers explorer
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.