Very deep convolutional networks for large-scale image recognition
2 Pith papers cite this work.
Fields: cs.CV (2)
citing papers explorer
- Masked Autoencoders Are Scalable Vision Learners
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
- Obj-GloVe: Scene-Based Contextual Object Embedding
Obj-GloVe is a contextual embedding for visual objects derived from scene co-occurrences using the GloVe method, shown useful for object detection and text-to-image synthesis.