Recognition: 2 theorem links
· Lean Theorem
Vision Transformers Need Registers
Pith reviewed 2026-05-13 09:36 UTC · model grok-4.3
The pith
Vision transformers develop high-norm tokens in background regions that disrupt feature maps, but adding register tokens absorbs this role and cleans up representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High-norm tokens appear in low-informative background regions of ViT feature maps and are repurposed for internal computations. Adding a few register tokens to the input sequence lets these computations occur separately, removing the artifacts from image token representations and yielding cleaner outputs for both supervised and self-supervised training.
What carries the argument
Register tokens, a small set of additional input tokens that absorb internal computations previously performed by high-norm background tokens.
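For concreteness, a minimal sketch (not the authors' implementation) of how such register tokens can be spliced into a ViT forward pass, assuming they are learned parameters appended to the class-plus-patch sequence after positional embeddings and discarded from the output; all class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Illustrative wrapper: learnable register tokens appended to a ViT token sequence."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_registers=4):
        super().__init__()
        self.num_registers = num_registers
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Register tokens: a small set of learned vectors shared across all images.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) patch embeddings, positional embeddings already added.
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        reg = self.registers.expand(B, -1, -1)
        # Sequence layout: [CLS] + registers + patches; registers receive no positional embedding here.
        x = torch.cat([cls, reg, patch_tokens], dim=1)
        x = self.blocks(x)
        # Register outputs are discarded: they exist only to absorb the high-norm computations.
        cls_out = x[:, 0]
        patch_out = x[:, 1 + self.num_registers:]
        return cls_out, patch_out
```

Whether the registers sit before or after the patch tokens, and whether they receive positional embeddings, are implementation choices the paper's method section would pin down; the sketch only shows that the extra tokens participate in attention and are dropped at the end.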
If this is right
- Eliminates artifacts entirely in both supervised and self-supervised ViTs
- Sets new state of the art for self-supervised models on dense visual prediction tasks
- Enables object discovery methods with larger models
- Produces smoother feature maps and attention maps for downstream processing
Where Pith is reading between the lines
- Register tokens may reduce unnecessary computation in background regions across other transformer vision models
- The method could extend to tasks like semantic segmentation by further isolating non-informative areas
- Standard ViT tokenization may need rethinking to separate image content from internal model bookkeeping
Load-bearing premise
The high-norm tokens are mainly handling internal computations rather than carrying essential image information, so extra registers can take over without degrading useful representations.
What would settle it
If adding register tokens either lowers accuracy on standard image classification or creates new high-norm artifacts in feature maps, the claim that they fully resolve the issue would be falsified.
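The second half of that test can be operationalized by counting norm outliers among the output patch tokens, which is roughly how the artifacts are characterized in the first place. A hedged sketch, assuming a frozen backbone that exposes final-layer patch features of shape (B, N, D); the outlier cutoff here is an illustrative statistical proxy, not the paper's threshold.

```python
import torch

@torch.no_grad()
def high_norm_token_rate(patch_features, threshold=None):
    """Fraction of patch tokens whose feature norm is an outlier.

    patch_features: (B, N, D) final-layer patch features from a frozen ViT.
    threshold: absolute L2-norm cutoff; if None, fall back to mean + 3*std as a rough proxy.
    """
    norms = patch_features.norm(dim=-1)              # (B, N)
    if threshold is None:
        threshold = norms.mean() + 3 * norms.std()
    return (norms > threshold).float().mean().item()

# A model whose registers really absorb the artifacts should report a rate near zero here.
```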
read the original abstract
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies artifacts in the feature maps of Vision Transformers (both supervised and self-supervised), characterized by high-norm tokens appearing in low-informative background regions of images. These tokens are interpreted as being repurposed for internal computations within the network. The authors propose adding a small number of additional 'register' tokens to the input sequence to absorb this role. They claim that this simple modification eliminates the artifacts entirely, produces smoother feature and attention maps, achieves new state-of-the-art results on dense visual prediction tasks for self-supervised models, and facilitates object discovery with larger models.
Significance. If the central claims hold, this represents a significant practical contribution to Vision Transformer architectures. The addition of register tokens offers a minimal-change solution that improves feature quality for downstream tasks, particularly benefiting self-supervised learning pipelines and applications requiring dense predictions or object discovery. The empirical nature of the fix, while simple, could have broad adoption if the improvements are robustly demonstrated.
major comments (3)
- [§3] §3 (analysis of token norms): The interpretation that high-norm background tokens serve exclusively as computational sinks (rather than encoding any image information) is central to the motivation but relies on correlational evidence from norm statistics and attention maps. A controlled ablation, such as zeroing out these tokens or replacing their features while keeping the rest of the model fixed, would be necessary to isolate their role and rule out the possibility that they carry useful background context, which is precisely what the work's load-bearing premise assumes away; a sketch of such an intervention follows this list.
- [§5] §5 (experiments on dense tasks): The claim of setting a new state of the art on dense visual prediction tasks is load-bearing for the paper's impact but requires explicit numerical comparisons in the main results table, including the exact performance deltas over previous best methods, the number of register tokens used, and ablations showing that gains are attributable to the registers rather than other factors.
- [§4.1] §4.1 (qualitative results): The assertion that the solution 'fixes that problem entirely' for both supervised and self-supervised models needs stronger quantitative backing, such as metrics quantifying the reduction in artifact severity (e.g., background feature variance or attention entropy) before and after adding registers, beyond the qualitative examples provided.
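As flagged in the first comment above, a minimal sketch of the kind of controlled intervention that would separate "computational sink" from "useful background context": a forward hook that zeroes high-norm patch tokens at one block of a frozen model during inference. The module path, block index, and norm cutoff are placeholder assumptions, not values from the paper.

```python
import torch

def make_zeroing_hook(norm_threshold, num_prefix_tokens=1):
    """Forward hook that zeroes patch tokens whose norm exceeds a cutoff at inference time.

    num_prefix_tokens: leading tokens (CLS and any registers) that are left untouched.
    """
    def hook(module, inputs, output):
        x = output[0] if isinstance(output, tuple) else output   # (B, T, D)
        mask = (x.norm(dim=-1, keepdim=True) > norm_threshold).float()
        mask[:, :num_prefix_tokens] = 0.0                        # never zero CLS / registers
        x = x * (1.0 - mask)
        return (x,) + output[1:] if isinstance(output, tuple) else x
    return hook

# Illustrative wiring (model, block index, and cutoff are hypothetical):
# handle = model.blocks[8].register_forward_hook(make_zeroing_hook(norm_threshold=150.0))
# ...evaluate downstream accuracy with and without the hook...
# handle.remove()
```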
minor comments (2)
- [Abstract] The abstract states the claims strongly but omits any quantitative results or specific numbers, which would help readers assess the magnitude of the improvements immediately.
- [Method] Clarify the exact implementation details of how register tokens are initialized (learned parameters vs. fixed) and their interaction with positional embeddings, as this affects reproducibility and should be stated explicitly in the method section.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the suggestions strengthen the evidence or clarity, we have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [§3] §3 (analysis of token norms): The interpretation that high-norm background tokens serve exclusively as computational sinks (rather than encoding any image information) is central to the motivation but relies on correlational evidence from norm statistics and attention maps. A controlled ablation, such as zeroing out these tokens or replacing their features while keeping the rest of the model fixed, would be necessary to isolate their role and rule out the possibility that they carry useful background context, which is precisely what the work's load-bearing premise assumes away.
Authors: We agree that the evidence in §3 is observational and correlational, relying on norm statistics and attention patterns rather than direct causal intervention. A zeroing ablation on internal activations would be a useful addition for stronger isolation of the role, but it is non-trivial to implement without modifying the forward pass in ways that could introduce new artifacts. In the revised manuscript we have added a dedicated paragraph in §3 explicitly acknowledging the correlational nature of the evidence and discussing why the register-token results provide indirect support (the high-norm role is visibly transferred to the registers while image tokens remain low-norm and semantically coherent). We do not claim the interpretation is proven beyond correlation, but maintain that the practical elimination of artifacts via registers is the core contribution. revision: partial
-
Referee: [§5] §5 (experiments on dense tasks): The claim of setting a new state of the art on dense visual prediction tasks is load-bearing for the paper's impact but requires explicit numerical comparisons in the main results table, including the exact performance deltas over previous best methods, the number of register tokens used, and ablations showing that gains are attributable to the registers rather than other factors.
Authors: We accept this point. The original submission presented results but did not tabulate deltas or isolate the register contribution with dedicated ablations. In the revised version we have updated the primary results table in §5 to report (i) exact performance numbers and deltas versus the prior best methods, (ii) the precise number of register tokens used in each experiment (typically 4), and (iii) an additional ablation column that varies only the presence of registers while holding all other hyperparameters fixed. These changes make the source of the gains explicit. revision: yes
-
Referee: [§4.1] §4.1 (qualitative results): The assertion that the solution 'fixes that problem entirely' for both supervised and self-supervised models needs stronger quantitative backing, such as metrics quantifying the reduction in artifact severity (e.g., background feature variance or attention entropy) before and after adding registers, beyond the qualitative examples provided.
Authors: We agree that quantitative metrics would make the 'fixes entirely' claim more robust. We have therefore added, in the revised §4.1, two new quantitative measures: (1) average background feature variance (computed over non-object regions identified by off-the-shelf saliency) and (2) attention entropy across layers. Both metrics show large, consistent reductions (background variance drops by approximately 80 % on average) when registers are introduced, for both supervised and self-supervised models. These numbers are reported alongside the existing qualitative figures. revision: yes
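A hedged sketch of the two metrics described in this response, assuming patch features of shape (B, N, D), a boolean background mask of shape (B, N) from some off-the-shelf saliency method, and softmax-normalized attention weights of shape (B, heads, T, T); this is illustrative, not the authors' evaluation code.

```python
import torch

@torch.no_grad()
def background_feature_variance(patch_features, background_mask):
    """Mean per-dimension variance of features restricted to background tokens.

    patch_features: (B, N, D); background_mask: (B, N) boolean, True for background patches.
    """
    per_image = []
    for feats, mask in zip(patch_features, background_mask):
        bg = feats[mask]                                  # (num_background_tokens, D)
        if bg.shape[0] > 1:
            per_image.append(bg.var(dim=0).mean())
    return torch.stack(per_image).mean().item()

@torch.no_grad()
def mean_attention_entropy(attn, eps=1e-9):
    """Average entropy of each query's attention distribution over keys.

    attn: (B, heads, T, T), rows already softmax-normalized.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)    # (B, heads, T)
    return entropy.mean().item()
```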
Circularity Check
No circularity: empirical observation leads to architectural addition without self-referential reduction
full rationale
The paper identifies high-norm tokens in background regions via direct inspection of ViT activations and attention maps, then proposes adding a small number of register tokens to the input sequence as an empirical fix. This proposal is validated through downstream experiments on supervised and self-supervised models but does not rely on any equation that defines the solution in terms of itself, any fitted parameter renamed as a prediction, or a load-bearing self-citation chain. The central claim remains an independent architectural intervention supported by ablation results and qualitative visualizations, with no derivation step that collapses to the input observations by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math · Vision Transformers process images as fixed-length sequences of patch embeddings plus a class token
invented entities (1)
-
register tokens
no independent evidence
Lean theorems connected to this paper
-
Foundation.EightTick · eight_tick_forces_D3 · echoes "We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role."
Forward citations
Cited by 27 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...
-
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...
-
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
-
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
Taming Outlier Tokens in Diffusion Transformers
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
-
Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding
GTokenLLMs do not fully understand graph tokens, exhibiting over-sensitivity or insensitivity to instruction changes and relying heavily on text for reasoning even when graph information is preserved.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
-
Self-supervised pretraining for an iterative image size agnostic vision transformer
A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
-
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
-
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
-
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models...
-
Generative Event Pretraining with Foundation Model Alignment
GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
-
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.
-
LTX-2: Efficient Joint Audio-Visual Foundation Model
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
-
Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers
Pre-trained ViT representations combined with active learning and targeted design choices for annotations and selection improve object class retrieval in multi-object scenes.