Vision Transformers Need Registers

2023 · cs.CV · arXiv 2309.16588

Mixed citation behavior. Most common role is background (65%).

76 Pith papers citing it

Background: 65% of classified citations

abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
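The abstract's proposed fix, providing additional tokens to the Vision Transformer's input sequence, can be illustrated with a minimal sketch. Everything named here (`EMBED_DIM`, `NUM_REGISTERS`, `add_registers`, `strip_registers`, the 4x4 patch grid) is hypothetical and chosen for illustration; an identity function stands in for the transformer blocks, and dropping the register outputs before downstream use is an assumption of this sketch rather than something the abstract spells out.

```python
import random

EMBED_DIM = 8        # hypothetical embedding width
NUM_REGISTERS = 4    # hypothetical number of extra tokens


def make_token(dim, rng):
    """Return one token embedding as a list of small random floats."""
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]


def add_registers(cls_token, register_tokens, patch_tokens):
    """Build the input sequence: [CLS], then the registers, then the image patches."""
    return [cls_token] + list(register_tokens) + list(patch_tokens)


def strip_registers(output_tokens, num_registers):
    """Split the output into the CLS token and patch tokens, dropping the registers."""
    cls_out = output_tokens[0]
    patch_out = output_tokens[1 + num_registers:]
    return cls_out, patch_out


rng = random.Random(0)
cls_token = make_token(EMBED_DIM, rng)
# In a real model the registers would be learned parameters; here they are random.
registers = [make_token(EMBED_DIM, rng) for _ in range(NUM_REGISTERS)]
patches = [make_token(EMBED_DIM, rng) for _ in range(16)]  # e.g. a 4x4 patch grid

sequence = add_registers(cls_token, registers, patches)
outputs = sequence  # identity stand-in for the transformer blocks (assumption)
cls_out, patch_out = strip_registers(outputs, NUM_REGISTERS)

print(len(sequence), len(patch_out))  # prints "21 16"
```

The sequence grows by only `NUM_REGISTERS` tokens, and the patch outputs keep their original count and order, so downstream dense-prediction code that expects one feature per patch would not need to change under these assumptions.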

citation-role summary

background: 13 · baseline: 2 · dataset: 1 · method: 1

citation-polarity summary

background: 11 · baseline: 2 · support: 1 · unclear: 1 · use dataset: 1 · use method: 1

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

GeoMix achieves new state-of-the-art results in descriptor-free 2D-3D matching by adding directional embeddings, learnable global context nodes, and multi-detector training, cutting rotation and translation errors by up to 90% on standard benchmarks.

Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy

astro-ph.IM · 2026-06-16 · unverdicted · novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.

Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.

Latent Anchor-Driven Test Generation for Deep Neural Networks

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

Latte performs seed-centered one-step latent mutations along class anchors in VQ-VAE space to produce diverse, low-drift, fault-revealing DNN tests.

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.

Fixed-Point Masked Generative Modeling

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

SplitAvatar applies an autoregressive graph splitting network with mesh topology extension and gated density control to generate detailed one-shot head avatars via 3D Gaussian Splatting.

Dual Prototype-Conditioned Diffusion Model for Scalable Multi-Class Unsupervised Anomaly Detection in Large Category Spaces

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

DPDiff-AD conditions a diffusion model on local prototypes (via nearest aggregation) and global prototypes (via optimal transport) to model normality scalably in multi-class anomaly detection, reporting AUROC gains on 160-category data.

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

A feed-forward model aligns ground and satellite features to predict Gaussian splats for improved novel-view synthesis on georeferenced outdoor scenes.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.

TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

cs.IR · 2026-05-18 · unverdicted · novelty 6.0

TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.
