hub Mixed citations

Vision Transformers Need Registers

· 2023 · cs.CV · arXiv 2309.16588

Mixed citation behavior. Most common role is background (65%).

79 Pith papers citing it

abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
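The mechanism described in the abstract (extra tokens added to the ViT input sequence, with artifacts showing up as high-norm tokens) can be illustrated with a minimal pure-Python sketch. Everything named here is hypothetical: the function names, the `[CLS] + registers + patches` ordering, the choice to discard register outputs, and the median-based norm threshold are assumptions for illustration, not details taken from this page or confirmed as the authors' implementation.

```python
import math


def token_norms(tokens):
    """Return the L2 norm of each token vector."""
    return [math.sqrt(sum(x * x for x in t)) for t in tokens]


def high_norm_indices(tokens, ratio=3.0):
    """Flag tokens whose norm exceeds `ratio` times the median norm.

    The threshold rule is an assumed heuristic, not the paper's criterion.
    """
    norms = token_norms(tokens)
    median = sorted(norms)[len(norms) // 2]
    return [i for i, n in enumerate(norms) if n > ratio * median]


def add_registers(cls_token, patch_tokens, registers):
    """Build the input sequence as [CLS] + registers + patches (assumed order)."""
    return [cls_token] + list(registers) + list(patch_tokens)


def strip_registers(output_tokens, num_registers):
    """Drop register outputs; return the [CLS] output and the patch outputs."""
    return output_tokens[0], output_tokens[1 + num_registers:]


if __name__ == "__main__":
    dim = 4
    cls = [0.0] * dim
    patches = [[0.1 * (i + 1)] * dim for i in range(6)]
    # In a real model these would be learned parameters; zeros are a stand-in.
    registers = [[0.0] * dim for _ in range(2)]

    seq = add_registers(cls, patches, registers)
    print(len(seq))  # 1 CLS + 2 registers + 6 patches

    cls_out, patch_out = strip_registers(seq, num_registers=2)
    print(len(patch_out))
```

In this sketch the registers only change the sequence length seen by the transformer blocks; downstream code still receives one `[CLS]` output and one output per patch, which is why the register outputs are stripped before use.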

citation-role summary

background 13 · baseline 2 · dataset 1 · method 1

citation-polarity summary

background 11 · baseline 2 · support 1 · unclear 1 · use dataset 1 · use method 1

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

GeoMix achieves new state-of-the-art results in descriptor-free 2D-3D matching by adding directional embeddings, learnable global context nodes, and multi-detector training, cutting rotation and translation errors by up to 90% on standard benchmarks.

Massive Activations Are Architecturally Robust: A Controlled Scratch/Commitment Residual Stream Test

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

In 160M and 290M parameter models, a new residual-stream split into scratch and protected channels causes massive activations to re-emerge in the protected decode channel, more concentrated on the start token.

Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy

astro-ph.IM · 2026-06-16 · unverdicted · novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.

Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

Argus introduces a covisibility module and decomposed pixel-to-world mapping to deliver SOTA metric performance on camera pose, depth, and point cloud tasks using the Realsee3D panoramic dataset.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

RegimeVGGT applies layer-wise U-shaped compression via saliency-guided banded merging and selectively protected K/V downsampling to deliver 6.7x speedup on VGGT at matched reconstruction quality.

Contrastive Action-Image Pre-training for Visuomotor Control

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.

Latent Anchor-Driven Test Generation for Deep Neural Networks

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

Latte performs seed-centered one-step latent mutations along class anchors in VQ-VAE space to produce diverse, low-drift, fault-revealing DNN tests.

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.

Fixed-Point Masked Generative Modeling

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

SplitAvatar applies an autoregressive graph splitting network with mesh topology extension and gated density control to generate detailed one-shot head avatars via 3D Gaussian Splatting.

Dual Prototype-Conditioned Diffusion Model for Scalable Multi-Class Unsupervised Anomaly Detection in Large Category Spaces

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

DPDiff-AD conditions a diffusion model on local prototypes (via nearest aggregation) and global prototypes (via optimal transport) to model normality scalably in multi-class anomaly detection, reporting AUROC gains on 160-category data.

citing papers explorer

Showing 7 of 7 citing papers after filters.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · none · ref 137 · internal anchor

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Contrastive Action-Image Pre-training for Visuomotor Control

cs.RO · 2026-06-15 · unverdicted · none · ref 52 · internal anchor

CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

cs.RO · 2026-06-08 · unverdicted · none · ref 8 · internal anchor

GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.

OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

cs.RO · 2026-04-20 · unverdicted · none · ref 16 · internal anchor

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation

cs.RO · 2026-02-26 · unverdicted · none · ref 44 · internal anchor

InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

3D Point World Models: Point Completion Enables More Accurate Dynamics Learning

cs.RO · 2026-06-30 · unverdicted · none · ref 47 · internal anchor

3DPWM completes partial point clouds then learns dynamics on the completed 3D scenes to produce reliable long-horizon rollouts for model-based robotic planning.

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

cs.RO · 2026-06-17 · unverdicted · none · ref 20 · internal anchor

ReSiReg clusters VLM intermediates into prototypes, derives language descriptors, and reconstructs patches as mixtures to improve spatial consistency in dense language-grounded retrieval for robotics.
