WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
hub Mixed citations
Vision Transformers Need Registers
Mixed citation behavior. Most common role is background (65%).
abstract
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GeoMix achieves new state-of-the-art results in descriptor-free 2D-3D matching by adding directional embeddings, learnable global context nodes, and multi-detector training, cutting rotation and translation errors by up to 90% on standard benchmarks.
A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.
AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.
Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.
GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.
Latte performs seed-centered one-step latent mutations along class anchors in VQ-VAE space to produce diverse, low-drift, fault-revealing DNN tests.
Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.
FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
SplitAvatar applies an autoregressive graph splitting network with mesh topology extension and gated density control to generate detailed one-shot head avatars via 3D Gaussian Splatting.
DPDiff-AD conditions a diffusion model on local prototypes (via nearest aggregation) and global prototypes (via optimal transport) to model normality scalably in multi-class anomaly detection, reporting AUROC gains on 160-category data.
A feed-forward model aligns ground and satellite features to predict Gaussian splats for improved novel-view synthesis on georeferenced outdoor scenes.
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
TIGER-FG proposes text-guided implicit fine-grained grounding with dual distillation to address modality and granularity asymmetries in image-to-multimodal e-commerce retrieval, reporting Recall@1 gains of 6.1 and 34.4 points on two new benchmarks.
citing papers explorer
No citing papers match the current filters.