An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
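The core idea the abstract describes, treating an image as a sequence of patch tokens fed directly to a transformer, can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation; the 224x224 input, 16-pixel patch size, and 768-dimensional projection are assumptions matching the standard ViT-Base configuration, and the random projection stands in for a learned one.

```python
import numpy as np

# Split an image into non-overlapping 16x16 patches and flatten each one,
# as the abstract describes (shapes and names are illustrative).
def patchify(image, patch=16):
    """image: (H, W, C) array -> (num_patches, patch*patch*C) matrix."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)           # group the two patch-grid axes
    return grid.reshape(-1, patch * patch * c)     # one row per patch

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))           # a 224x224 RGB image
tokens = patchify(img)                             # (196, 768): a 14x14 patch grid
embed = rng.standard_normal((768, 768))            # linear projection (random here)
sequence = tokens @ embed                          # the transformer's input sequence
print(tokens.shape, sequence.shape)                # (196, 768) (196, 768)
```

In the actual model, a learned class token and position embeddings are added to this 196-token sequence before it enters the transformer encoder.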
co-cited works
- Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.
- S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.
representative citing papers
- DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark
  DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRIs featuring exercise-induced anatomical changes and pre/post-surgery scans, where existing models achieve an average Dice score of 0.82.
- CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
  CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
- Dissecting Jet-Tagger Through Mechanistic Interpretability
  A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
- Gradient-Based Program Synthesis with Neurally Interpreted Languages
  NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
- Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
  SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
- QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
  QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
- MedCore: Boundary-Preserving Medical Core Pruning for MedSAM
  MedCore achieves 60% parameter and 58.4% FLOP reduction on MedSAM with Dice 0.9549 and preserved boundary metrics via dual-intervention pruning and a new boundary leverage principle.
- Sensing-Assisted LoS/NLoS Identification in Dynamic UAV Positioning Systems
  A new dual-input feature fusion network using RGB images and channel impulse responses identifies LoS/NLoS conditions for UAVs with up to 97.69% accuracy and reduces trilateration positioning error by about 70%.
- RotVLA: Rotational Latent Action for Vision-Language-Action Model
  RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
- KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models
  KamonBench is a grammar-generated synthetic dataset of compositional kamon crests with explicit factor annotations to evaluate factor recovery in vision-language models.
- Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks
  Backdoors can be realized as statistically natural latent directions in modern neural networks, achieving high attack success with negligible clean accuracy loss and resisting existing defenses.
- From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
  SubPopMark protects distilled datasets by injecting verifiable subpopulation biases that create distinguishable model behaviors for copyright tracing without using backdoors.
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
  MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- Revisiting Shadow Detection from a Vision-Language Perspective
  SVL uses language embeddings aligned with global image representations via shadow ratio regression and global-to-local coupling to improve shadow detection robustness in ambiguous cases.
- Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
  SkyPart uses learnable prototypes for patch grouping, altitude modulation only in training, graph-attention readout, and Kendall-weighted loss to set new state-of-the-art single-pass performance on SUES-200, University-1652, and DenseUAV while widening gains under weather corruptions.
- SoK: Unlearnability and Unlearning for Model Dememorization
  The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.
- TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
  TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
- Can Graphs Help Vision SSMs See Better?
  GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA results among such models on classification and segmentation tasks.
- RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings
  RelFlexformer enables flexible integrable 3D RPE in attention via NU-FFT, generalizing prior methods to heterogeneous token positions with O(L log L) complexity.
- Automated Detection of Abnormalities in Zebrafish Development
  A new annotated dataset of zebrafish embryo image sequences enables a spatiotemporal transformer to classify fertility at 98% accuracy and detect compound-induced malformations at 92% accuracy.
- The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
  Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
- Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval
  GAPan uses invertible normalizing flows to learn generative appearance priors from seen categories and aligns retrieval embeddings to these priors, improving performance on unseen categories in fine-grained image retrieval.
- PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis
  PromptDx adds a differentiable adapter to align multimodal data with a pre-trained TabPFN-style ICL engine, achieving strong Alzheimer's diagnosis performance with only 1% context samples.
- EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
  EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
- Neural network quantum states in the grand canonical ensemble
  A new neural quantum state ansatz for bosons in the grand canonical ensemble achieves competitive variational energies in 1D and 2D systems and provides access to one-body reduced density matrices.
- SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
  SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
- On the Invariance and Generality of Neural Scaling Laws
  Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
- Amortized-Precision Quantization for Early-Exit Vision Transformers
  Amortized-Precision Quantization (APQ) and the MAQEE bi-level framework jointly optimize bit-widths and exit thresholds for early-exit ViTs, cutting BOPs by up to 95% with maintained accuracy across vision tasks.
- Testing machine-learned distributions against Monte Carlo data for the QCD chiral phase transition
  Conditional MAFs interpolate QCD chiral phase structure across coupling, mass, and volume, reproducing reweighting while cutting required ensembles despite bias near transitions.
- TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
  TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial MSSL methods on urban mobility and road tasks.
- How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models
  Attention pooling produces a free-multiplicative-convolution bulk spectrum and two phase transitions for signal recovery; optimal weights are the top eigenvector of the positional correlation matrix R.
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
- Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
  Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent application with convergence and OOD guarantees.
- Empirical Evidence for Simply Connected Decision Regions in Image Classifiers
  Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.
- SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
  SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
- Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
  AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
- Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
  CAIS delivers 1.38x end-to-end LLM training speedup over NVLS and 1.61x over T3 by making in-switch computing aware of computation memory requirements instead of treating communication as an isolated phase.
- FIBER: A Differentially Private Optimizer with Filter-Aware Innovation Bias Correction
  FiBeR adds a closed-form filter-aware correction A(ω)σ_w² to the second-moment term for temporally filtered DP gradients, improving adaptive optimization performance.
- GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
  GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.
- Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks
  The EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.
- SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion
  SplAttN replaces hard projection with Gaussian soft splatting to avoid cross-modal entropy collapse, achieving SOTA point cloud completion on PCN and ShapeNet while maintaining visual cue dependency on KITTI.
- Machine Learning-Augmented Acceleration of Iterative Ptychographic Reconstruction
  A learned fast-forward operator accelerates iterative ptychographic reconstruction by over twofold in wall-clock time while maintaining comparable quality on temporally held-out experimental data.
- Reconstructing conformal field theoretical compositions with Transformers
  Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.
- Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning
  xMAE pretrains biosignal representations via masked cross-modal reconstruction of temporally ordered signals like ECG and PPG, outperforming baselines on 15 of 19 downstream tasks including cardiovascular prediction and sleep staging.
- Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data
  ViTCG, a channel-grouped Vision Transformer, retrieves AOD from PACE hyperspectral data with 62% lower MSE than prior foundation models while producing spatially coherent fields.
- Sampling two-dimensional spin systems with transformers
  Transformer networks sample up to 180x180 2D Ising systems and 64x64 Edwards-Anderson systems by generating spin groups with probability approximations, yielding ~20x higher effective sample size than prior neural samplers at criticality.
- Rethink MAE with Linear Time-Invariant Dynamics
  Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
- Distributed Multi-View Vision-Only RSSI Estimation
  MulViT-TF uses distributed multi-view vision and Transformer fusion to estimate RSSI, cutting RMSE by up to 26.3% versus single-view baselines in two indoor scenes while using fewer resources.
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
  Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.