Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
arXiv preprint arXiv:2102.10882 (2021)
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.
A hierarchical mesh transformer using topology-guided pretraining on simplicial complexes achieves state-of-the-art results on Alzheimer's classification, amyloid prediction, and focal cortical dysplasia detection from brain meshes.
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
WaveLiT combines wavelet tokenization, linear attention, and multiscale pyramids to produce parameter-efficient neural PDE solvers that match much larger models on TheWell benchmarks.
LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.
USEMA is a hybrid UNet architecture merging CNNs with scalable Mamba-like attention (SEMA) that achieves better efficiency than transformers and superior segmentation accuracy than pure CNN or Mamba models across medical imaging modalities.
GSAM applies random cropping to enable variable input sizes for efficient SAM fine-tuning, claiming lower compute with comparable or higher accuracy on varied datasets.
citing papers explorer
-
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
-
Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider
The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.
-
Hierarchical Mesh Transformers with Topology-Guided Pretraining for Morphometric Analysis of Brain Structures
A hierarchical mesh transformer using topology-guided pretraining on simplicial complexes achieves state-of-the-art results on Alzheimer's classification, amyloid prediction, and focal cortical dysplasia detection from brain meshes.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers
WaveLiT combines wavelet tokenization, linear attention, and multiscale pyramids to produce parameter-efficient neural PDE solvers that match much larger models on TheWell benchmarks.
-
LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers
LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.
-
USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation
USEMA is a hybrid UNet architecture merging CNNs with scalable Mamba-like attention (SEMA) that achieves better efficiency than transformers and superior segmentation accuracy than pure CNN or Mamba models across medical imaging modalities.
-
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes
GSAM applies random cropping to enable variable input sizes for efficient SAM fine-tuning, claiming lower compute with comparable or higher accuracy on varied datasets.