pith. machine review for the scientific record.

Super hub · Canonical reference

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Canonical reference. 86% of citing Pith papers cite this work as background.

425 Pith papers cite this work
Background: 86% of classified citations
abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
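The core idea in the abstract, splitting an image into fixed-size patches and feeding the resulting token sequence to a standard Transformer encoder, can be sketched in a few lines. The snippet below is a minimal illustration assuming PyTorch; the sizes (224×224 input, 16×16 patches, 768-dim embeddings) follow ViT-Base, but the layer count, [class] token, and position embeddings are simplified placeholders rather than the paper's exact implementation.

```python
# Minimal sketch of ViT-style patch embedding + Transformer encoding (assumes PyTorch).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)           # (batch, channels, H, W)

patch_size, dim = 16, 768
# A strided convolution is a common way to cut the image into non-overlapping
# 16x16 patches and linearly project each one to an embedding vector.
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

tokens = to_patches(image)                     # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)     # (1, 196, 768): a sequence of patch tokens

cls_token = torch.zeros(1, 1, dim)             # learnable [class] token in the real model
pos_embed = torch.zeros(1, tokens.size(1) + 1, dim)  # learnable position embeddings in the real model
sequence = torch.cat([cls_token, tokens], dim=1) + pos_embed

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,                              # placeholder depth; ViT-Base uses 12 layers
)
features = encoder(sequence)                   # (1, 197, 768); features[:, 0] would feed the classifier head
print(features.shape)
```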

citation-role summary: background 6 · method 1

representative citing papers

DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark

cs.CV · 2026-04-25 · conditional · novelty 9.0

DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRI, featuring exercise-induced anatomical changes and pre-/post-surgery scans; existing models achieve an average Dice score of 0.82 on it.
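For readers unfamiliar with the metric behind the 0.82 figure, the Dice score measures overlap between a predicted and a reference segmentation mask. The sketch below is a generic NumPy illustration, not DyABD's evaluation code; the masks are toy examples.

```python
# Generic Dice score for binary segmentation masks (illustrative only).
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for binary masks A (prediction) and B (ground truth)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / (pred.sum() + target.sum() + eps))

# Toy example: two partially overlapping square masks.
a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[3:7, 3:7] = True
print(dice_score(a, b))  # 0.5625
```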

Dissecting Jet-Tagger Through Mechanistic Interpretability

hep-ph · 2026-05-11 · accept · novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most of the full model's performance and whose residual stream preferentially encodes 2-prong energy correlators.

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

Efficiently Modeling Long Sequences with Structured State Spaces

cs.LG · 2021-10-31 · unverdicted · novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.
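As a rough orientation for how S4-style models process sequences, the sketch below runs a plain discretized linear state-space recurrence, x_k = Ā x_{k-1} + B̄ u_k, y_k = C x_k. The matrices here are arbitrary placeholders; S4's structured (HiPPO-based) parameterization and its fast convolutional evaluation are the paper's contribution and are not reproduced.

```python
# Plain discretized linear SSM recurrence (illustrative placeholders, not S4's parameterization).
import numpy as np

rng = np.random.default_rng(0)
N = 16                               # state size, chosen arbitrarily for illustration
A_bar = 0.9 * np.eye(N)              # placeholder discretized state matrix
B_bar = 0.1 * rng.standard_normal((N, 1))
C = 0.1 * rng.standard_normal((1, N))

def ssm_scan(u):
    """Run x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k over a 1-D input sequence u."""
    x = np.zeros((N, 1))
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append((C @ x).item())
    return np.array(ys)

print(ssm_scan(np.sin(np.linspace(0, 6, 50))).shape)  # (50,)
```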

Can Graphs Help Vision SSMs See Better?

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA results among such models on classification and segmentation tasks.

Automated Detection of Abnormalities in Zebrafish Development

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

A new annotated dataset of zebrafish embryo image sequences enables a spatiotemporal transformer to classify fertility at 98% accuracy and detect compound-induced malformations at 92% accuracy.

citing papers explorer

Showing 7 of 7 citing papers after filters.

  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 18 · internal anchor

    V-JEPA 2, pre-trained on massive unlabeled video, achieves strong results on motion understanding and action anticipation, sets SOTA on video QA at the 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.

  • MAGI-1: Autoregressive Video Generation at Scale cs.CV · 2025-05-19 · unverdicted · none · ref 10 · internal anchor

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially under increasing noise levels, enabling causal, scalable, streaming generation with contexts of up to 4M tokens.

  • VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness cs.CV · 2025-03-27 · accept · none · ref 50 · internal anchor

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, using VLMs, LLMs, and anomaly-detection methods.

  • Unified Video Action Model cs.RO · 2025-02-28 · unverdicted · none · ref 1 · internal anchor

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform fast, accurate policy learning, forward/inverse dynamics, and video generation without performance loss relative to task-specific methods.

  • YOLOv12: Attention-Centric Real-Time Object Detectors cs.CV · 2025-02-18 · unverdicted · none · ref 18 · internal anchor

    YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

  • WorldVLA: Towards Autoregressive Action World Model cs.RO · 2025-06-26 · unverdicted · none · ref 11 · internal anchor

    WorldVLA unifies vision-language-action (VLA) and world models in a single autoregressive system, shows that the two mutually improve each other, and adds an attention mask that curbs error accumulation when generating action chunks.

  • Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 26 · internal anchor

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.