Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Arkabandhu Chowdhury; Chaitanya Ryali; Chen Wei; Christoph Feichtenhofer; Daniel Bolya; Haoqi Fan; Jitendra Malik; Judy Hoffman; Omid Poursaeed; Po-Yao Huang

arXiv: 2306.00989 · v1 · pith:X54ZTFUOnew · submitted 2023-06-01 · cs.CV · cs.LG

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

keywords: hiera, vision, hierarchical, transformer, added, bells-and-whistles, components, models

Abstract

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Zero-Parameter Geometric Gating for Temporally Stable Low-Altitude UAV Video Semantic Segmentation
cs.CV 2026-06 unverdicted novelty 6.0

A RANSAC-based geometric gate routes regions to homography or optical flow warping before SSP fusion, improving mIoU by 4.24-4.91% on synthetic UAVid with only 211K added parameters to frozen backbones.

Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.