Rotary Position Embedding for Vision Transformer

Byeongho Heo; Dongyoon Han; Sangdoo Yun; Song Park

arxiv: 2403.13298 · v2 · pith:AY72LLX7new · submitted 2024-03-20 · 💻 cs.CV · cs.LG

Rotary Position Embedding for Vision Transformer

Byeongho Heo , Song Park , Dongyoon Han , Sangdoo Yun This is my paper

classification 💻 cs.CV cs.LG

keywords ropeperformancevisionanalysisembeddingextrapolationlanguagemodels

0 comments

read the original abstract

Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE into ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling Storm-Resolving Atmospheric AI Simulation to the Entire Planet
physics.ao-ph 2026-06 unverdicted novelty 7.0

STRATA is the first autoregressive transformer emulator for global 4.9-km storm-resolving atmospheric dynamics, achieving 50x better energy efficiency than the underlying physics model while producing realistic km-sca...
SAM 3: Segment Anything with Concepts
cs.CV 2025-11 unverdicted novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
cs.RO 2026-05 unverdicted novelty 6.0

StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
SAM 2: Segment Anything in Images and Videos
cs.CV 2024-08 conditional novelty 6.0

SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...
Learning Spatial-Preserving Hierarchical Representations for Digital Pathology
cs.CV 2024-06 unverdicted novelty 6.0

SPAN is a hierarchical attention framework that constructs multi-scale pyramid representations from single-scale patch inputs for WSI classification and segmentation while preserving spatial relationships.
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
cs.RO 2026-05 unverdicted novelty 4.0

StereoPolicy fuses left-right image features via cross-attention to deliver consistent gains over RGB, RGB-D, point cloud, and multi-view baselines in simulation and real-robot manipulation tasks.