{"work":{"id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","openalex_id":null,"doi":null,"arxiv_id":"2104.09864","raw_key":null,"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","authors":null,"authors_text":"Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu","year":2021,"venue":"cs.CL","abstract":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: \\url{https://huggingface.co/docs/transformers/model_doc/roformer}.","external_url":"https://arxiv.org/abs/2104.09864","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T21:38:01.095656+00:00","pith_arxiv_id":"2104.09864","created_at":"2026-05-08T20:09:09.940464+00:00","updated_at":"2026-05-14T21:38:01.095656+00:00","title_quality_ok":true,"display_title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","render_title":"RoFormer: Enhanced Transformer with Rotary Position Embedding"},"hub":{"state":{"work_id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":68,"external_cited_by_count":null,"distinct_field_count":15,"first_pith_cited_at":"2022-04-05T16:11:45+00:00","last_pith_cited_at":"2026-05-13T12:00:11+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T00:16:14.196339+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":4},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":2},{"context_polarity":"unclear","n":2},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T10:08:42.510920+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":20},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":13},{"title":"Evaluating Large Language Models Trained on 
Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":12},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":11},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":11},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":10},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling","work_id":"9b10667a-da61-4358-aceb-10578234d45d","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"doi: 10.18653/v1/D18-2012","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":7},{"title":"Extending Context Window of Large Language Models via Positional Interpolation","work_id":"c8b6df85-e7da-4bd8-90a4-d309cc2a0f60","shared_citers":7},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":7},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":7},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":7},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":6}],"time_series":[{"n":1,"year":2022},{"n":8,"year":2023},{"n":12,"year":2024},{"n":2,"year":2025},{"n":39,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T10:18:41.189098+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical 
Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T10:08:44.811717+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","claims":[{"claim_text":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RoFormer: Enhanced Transformer with Rotary Position Embedding because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T10:18:47.226401+00:00"}},"summary":{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","claims":[{"claim_text":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RoFormer: Enhanced Transformer with Rotary Position Embedding because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":20},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":13},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":12},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":11},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":11},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":10},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling","work_id":"9b10667a-da61-4358-aceb-10578234d45d","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"doi: 10.18653/v1/D18-2012","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":7},{"title":"Extending Context Window of Large Language Models via Positional Interpolation","work_id":"c8b6df85-e7da-4bd8-90a4-d309cc2a0f60","shared_citers":7},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":7},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":7},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":7},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":6}],"time_series":[{"n":1,"year":2022},{"n":8,"year":2023},{"n":12,"year":2024},{"n":2,"year":2025},{"n":39,"year":2026}],"dependency_candidates":[]},"authors":[]}}
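
Note on the method described in the abstract field above: RoPE applies a position-dependent rotation to each query and key vector so that the attention score between positions m and n depends only on their offset m - n. Below is a minimal NumPy sketch of that idea, added for illustration only; it is not the paper's reference code or the HuggingFace RoFormer implementation, the function name rope_rotate and the shapes are assumptions made for this sketch, and the half-split dimension pairing used here (dim i paired with dim i + d/2) is one common convention, whereas the paper interleaves adjacent dimension pairs.

import numpy as np

def rope_rotate(x, positions, base=10000.0):
    # x: (seq_len, d) with d even; positions: (seq_len,) integer positions.
    # Each pair (x[:, i], x[:, i + d/2]) is rotated by angle positions * theta_i,
    # with theta_i = base^(-2i/d), mirroring the "rotation matrix" encoding of
    # absolute position described in the abstract.
    seq_len, d = x.shape
    half = d // 2
    theta = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = positions[:, None] * theta[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Check the relative-position property the abstract claims: the dot product of a
# rotated query at position m with a rotated key at position n equals the dot
# product of the query rotated by (m - n) with the unrotated key.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
m, n = 7, 3
score = rope_rotate(q, np.array([m])) @ rope_rotate(k, np.array([n])).T
score_rel = rope_rotate(q, np.array([m - n])) @ k.T
assert np.allclose(score, score_rel)

Because the rotation is applied directly to queries and keys rather than added to the attention logits, the same construction composes with linear-attention variants, which is the "equipping the linear self-attention with relative position encoding" property the abstract mentions.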