Recognition: 2 theorem links · Lean Theorem
Extending Context Window of Large Language Models via Positional Interpolation
Pith reviewed 2026-05-13 10:14 UTC · model grok-4.3
The pith
Position Interpolation extends RoPE-based LLMs to a 32,768-token context window with minimal fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. The paper's theoretical study shows that the upper bound of interpolation is at least ~600× smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimization and infrastructure.
What carries the argument
Position Interpolation, a linear down-scaling of position indices during fine-tuning to keep them within the pretrained range.
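A minimal sketch of that machinery, assuming a standard RoPE formulation with base θ = 10000; the helper names below are illustrative, not the paper's reference code.

```python
# Hedged sketch of Position Interpolation applied to RoPE position indices.
# Assumptions: standard RoPE angles m * theta^(-2i/d) with theta = 10000;
# function names are illustrative, not the authors' implementation.
import torch

def rope_angles(positions: torch.Tensor, dim: int, theta: float = 10000.0) -> torch.Tensor:
    """Rotary angle for each (position, frequency) pair."""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))  # (dim/2,)
    return torch.outer(positions.float(), freqs)                      # (seq_len, dim/2)

def interpolated_positions(seq_len: int, train_len: int = 2048) -> torch.Tensor:
    """PI: linearly down-scale indices so they never leave [0, train_len)."""
    positions = torch.arange(seq_len, dtype=torch.float32)
    if seq_len <= train_len:
        return positions                      # original behaviour is untouched
    return positions * (train_len / seq_len)  # scale by 1/s, with s = L_new / L_train

# Extrapolation would feed raw indices up to 32767 into RoPE; interpolation
# keeps every index inside the pretrained range, at the cost of finer spacing.
angles_extrapolated = rope_angles(torch.arange(32768), dim=128)
angles_interpolated = rope_angles(interpolated_positions(32768, train_len=2048), dim=128)
```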
If this is right
- The extended models support context windows up to 32768 tokens.
- Only minimal fine-tuning within 1000 steps is required.
- Performance on original context length tasks remains relatively unchanged.
- Results hold for models ranging from 7B to 65B parameters.
- Existing infrastructure and optimizations can be reused without changes.
Where Pith is reading between the lines
- This technique may generalize to other position embedding methods beyond RoPE.
- Further extensions to even longer contexts could be possible by applying similar scaling.
- Integration with other long-context methods might yield additional gains without full retraining.
- Deployment in production systems becomes simpler due to architectural compatibility.
Load-bearing premise
Linear down-scaling of position indices during fine-tuning reliably avoids high attention scores without introducing new failure modes on real data.
What would settle it
Measuring whether attention scores on 32,768-token sequences remain bounded comparably to those on shorter sequences, after applying the interpolation and fine-tuning, would confirm or refute the stability claim.
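A sketch of that measurement, under the assumption that query/key tensors would be hooked out of the extended model on 2,048- and 32,768-token inputs; the random tensors below are stand-ins so the snippet runs on its own.

```python
# Hedged sketch of the stability check: compare the largest pre-softmax
# attention logit on a short (in-range) input vs a long (extended) input.
# In a real run q/k would come from the fine-tuned model; the random
# tensors here are placeholders so the snippet is self-contained.
import torch

def max_attention_logit(q: torch.Tensor, k: torch.Tensor, block: int = 1024) -> float:
    """Largest score max_ij (q_i . k_j) / sqrt(d), computed block-wise to bound memory."""
    d = q.shape[-1]
    best = float("-inf")
    for i in range(0, q.shape[0], block):
        scores = q[i:i + block] @ k.transpose(-2, -1) / d ** 0.5
        best = max(best, scores.max().item())
    return best

d_head = 128
q_short, k_short = torch.randn(2048, d_head), torch.randn(2048, d_head)    # placeholder q/k
q_long, k_long = torch.randn(32768, d_head), torch.randn(32768, d_head)    # placeholder q/k

ratio = max_attention_logit(q_long, k_long) / max_attention_logit(q_short, k_short)
print(ratio)  # a ratio near 1 after interpolation + fine-tuning would support stability;
              # a large blow-up under naive extrapolation would not
```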
read the original abstract
We present Position Interpolation (PI) that extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B. Meanwhile, the extended model by Position Interpolation preserve quality relatively well on tasks within its original context window. To achieve this goal, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain its original architecture and can reuse most pre-existing optimization and infrastructure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Position Interpolation (PI), a method to extend the context window of RoPE-based pretrained LLMs (e.g., LLaMA 7B–65B) to 32,768 tokens via linear down-scaling of position indices during ~1000 steps of fine-tuning. It reports strong empirical performance on long-context tasks including passkey retrieval, language modeling, and document summarization, while preserving quality on original-length inputs; a theoretical analysis claims the attention-score upper bound under interpolation is ~600× smaller than under extrapolation.
Significance. If the empirical results and bound hold under broader scrutiny, the work provides a lightweight, architecture-preserving route to longer contexts that reuses existing infrastructure and pretraining, with direct applicability to production LLMs.
major comments (2)
- [§4] §4 (theoretical bound): the ~600× smaller upper bound on attention scores is derived under the assumption that scaled angles remain within the trained regime, but the derivation does not quantify the resulting loss in angular resolution for small relative distances: scaling by 1/s compresses minimal angle differences proportionally (a back-of-envelope check of this compression appears after this list). This bears directly on whether adjacent-token distinctions remain recoverable after fine-tuning.
- [Experiments] Experimental section: results on passkey retrieval and summarization are reported without explicit baseline comparisons, exact metric definitions, or ablations on the scaling factor s; without these it is impossible to determine whether the observed gains are attributable to PI or to the fine-tuning regime itself.
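To make the angular-resolution point concrete, here is a back-of-envelope check under assumed LLaMA-style RoPE parameters (θ = 10000, head dimension 128): scaling positions by 1/s shrinks the fastest per-token rotary step by exactly that factor.

```python
# Hedged numeric check of the resolution compression noted above.
# Assumed parameters: theta = 10000, head dimension 128 (LLaMA-style).
import torch

theta, dim = 10000.0, 128
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

s = 32768 / 2048                     # scaling factor s = L_new / L_train = 16
step_original = freqs.max().item()   # fastest per-token angle step: 1.0 rad
step_interp = step_original / s      # after interpolation: 0.0625 rad

print(step_original, step_interp)    # 1.0 0.0625
```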
minor comments (2)
- [§3] Notation for the scaling factor s is introduced without a dedicated equation; a single displayed equation defining s = L_new / L_train would improve clarity (a suggested form appears after this list).
- [Conclusion] The claim that 'most pre-existing optimization and infrastructure' can be reused is stated but not supported by any concrete compatibility checks or timing measurements.
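A possible form of the displayed equation requested in the first minor comment; this notation is the reviewer's suggestion, not a quotation from the manuscript.

```latex
% Suggested (reviewer's) displayed definition: the scaling factor s and the
% interpolated RoPE position mapping derived from it.
\[
  s \;=\; \frac{L_{\text{new}}}{L_{\text{train}}},
  \qquad
  f'(\mathbf{x}, m) \;=\; f\!\left(\mathbf{x}, \frac{m}{s}\right),
  \quad 0 \le m < L_{\text{new}}.
\]
```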
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our theoretical analysis and experimental results. We address each point below and will incorporate the suggested revisions in the next version of the manuscript.
read point-by-point responses
-
Referee: [§4] §4 (theoretical bound): the ~600× smaller upper bound on attention scores is derived under the assumption that scaled angles remain within the trained regime, but the derivation does not quantify the resulting loss in angular resolution for small relative distances (scaling by 1/s compresses minimal angle differences proportionally); this directly affects whether adjacent-token distinctions remain recoverable after fine-tuning.
Authors: We appreciate this observation on the angular resolution implications. Section 4 derives the upper bound to establish stability relative to extrapolation, showing that interpolation keeps attention scores within a regime where fine-tuning can succeed. While the scaling by 1/s does compress small-angle differences, our empirical results demonstrate that the ~1000-step fine-tuning recovers adjacent-token distinctions effectively, as evidenced by preserved short-context performance and strong long-context results. In the revision we will add a short paragraph in §4 quantifying the resolution compression (via the factor 1/s) and include a brief empirical check of short-range attention patterns before and after fine-tuning to illustrate recoverability. revision: yes
-
Referee: [Experiments] Experimental section: results on passkey retrieval and summarization are reported without explicit baseline comparisons, exact metric definitions, or ablations on the scaling factor s; without these it is impossible to determine whether the observed gains are attributable to PI or to the fine-tuning regime itself.
Authors: We agree that additional experimental details are needed for clarity. The current manuscript reports absolute performance on passkey retrieval (exact match accuracy) and summarization (ROUGE-1/2/L), but lacks explicit baselines and ablations. In the revised version we will (1) add baseline comparisons including naive fine-tuning without interpolation and results from contemporaneous methods, (2) explicitly define all metrics in the experimental section, and (3) include an ablation table varying the scaling factor s while keeping the fine-tuning budget fixed. These changes will isolate the contribution of Position Interpolation from the fine-tuning procedure itself. revision: yes
Circularity Check
No significant circularity; method and bound are independently derived
full rationale
The paper introduces Position Interpolation as an explicit algorithmic change (linear down-scaling of position indices) applied to existing RoPE embeddings, accompanied by a separate mathematical analysis deriving the ~600x smaller attention-score upper bound for interpolation versus extrapolation. Neither the proposal nor the bound reduces to a fitted parameter, self-definition, or load-bearing self-citation; the fine-tuning step is presented as empirical adaptation rather than a constructed prediction. The derivation chain remains self-contained against external RoPE properties and reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: RoPE positional encodings maintain useful relative attention properties under linear index interpolation.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.PhiForcing.phi_equation · unclear · "Our theoretical study shows that the upper bound of interpolation is at least ∼600× smaller than that of extrapolation"
Forward citations
Cited by 29 Pith papers
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
-
Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks
Jordan-RoPE realizes a non-semisimple relative positional operator that produces coupled oscillatory-polynomial features such as d e^{i omega d} for causal query-key lags.
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Screening Is Enough
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
-
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
-
Remember to Forget: Gated Adaptive Positional Encoding
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL lets LLMs self-generate sparse layer updates via meta-RL to consolidate knowledge from context, outperforming prompting and fine-tuning baselines on QA and long-context tasks while aligning updates with high-Fish...
-
Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
-
Long Context Transfer from Language to Vision
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
-
MemGPT: Towards LLMs as Operating Systems
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
-
Efficient Streaming Language Models with Attention Sinks
StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
-
YaRN: Efficient Context Window Extension of Large Language Models
YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...
-
VIP-COP: Context Optimization for Tabular Foundation Models
VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...
-
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
-
Decouple and Cache: KV Cache Construction for Streaming Video Understanding
DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.
-
Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models
Adaptive 3D-RoPE adapts rotary positional encoding to wireless channel physics via learnable 3D frequencies and dynamic CSI control, yielding up to 10.7 dB NMSE gains in scale extrapolation and 1 dB in zero-shot tasks.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
-
A Survey of Context Engineering for Large Language Models
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Reference graph
Works this paper leans on
-
[1]
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. Rethinking attention with Performers. In 9th International Conference on Learning Representations, ICLR 2021. Open...
work page 2021
-
[2]
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems,
-
[3]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
work page · Pith review · arXiv
-
[4]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,
work page · Pith review · arXiv
-
[5]
Efficient attentions for long document summarization
Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pp. 1419–1436, Online, June
work page 2021
-
[6]
Atlas: Few-shot learning with retrieval augmented language models
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models
-
[7]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Association for Computational Linguistics,
work page 2020
-
[8]
Relevance-guided supervision for OpenQA with ColBERT
Omar Khattab, Christopher Potts, and Matei Zaharia. Relevance-guided supervision for OpenQA with ColBERT. Transactions of the Association for Computational Linguistics, 9:929–944,
work page 2020
-
[9]
Reformer: The Efficient Transformer
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR
work page · Pith review
-
[10]
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels, Belgium, November
work page 2018
-
[11]
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July
work page · Pith review
-
[12]
Landmark attention: Random-access infinite context length for transformers
Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300,
-
[13]
Combiner: Full attention transformer with sparse computation cost
Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, and Bo Dai. Combiner: Full attention transformer with sparse computation cost. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information P...
work page 2021
-
[14]
ColBERTv2: Effective and efficient retrieval via lightweight late interaction
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3715–3734, Seattle, United States,
work page 2022
-
[15]
SCROLLS: Standardized CompaRison Over Long Language Sequences
Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. SCROLLS: Standardized CompaRison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Langu...
-
[16]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding,
work page 2022
-
[17]
Linformer: Self-attention with linear complexity
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity
work page 2020
-
[18]
Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, April
work page 2022
-
[19]
Big bird: Transformers for longer sequences
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Proc...
work page 2020
discussion (0)