pith. machine review for the scientific record.

arxiv: 2309.00071 · v3 · submitted 2023-08-31 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Enrico Shippole, Honglu Fan, Jeffrey Quesnelle

Pith reviewed 2026-05-12 06:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords YaRN · RoPE · context window extension · rotary position embeddings · large language models · LLaMA · extrapolation · fine-tuning efficiency
0 comments

The pith

YaRN extends the context window of RoPE-based language models using 10× fewer training tokens and 2.5× fewer training steps than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces YaRN as a compute-efficient technique to scale rotary position embeddings so that transformer models can process sequences far longer than those seen during pre-training. It demonstrates this on LLaMA models, showing effective utilization of extended contexts and extrapolation beyond the lengths used in fine-tuning. A sympathetic reader would care because current models are limited by their original training length, and extending that limit has previously demanded large amounts of additional data and compute. If the method works as described, it reduces the barrier to building models that handle long documents or conversations without full retraining.

Core claim

YaRN scales the frequencies of rotary position embeddings through a combination of interpolation for in-distribution lengths and adjusted extrapolation for longer ones, allowing LLaMA models to maintain performance on contexts much longer than pre-training while outperforming previous state-of-the-art extension techniques with substantially reduced training cost.

What carries the argument

The YaRN scaling adjustments to rotary embedding frequencies, which modify the base wavelength to balance interpolation within the fine-tuning range and controlled extrapolation beyond it.
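To make that mechanism concrete, here is a minimal numerical sketch of the kind of per-dimension frequency blending described above, not the released implementation. It assumes LLaMA-style defaults (RoPE base 10000, head dimension 128, 4096-token original context); the ramp bounds α = 1, β = 32 and the temperature correction √(1/t) = 0.1 ln s + 1 follow our reading of the paper, so treat the exact constants as assumptions.

```python
import numpy as np

def yarn_frequencies(head_dim=128, base=10000.0, orig_ctx=4096,
                     scale=16.0, alpha=1.0, beta=32.0):
    """Blend interpolated and unmodified RoPE frequencies per dimension.

    Dimensions whose wavelength exceeds the original context (few full
    rotations, r < alpha) are interpolated by 1/scale; dimensions that
    rotate many times within it (r > beta) are left untouched; dimensions
    in between are linearly ramped between the two regimes.
    """
    idx = np.arange(0, head_dim, 2)
    theta = base ** (-idx / head_dim)          # original RoPE frequencies
    wavelength = 2 * np.pi / theta
    r = orig_ctx / wavelength                  # rotations inside the original context
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    theta_yarn = (1.0 - gamma) * theta / scale + gamma * theta
    # attention-temperature correction as reported in the paper:
    # sqrt(1/t) = 0.1 * ln(scale) + 1
    mscale = 0.1 * np.log(scale) + 1.0
    return theta_yarn, mscale
```

The blend is what lets high-frequency dimensions keep their local positional resolution while the low-frequency dimensions absorb the stretched positions.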

If this is right

  • Existing pre-trained models can be adapted to longer contexts with far less additional training data than earlier approaches required.
  • Models using YaRN can process sequences longer than those present in the fine-tuning dataset without explicit training on those lengths.
  • Performance on standard benchmarks at extended lengths surpasses results from previous context-extension methods.
  • The reduced data and step requirements lower the computational barrier for deploying longer-context variants of models like LLaMA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the scaling rule generalizes, it could be applied to other rotary or positional embedding variants to achieve similar efficiency gains.
  • Practical systems could fine-tune once on moderate extensions and then rely on extrapolation for even longer inputs in applications such as document analysis.
  • Testing on tasks that require reasoning across thousands of tokens beyond the fine-tuning range would provide direct evidence for the extrapolation strength.

Load-bearing premise

The scaling adjustments will permit reliable extrapolation to arbitrary lengths beyond both pre-training and fine-tuning data without sudden performance collapse.

What would settle it

A sharp degradation in accuracy on long-context tasks when the input length exceeds the fine-tuning dataset length by a large margin would falsify the extrapolation claim.
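One way to operationalize that test is a sweep over context lengths at increasing multiples of the fine-tuning length, scoring a retrieval probe at each. The sketch below is hypothetical scaffolding: `score_passkey` stands in for whatever long-context probe is used (passkey retrieval, needle-in-haystack, sliding-window perplexity), and the 64k fine-tuning length is a placeholder, not a value taken from the paper.

```python
FINE_TUNE_LEN = 64 * 1024  # placeholder fine-tuning context length

def extrapolation_sweep(model, score_passkey, multipliers=(1, 2, 4, 8, 16)):
    """Score a long-context probe at multiples of the fine-tuning length.

    A gradual decline is compatible with the paper's claim; a cliff
    (accuracy roughly halving between adjacent lengths) is the sharp
    degradation that would falsify it.
    """
    results = {m * FINE_TUNE_LEN: score_passkey(model, m * FINE_TUNE_LEN)
               for m in multipliers}
    lengths = sorted(results)
    cliffs = [longer for shorter, longer in zip(lengths, lengths[1:])
              if results[shorter] > 0 and results[longer] < 0.5 * results[shorter]]
    return results, cliffs
```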

read the original abstract

Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing previous the state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. Code is available at https://github.com/jquesnelle/yarn

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces YaRN, a RoPE-based context extension technique that applies NTK-aware interpolation to low frequencies and linear extrapolation to high frequencies, modulated by a dynamic scaling factor. It claims this yields a compute-efficient fine-tuning procedure (10x fewer tokens and 2.5x fewer steps than prior methods) that lets LLaMA models utilize and extrapolate to context lengths well beyond pre-training, surpasses previous SOTA extension methods, and continues to function when the target length exceeds the fine-tuning dataset length. Open-source code is provided.

Significance. If the reported efficiency and extrapolation results hold under broader scrutiny, the work supplies a practical, low-cost route to longer contexts for existing RoPE models. The claimed reduction in training tokens and steps, together with the ability to exceed fine-tuning lengths, would be directly useful for practitioners who cannot afford full retraining.

major comments (2)
  1. [§4.2–4.3] §4.2–4.3 and associated tables/figures: the long-context perplexity and needle-in-haystack results are shown only up to moderate extensions (roughly 4–8× the fine-tuning length). No experiments or analysis are provided at lengths where the linear high-frequency extrapolation would thin positional resolution further, leaving the central claim of reliable extrapolation “much longer than … pre-training” and “beyond the limited context of a fine-tuning dataset” without direct support.
  2. [§3.1] §3.1, the dynamic scaling factor definition: the factor is length-dependent and therefore implicitly tuned to the target context; the manuscript does not demonstrate that the same factor remains stable or optimal when the test length is increased well beyond the lengths used to choose it, which is load-bearing for the “arbitrary extrapolation” assertion.
minor comments (3)
  1. [Abstract] Abstract: “surpassing previous the state-of-the-art” contains a grammatical error.
  2. [§3] The method section would benefit from an explicit statement of all hyper-parameters (including the precise values of α, β and the dynamic scaling schedule) so that the “compute-efficient” and “parameter-light” claims can be reproduced without reference to the released code.
  3. [Figures 2–4] Figure captions and axis labels in the long-context plots should explicitly indicate which curves correspond to YaRN versus the NTK and linear baselines.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, acknowledging where the manuscript can be strengthened through additional experiments and analysis while defending the core contributions on the basis of the presented results.

read point-by-point responses
  1. Referee: [§4.2–4.3] §4.2–4.3 and associated tables/figures: the long-context perplexity and needle-in-haystack results are shown only up to moderate extensions (roughly 4–8× the fine-tuning length). No experiments or analysis are provided at lengths where the linear high-frequency extrapolation would thin positional resolution further, leaving the central claim of reliable extrapolation “much longer than … pre-training” and “beyond the limited context of a fine-tuning dataset” without direct support.

    Authors: We appreciate the referee's point that our reported evaluations focus on moderate extensions. The needle-in-haystack results do demonstrate successful retrieval at positions exceeding the fine-tuning length, supporting the claim of extrapolation beyond the fine-tuning dataset. Nevertheless, we agree that direct evaluation at lengths where high-frequency linear extrapolation further reduces positional resolution (e.g., 16× or greater) would provide stronger evidence. In the revised manuscript we will add perplexity and needle-in-haystack experiments at 16× and 32× extensions together with a brief analysis of positional resolution behavior at these scales. revision: yes

  2. Referee: [§3.1] §3.1, the dynamic scaling factor definition: the factor is length-dependent and therefore implicitly tuned to the target context; the manuscript does not demonstrate that the same factor remains stable or optimal when the test length is increased well beyond the lengths used to choose it, which is load-bearing for the “arbitrary extrapolation” assertion.

    Authors: The dynamic scaling factor is indeed length-dependent by construction. Once selected for a target extension ratio during fine-tuning, the same factor is applied at inference without retuning. Our existing results already show that this choice supports lengths beyond the fine-tuning dataset. To directly address the concern about stability at arbitrary lengths, we will include an ablation in the revision that evaluates the same scaling factor on test sequences substantially longer than those used for its selection, confirming that performance remains robust without further adjustment. revision: yes
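To make the scaling-factor question concrete, the sketch below shows the kind of length-dependent ("dynamic") scale the referee points at: the factor stays at 1 until the sequence outgrows the original context, then grows with the current length, and the blended frequencies are recomputed from it. It reuses the `yarn_frequencies` sketch from earlier in this review; the 4096-token original context is an assumed default, and this is an illustration rather than the authors' code.

```python
def dynamic_scale(current_len, orig_ctx=4096):
    """Length-dependent scale: 1 until the sequence exceeds the original
    context, then proportional to the overshoot."""
    return max(1.0, current_len / orig_ctx)

# Recompute the blended frequencies (and temperature) whenever the scale
# changes; yarn_frequencies() is the sketch shown earlier in this review.
for current_len in (2048, 8192, 32768, 131072):
    s = dynamic_scale(current_len)
    theta, mscale = yarn_frequencies(scale=s)
    print(current_len, s, round(mscale, 3))
```

The referee's stability question then reduces to whether the frequencies produced at test lengths well beyond the fine-tuning range still preserve enough positional resolution, which is exactly what the promised ablation would measure.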

Circularity Check

0 steps flagged

No circularity: YaRN is an empirical extension method with independent derivation and measured results

full rationale

The paper presents YaRN as a practical combination of NTK-aware interpolation for low frequencies, linear extrapolation for high frequencies, and a dynamic scaling factor, then validates it through fine-tuning experiments on LLaMA models. All performance claims (10x fewer tokens, 2.5x fewer steps, extrapolation beyond fine-tune lengths) are reported as measured outcomes on held-out long-context tasks, not derived by construction from the inputs. No self-citations form load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and the core scaling formulas are stated explicitly rather than smuggled via prior author work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information from abstract alone to identify specific free parameters, axioms, or invented entities used in the YaRN method.

pith-pipeline@v0.9.0 · 5443 in / 1007 out tokens · 67866 ms · 2026-05-12T06:40:26.803358+00:00 · methodology

discussion (0)


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

    stat.ML 2026-05 unverdicted novelty 8.0

    The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

  2. HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

    cs.CL 2026-05 unverdicted novelty 7.0

    Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

  3. Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

    cs.DC 2026-04 unverdicted novelty 7.0

    GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.

  4. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  5. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    cs.CL 2026-04 unverdicted novelty 7.0

    TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

  6. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  7. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

  8. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  9. FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

  10. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  11. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  12. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  13. Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

    cs.CR 2026-04 unverdicted novelty 6.0

    TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

  14. Reducing Detail Hallucinations in Long-Context Regulatory Understanding via Targeted Preference Optimization

    cs.SI 2026-04 unverdicted novelty 6.0

    DetailDPO cuts detail-level hallucination errors in LLMs on long regulatory documents by 42-61% using targeted contrastive pairs on a new 13,000-pair benchmark.

  15. Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model

    cs.LG 2026-04 unverdicted novelty 6.0

    TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.

  16. OPSDL: On-Policy Self-Distillation for Long-Context Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OPSDL improves long-context LLM performance by having the model self-distill from its short-context capability using point-wise reverse KL divergence on generated tokens, outperforming SFT and DPO on benchmarks withou...

  17. Sensitivity-Positional Co-Localization in GQA Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...

  18. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  19. From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

    cs.SE 2026-04 unverdicted novelty 6.0

    A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...

  20. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  21. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  22. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  23. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    cs.CL 2024-04 conditional novelty 6.0

    MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

  24. VIP-COP: Context Optimization for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...

  25. MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

  26. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  27. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  28. gpt-oss-120b & gpt-oss-20b Model Card

    cs.CL 2025-08 unverdicted novelty 5.0

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  29. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  30. Qwen3 Technical Report

    cs.CL 2025-05 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  31. Can LLMs Reason About Attention? Towards Zero-Shot Analysis of Multimodal Classroom Behavior

    cs.HC 2026-04 unverdicted novelty 4.0

    A pipeline uses OpenPose and Gaze-LLE to extract pose and gaze data from classroom videos, deletes the raw footage, and applies an LLM for zero-shot behavioral analysis of student attention.

  32. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  33. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

  34. Qwen2.5-Coder Technical Report

    cs.CL 2024-09 unverdicted novelty 4.0

    Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

  35. Phoenix-VL 1.5 Medium Technical Report

    cs.CL 2026-05 unverdicted novelty 3.0

    Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying comp...

  36. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

  37. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 37 Pith papers · 8 internal anchors

  1. [1]

    S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S

    arXiv: 2204.06745. bloc97. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) con- text size without any fine-tuning and minimal perplexity degradation., 2023a. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_ scaled_rope_allows_llama_models_to_have/. bloc97. Add NTK-Aware interpolation "by parts" correction, 2023b. URL htt...

  2. [2]

    arXiv: 2306.15595. A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S...

  3. [3]

    arXiv: 2204.02311. P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try ARC, the AI2 Reasoning Challenge,

  4. [4]

    arXiv: 1803.05457. T. Computer. Redpajama: An open source recipe to reproduce llama training dataset,

  5. [5]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    arXiv: 2307.08691. emozilla. Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning,

  6. [6]

    arXiv: 1705.03122. C. Han, Q. Wang, W. Xiong, Y . Chen, H. Ji, and S. Wang. LM-Infinite: Simple on-the-fly length generalization for large language models,

  7. [7]

    arXiv: 2308.16137. D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR),

  8. [8]

    Huang, S

    L. Huang, S. Cao, N. Parulian, H. Ji, and L. Wang. Efficient attentions for long document summariza- tion. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1419–1436. Association for Computational Linguistics, June

  9. [9]

    arXiv: 2305.19466. S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, May

  10. [10]

    arXiv: 2305.16300. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. InNeurIPS, pages 8024–8035,

  11. [11]

    arXiv: 2308.12950. 11 P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Compu- tational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana, June

  12. [12]

    arXiv: 2104.09864. Y . Sun, L. Dong, B. Patra, S. Ma, S. Huang, A. Benhaim, V . Chaudhary, X. Song, and F. Wei. A length-extrapolatable transformer,

  13. [13]

    arXiv: 2212.10554. M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ra- mamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY , USA,

  14. [14]

    URL https://huggingface.co/ togethercomputer/LLaMA-2-7B-32K. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models, 2023a. arXiv: 2302.13971. H. Touvron, L. Martin, K. Stone, P. Albert, A. A...

  15. [15]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    arXiv: 2304.11277. 12 A ADDITIONAL DETAILS ON INTERPOLATION METHODS A.1 POSITIONINTERPOLATION As mentioned in Section 2.2, PI is one of the earlier works extending context lengths of RoPE. We include some extra details here: While a direct extrapolation does not perform well on sequences w1,· · ·, w L with L larger than the pre-trained limit, they discove...

  16. [16]

    r 1 t = 0.1 ln(s) + 1≈1.208.(23) Figure 4:Fix s= 8 , compare the LLaMA 7b perplexity on 896 16k-token documents over different scaling 1/ √ t

    is given by the following. r 1 t = 0.1 ln(s) + 1≈1.208.(23) Figure 4:Fix s= 8 , compare the LLaMA 7b perplexity on 896 16k-token documents over different scaling 1/ √ t. The shaded area represents1standard deviation (68%). To show the impact of the factor 1/ √ t on different token positions, we cut each 16k-token document into chunks of 2048 tokens, and f...

  17. [17]

    closely. 15 B ADDITIONAL TABLES AND CHARTS B.1 ABLATIONSTUDY Extension Fine- Training Extension Evaluation Context Window Size Method tuned Steps Scales2048 4096 8192 16384 32768 None✗- -4.05- - - - PI✗- 2k×2 4.36 3.90 - - - NTK-aware✗- 2k×2 4.08 5.97 - - - NTK-by-parts✗- 2k×2 4.12 3.71 - - - YaRN✗- 2k×2 4.07 3.67- - - PI✗- 2k×4 7.09 6.39 6.18 - - NTK-awa...

  18. [18]

    NTK-aware

    and "NTK-aware" Code Llama (Rozière et al., 2023). The results are summarized in Table 7 (with a more detailed plot in Figure 7). Model Model Context Extension Evaluation Context Window Size Size Name Window Method 8192 32768 65536 98304 131072 7B Together 32k PI3.50 2.64>10 2 >10 3 >10 4 7B Code Llama 100k NTK 3.71 2.74 2.55 2.54 2.71 7B YaRN (s=

  19. [19]

    is fine-tuned from Llama 2 7B, trained at 8k context length with PI using the RedPajama dataset (Computer, 2023). 17 0 20000 40000 60000 80000 100000 120000 Context Window 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8Perplexity (lower is better) CodeLlama-13b-hf Yarn-Llama-2-13b-64k Yarn-Llama-2-13b-128k togethercomputer/LLaMA-2-7B-32K CodeLlama-7b-hf Yarn-Llama-2-...