pith. machine review for the scientific record.

arxiv: 2605.14589 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords long-context extension · position embeddings · Rotary Position Embedding · efficient fine-tuning · language model adaptation · sparse positional supervision · context window extension · terminal anchoring

The pith

EndPrompt extends LLM context windows to 64K by training only on short sequences with a terminal prompt anchored at target positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that long-context generalization does not require training on full-length sequences. Instead, EndPrompt keeps the original short text intact as the first segment and appends a brief terminal prompt whose positions are set near the desired target length. This creates both local and long-range relative distances inside short physical inputs while preserving semantic continuity of the training data. A theoretical argument based on Rotary embeddings and the Bernstein inequality establishes that position interpolation imposes smoothness on the attention function, and shared parameters limit unstable extrapolation. Experiments extending LLaMA models from 8K to 64K yield higher average scores on RULER and LongBench than full-length fine-tuning or prior methods, at far lower compute cost.

Core claim

By preserving the short context as an intact first segment and assigning the terminal prompt positional indices near the target length, the construction introduces long-range relative distances within short sequences; combined with the smoothness constraint from interpolation and parameter sharing, this suffices for reliable long-context generalization without dense long-sequence training.

What carries the argument

Two-segment terminal anchoring: the original short context remains the first segment while a brief prompt receives positional indices near the target context length, generating long-range relative distances inside short physical inputs.
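A minimal sketch of what that construction could look like in code, assuming a RoPE-style model that accepts explicit position_ids (as in the Hugging Face transformers convention); the function name, the 128-token prompt length in the example, and the exact offset rule are illustrative, not taken from the paper's released implementation.

```python
# Illustrative sketch of the two-segment positional construction (not the
# authors' released code). Assumes a RoPE-style model that accepts explicit
# position_ids; prompt length and offset rule below are placeholders.
from typing import List, Tuple


def build_endprompt_example(
    context_ids: List[int],     # short training context of length a
    end_prompt_ids: List[int],  # brief terminal prompt of length b
    target_length: int,         # desired context window L, e.g. 64 * 1024
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, position_ids) for one training example."""
    a, b = len(context_ids), len(end_prompt_ids)
    assert a + b <= target_length, "physical length must fit inside the target window"

    # Segment 1 keeps its natural indices 0 .. a-1, so local relative
    # distances inside the short context are unchanged.
    context_positions = list(range(a))

    # Segment 2 is re-indexed to end at target_length - 1, so attention
    # between the two segments sees relative distances close to L.
    prompt_positions = list(range(target_length - b, target_length))

    return context_ids + end_prompt_ids, context_positions + prompt_positions


# Example: an 8K context plus a 128-token terminal prompt anchored at 64K.
ids, pos = build_endprompt_example(list(range(8192)), list(range(128)), 64 * 1024)
assert pos[8191] == 8191 and pos[-1] == 65535
```

The point of the construction is that the physical sequence stays at a + b tokens while the cross-segment relative distances range from roughly L − a − b up to L − 1.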

If this is right

  • Context extension from 8K to 64K is achievable with short training sequences, delivering an average RULER score of 76.03 versus 69.23 for full-length fine-tuning.
  • Memory and compute costs drop substantially because training sequences remain short.
  • Semantic continuity is maintained better than in chunk-based splitting methods.
  • Shared Transformer parameters suppress unstable extrapolation to unobserved intermediate distances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring principle may transfer to position-embedding families other than RoPE if the smoothness constraint generalizes.
  • Researchers with limited hardware could now adapt large models to long contexts that were previously inaccessible.
  • Similar sparse-supervision tricks might improve data efficiency in other sequence modeling domains such as time-series or protein sequences.

Load-bearing premise

Assigning target-length positional indices to a brief terminal prompt appended to short sequences preserves the necessary relative distances and semantic continuity without creating artifacts from the artificial split.

What would settle it

A controlled test on recall tasks that place critical information at intermediate positions between the short context and the terminal prompt, where EndPrompt-trained models show sharply lower accuracy than full-length fine-tuned models.
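As a concrete illustration of that test, here is a hedged sketch of a needle-in-a-haystack style probe that places the critical fact at depths inside the 8K–56K band the training construction never instantiates; the helper name, filler scheme, and scoring rule are assumptions for illustration, not a protocol from the paper.

```python
# Hedged sketch of the proposed falsification test (names, filler scheme, and
# scoring are illustrative assumptions): place a "needle" fact at depths inside
# the relative-distance band that EndPrompt training never instantiates, then
# compare retrieval accuracy across models.
import random


def make_intermediate_recall_prompt(needle: str, depth: int, total_tokens: int,
                                    filler: str = "lorem") -> str:
    """Build a document whose critical fact sits `depth` filler words deep,
    followed by a question that can only be answered by recalling it."""
    assert 0 < depth < total_tokens
    before = " ".join([filler] * depth)
    after = " ".join([filler] * (total_tokens - depth - 1))
    return f"{before}\n{needle}\n{after}\n\nQuestion: what is the secret code?"


# Probe depths falling roughly in the unobserved 8K-56K band.
depths = [12_000, 24_000, 40_000, 56_000]
secret = str(random.randint(1000, 9999))
prompts = [
    make_intermediate_recall_prompt(f"The secret code is {secret}.", d, 60_000)
    for d in depths
]
# Each model answer would then be scored by whether it contains `secret`;
# a sharp accuracy drop for EndPrompt-trained models at these depths, relative
# to full-length fine-tuned baselines, would be the decisive signal.
```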

Figures

Figures reproduced from arXiv: 2605.14589 by Dawei Yin, Fang Wang, Han Tian, Haoyi Xiong, Jiamin Chen, Jiashu Zhao, Jinman Zhao, Luxuan Chen, Rui Kong, Shuaiqiang Wang, Xinran Chen, Yuchen Li.

Figure 1
Figure 1: Overview of the proposed method. Accompanying text (Section 3.2, Positional Index Manipulation): Let L denote the target context length. Given a short context sequence x = (x_0, x_1, …, x_{a−1}) of length a, an end prompt e = (e_0, e_1, …, e_{b−1}) of length b is appended to form the physical training sequence y = (x_0, x_1, …, x_{a−1}, e_0, e_1, …, e_{b−1}), with |y| = a + b (Eq. 4). While the physical length is a + b, the assigned positional…
Figure 3
Figure 3: Performance comparison among the standard ET, Positional Skip-Embedding, and ET (PoSE) across the LongBench and RULER benchmarks. Accompanying text (Section 4.4, Structural Analysis): To evaluate compatibility, the proposed approach is compared against Positional Skip-Embedding, a method that extends the context window by partitioning inputs into chunks and manipulating…
Figure 4
Figure 4: Comparison of memory footprint and time consumption across different methods.
Original abstract

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text--a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes EndPrompt for extending LLM context windows from 8K to 64K using only short sequences: the original short context is preserved as the first segment and a brief terminal prompt is appended as the second segment with positional indices assigned near the target length. This introduces long-range relative distances within short physical inputs while claiming to maintain semantic continuity. A theoretical analysis based on RoPE and the Bernstein inequality is used to argue that position interpolation imposes a smoothness constraint on attention, with shared parameters suppressing unstable extrapolation. Empirically, on LLaMA-family models the method reports an average RULER score of 76.03 and the highest LongBench average, outperforming LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) at lower compute cost. Code is released.

Significance. If the central claim holds, the work would be significant for demonstrating that sparse positional supervision can induce reliable long-context generalization, lowering the barrier to context extension. The benchmark improvements and code availability are concrete strengths that support reproducibility. The result would challenge the assumption that dense long-sequence training is required, provided the theoretical analysis can be shown to cover the actual extrapolation performed.

major comments (3)
  1. [Theoretical analysis] Theoretical analysis section: the smoothness constraint is derived under position interpolation via the Bernstein inequality, but the EndPrompt construction assigns terminal-prompt indices near 64K on short sequences, producing direct extrapolation over large unobserved relative-position deltas (approximately 8K–56K). The bound therefore does not automatically transfer to the precise distances the model encounters at inference.
  2. [Experiments] Experiments section (RULER and LongBench tables): the reported gains (76.03 on RULER, highest on LongBench) are given without error bars, number of runs, or statistical significance tests, making it impossible to determine whether the improvements over full-length fine-tuning (69.23) are robust rather than within noise.
  3. [Method] Method section: the two-segment construction is described at a high level, but the exact length of the terminal prompt, the precise rule for choosing its positional indices, and the data-sampling procedure that produces the short sequences are not specified; these details are load-bearing for verifying that semantic continuity is preserved and that the sparse-supervision claim can be reproduced.
minor comments (2)
  1. [Abstract] Abstract and method: training hyperparameters (learning rate, batch size, number of steps) and the exact composition of the training data are omitted, which should be added for reproducibility even if the central claim is sound.
  2. [Notation] Notation: the paper uses “target context length” without consistently defining whether it refers to 64K tokens or a different value in the equations; a single clarifying sentence would remove ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis section: the smoothness constraint is derived under position interpolation via the Bernstein inequality, but the EndPrompt construction assigns terminal-prompt indices near 64K on short sequences, producing direct extrapolation over large unobserved relative-position deltas (approximately 8K–56K). The bound therefore does not automatically transfer to the precise distances the model encounters at inference.

    Authors: We appreciate this observation regarding the distinction between interpolation and extrapolation. Our analysis uses the Bernstein inequality to establish a smoothness constraint under position interpolation as a foundation, and we claim that the shared parameters help control extrapolation. However, we recognize that a direct application to the large relative deltas in EndPrompt requires further elaboration. In the revised manuscript, we will expand the theoretical section to discuss the specific extrapolation distances and provide additional reasoning on how the smoothness property extends to the terminal anchoring setup. revision: partial

  2. Referee: [Experiments] Experiments section (RULER and LongBench tables): the reported gains (76.03 on RULER, highest on LongBench) are given without error bars, number of runs, or statistical significance tests, making it impossible to determine whether the improvements over full-length fine-tuning (69.23) are robust rather than within noise.

    Authors: We agree that the absence of error bars and statistical analysis makes it difficult to assess the reliability of the reported improvements. We will revise the experiments section to include results from multiple independent runs (at least three seeds) with standard deviations, and we will add statistical significance tests (e.g., paired t-tests) comparing EndPrompt to the baselines to confirm that the gains are robust. revision: yes

  3. Referee: [Method] Method section: the two-segment construction is described at a high level, but the exact length of the terminal prompt, the precise rule for choosing its positional indices, and the data-sampling procedure that produces the short sequences are not specified; these details are load-bearing for verifying that semantic continuity is preserved and that the sparse-supervision claim can be reproduced.

    Authors: We concur that these details are essential for reproducibility. In the updated method section, we will specify that the terminal prompt consists of 128 tokens, its positional indices are assigned consecutively starting from position (target_length - 128), and the short sequences are sampled by selecting contiguous segments from the original training data for the first segment while appending a fixed terminal prompt template. This preserves the semantic integrity of the primary context. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper's central claim rests on the two-segment construction (short context plus terminal prompt with target-length indices) plus empirical results on RULER (76.03) and LongBench (highest average), which are independent of any internal fitted quantities. The theoretical analysis invokes standard RoPE properties and the Bernstein inequality to argue for smoothness under position interpolation; this is an application of external mathematics rather than a self-definition or a fitted parameter renamed as a prediction. No load-bearing step reduces the claimed long-context generalization to quantities defined inside the paper, and benchmark comparisons supply external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard rotary embedding mathematics and benchmark evaluation rather than new free parameters, axioms, or invented entities.

axioms (1)
  • standard math Rotary Position Embedding induces a smoothness constraint on the attention function under position interpolation, bounded by the Bernstein inequality.
    Invoked in the theoretical analysis to justify stability of the two-segment construction.
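To make the shape of that axiom concrete, the following is a schematic LaTeX reconstruction of the standard interpolation-smoothness argument; the paper's exact statement, constants, and assumptions may differ, so treat this as a sketch rather than the authors' derivation.

```latex
% Schematic sketch (assumed form; the paper's precise statement may differ).
% Under RoPE, the query-key score is a trigonometric sum in the relative
% distance, with complex coefficients h_j built from the query and key vectors:
\[
  a(\delta) \;=\; \operatorname{Re}\!\Big[\sum_{j=1}^{d/2} h_j\, e^{\,i\,\delta\,\theta_j}\Big],
  \qquad \delta = m-n, \qquad \theta_j = \theta_{\mathrm{base}}^{-2(j-1)/d},
\]
% so a Bernstein-type inequality for exponential sums bounds its rate of change
% by the largest frequency present:
\[
  \max_{\delta}\,\lvert a'(\delta)\rvert \;\le\; \theta_{\max}\,\max_{\delta}\,\lvert a(\delta)\rvert .
\]
% Position interpolation rescales indices by the extension factor
% s = L_{target}/L_{orig}, equivalent to \theta_j \mapsto \theta_j/s, so the
% bound contracts by s; that contraction is the smoothness constraint recorded
% above. Extending it to the unobserved intermediate distances raised in the
% referee report still requires the paper's parameter-sharing argument.
```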

pith-pipeline@v0.9.0 · 5620 in / 1168 out tokens · 59858 ms · 2026-05-15T01:26:59.577618+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 10 internal anchors

  1. [1]

    LongAlign: A Recipe for Long Context Alignment of Large Language Models

    Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. LongAlign: A recipe for long context alignment of large language models. arXiv preprint arXiv:2401.18058, 2024

  2. [2]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Wu, Yuxiao Mao, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  3. [3]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  4. [4]

    LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

    Ilias Chalkidis, Abhik Jana, Eirini Dirani, et al. Lexglue: A benchmark dataset for legal language understanding in English. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022

  5. [5]

    L-Eval: Instituting Standardized Evaluation for Long Context Language Models

    Chenhui Chen et al. L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023

  6. [6]

    CLEX: Continuous Length Extrapolation for Large Language Models

    Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, and Lidong Bing. CLEX: Continuous length extrapolation for large language models. In International Conference on Learning Representations, 2024

  7. [7]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  8. [8]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023

  9. [9]

    LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. LongLoRA: Efficient fine-tuning of long-context large language models. In International Conference on Learning Representations, 2024

  10. [10]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  12. [12]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems, volume 35, pages 32082–32098, 2022

  13. [13]

    A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy Kinney, Arman Cohan, Noah A. Smith, and Hajishirzi Hannaneh. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021

  14. [14]

    LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

    Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024

  15. [15]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.22706, 2024

  16. [16]

    Deepseek-coder: When the large language model meets programming

    Daya Guo, Hao Qi, Jian Yin, Tie Dong, et al. Deepseek-coder: When the large language model meets programming. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  17. [17]

    LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM-Infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024

  18. [18]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021

  19. [19]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Cantrell, Dan Rosenberg, Tony He, Benoit Dupuis, Richard M Taylor, Joshua Ainslie, Amin Mahdavi, Maciej Misra, et al. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  20. [20]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  21. [21]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  22. [22]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Emmons, Craig Brackman, et al. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, 2024

  23. [23]

    The NarrativeQA Reading Comprehension Challenge

    Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018

  24. [24]

    Ring attention with blockwise transformers for near-infinite context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, 2024

  25. [25]

    A Controlled Study on Long Context Extension and Generalization in LLMs

    Xingyu Lu et al. A controlled study on long context extension and generalization in llms. arXiv preprint arXiv:2401.06951, 2024

  26. [26]

    NTK-Aware Scaled RoPE Allows LLaMA Models to Have Extended (8k+) Context Size Without Any Fine-Tuning and Minimal Perplexity Degradation

    Bowen Peng and Jeffrey Quesnelle. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023

  27. [27]

    Yarn: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In International Conference on Learning Representations, 2024

  28. [28]

    ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

    Uri Shaham, Yanai Elazar, Maor Ivgi, Ori Rubin, Jonathan Berant, and Omer Levy. ZeroSCROLLS: A zero-shot benchmark for long text understanding. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7961–7979, 2023

  29. [29]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Murtuza Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yuwei Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  30. [30]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys (CSUR), 55(6):1–28, 2022

  31. [31]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  32. [32]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024

  33. [33]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3472–3483, 2019

  34. [34]

    Soaring from 4k to 400k: Extending llm’s context with activation beacon

    Peitian Zhang, Zheng Zheng, Jianbo Gao, Boda Yao, Huan Luan, et al. Soaring from 4k to 400k: Extending llm’s context with activation beacon. arXiv preprint arXiv:2401.03462, 2024

  35. [35]

    PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

    Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. PoSE: Efficient context window extension of LLMs via positional skip-wise training. In International Conference on Learning Representations, 2024