pith. machine review for the scientific record.

arxiv: 2108.12409 · v2 · submitted 2021-08-27 · 💻 cs.CL

Recognition: 3 theorem links

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Mike Lewis, Noah A. Smith, Ofir Press

Pith reviewed 2026-05-13 00:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords transformer · attention · position bias · length extrapolation · ALiBi · language modeling · perplexity

The pith

Attention with linear biases enables transformer models to extrapolate to input sequences twice as long as those seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple change to position handling in transformers supports extrapolation beyond training lengths. ALiBi replaces positional embeddings with a linear distance-based penalty on attention scores. A 1.3 billion parameter ALiBi model trained on sequences of 1024 tokens then matches, on 2048-token inputs, the perplexity of a sinusoidal model trained on 2048 tokens, while training 11% faster and using 11% less memory. The built-in recency preference further improves results on WikiText-103.

Core claim

By adding a fixed negative slope bias to query-key attention scores based on token distance, ALiBi lets models train on length 1024 and extrapolate to length 2048 while matching the perplexity of models trained directly on the longer length.

What carries the argument

Attention with Linear Biases (ALiBi): a bias term subtracted from attention scores that grows linearly with the distance between each query and key position.
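
As a minimal sketch of the mechanism described above, assuming a causal language model and a single attention head: the bias for query position i and key position j is -slope * (i - j), and future keys are masked out. The function names `alibi_bias` and `biased_attention_weights` and the slope value 0.5 are illustrative assumptions, not taken from the paper's code.

```python
import math

def alibi_bias(seq_len, slope):
    """Return a seq_len x seq_len matrix of linear distance biases for one head.

    Entry [i][j] is -slope * (i - j) for keys j <= i, and -inf for future
    keys (causal mask), so the penalty grows linearly with distance.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

def biased_attention_weights(scores, slope):
    """Add the linear bias to raw query-key scores, then softmax each row."""
    n = len(scores)
    bias = alibi_bias(n, slope)
    weights = []
    for i in range(n):
        row = [scores[i][j] + bias[i][j] for j in range(n)]
        row_max = max(row)  # finite: the diagonal entry is never masked
        exps = [math.exp(x - row_max) for x in row]  # masked entries become 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With all raw scores equal, the bias alone decides the weights:
# nearer keys receive more attention than distant ones.
uniform_scores = [[0.0] * 4 for _ in range(4)]
w = biased_attention_weights(uniform_scores, slope=0.5)
```

No positional embedding appears anywhere in this sketch; position enters only through the bias added to the scores, which is why the same function can be called with a longer `seq_len` at inference time.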

Load-bearing premise

A single fixed linear bias slope applied to attention scores is sufficient to produce reliable extrapolation across model sizes and sequence lengths without further changes to the model or training.

What would settle it

Train an ALiBi model on length 1024 and evaluate on length 2048; if its perplexity exceeds that of a sinusoidal model trained and tested on length 2048, the extrapolation claim fails.
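
The check described above comes down to comparing two perplexities measured on the same held-out text. A minimal sketch, assuming per-token negative log-likelihoods (natural log) have already been collected from each model; the function names and the `tolerance` parameter are hypothetical, and the numbers are placeholders, not results from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def extrapolation_claim_holds(alibi_nlls, sinusoidal_nlls, tolerance=0.0):
    """Return True if ALiBi's perplexity does not exceed the baseline's.

    alibi_nlls:      losses of the model trained at 1024, evaluated at 2048.
    sinusoidal_nlls: losses of the model trained and evaluated at 2048.
    tolerance:       hypothetical slack for run-to-run noise.
    """
    return perplexity(alibi_nlls) <= perplexity(sinusoidal_nlls) + tolerance

# Placeholder losses, for illustration only.
alibi_nlls = [2.9, 3.1, 3.0]
sinusoidal_nlls = [3.0, 3.2, 3.1]
print(extrapolation_claim_holds(alibi_nlls, sinusoidal_nlls))
```

Because the headline runs were performed once, any nonzero `tolerance` would have to be justified from smaller-scale repeated runs rather than from the 1.3B experiment itself.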

original abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Attention with Linear Biases (ALiBi), a position method that adds a fixed linear penalty to query-key attention scores proportional to their distance, rather than injecting positional embeddings into the input. It claims this enables length extrapolation: a 1.3B-parameter model trained on sequences of length 1024 with ALiBi achieves the same perplexity on length-2048 inputs as a sinusoidal-embedding model trained directly on 2048-length sequences, while training 11% faster and using 11% less memory. ALiBi is also reported to outperform several strong position baselines on WikiText-103 due to its recency bias.

Significance. If the empirical claims hold under broader conditions, the result is significant for practical scaling of language models: it offers a lightweight way to decouple training length from inference length, yielding concrete efficiency gains without architectural changes. The approach is simple to implement and the reported speed/memory savings plus benchmark improvements provide a falsifiable, reproducible contribution to the position-embedding literature.

major comments (3)
  1. [§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n), for heads h = 1, ..., n, is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This bears directly on the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven.
  2. [§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.
  3. [§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.
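
To make the head-count question in the first major comment concrete, here is a small sketch of how a fixed schedule shifts as the number of heads changes. It assumes the geometric schedule as the paper describes it (for n heads, a sequence that starts at 2^(-8/n) and uses that same value as its ratio); the function name `alibi_slopes` is illustrative, and any special handling the released code may apply to particular head counts is not modelled here.

```python
def alibi_slopes(n_heads):
    """Geometric slopes m_h = 2^(-8h/n) for h = 1..n (assumed schedule)."""
    if n_heads < 1:
        raise ValueError("n_heads must be positive")
    start = 2.0 ** (-8.0 / n_heads)  # first slope, also the common ratio
    return [start ** h for h in range(1, n_heads + 1)]

# The smallest slope is 2^-8 for every head count; what changes is how
# finely the range between the largest slope and 2^-8 is subdivided.
for n in (8, 16, 32):
    slopes = alibi_slopes(n)
    print(n, slopes[0], slopes[-1])
```

Under this assumption the schedule has no free parameter once the head count is fixed, so a sensitivity analysis would amount to varying the exponent constant (8 here) or the head count and re-running the extrapolation evaluation.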
minor comments (3)
  1. [Figure 2] Figure 2 (attention visualization): the color scale and axis labels are not defined in the caption, making it hard to interpret the claimed recency bias.
  2. [§2] Related-work section (§2): the discussion of prior linear-bias or distance-based attention methods omits several recent works on relative position representations that appeared after Vaswani et al. (2017).
  3. [§3] Notation: the symbol m_h is introduced without an explicit equation number; adding 'Eq. (3)' would improve readability when the slope formula is referenced later.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical support and reproducibility of the work.

point-by-point responses
  1. Referee: [§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n), for heads h = 1, ..., n, is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This bears directly on the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven.

    Authors: We appreciate this observation. The slope schedule was derived from preliminary experiments on smaller models and then applied without modification to all larger models reported in the paper. To directly address the request for sensitivity analysis, the revised manuscript will include new results (in an expanded §3 or appendix) testing the same fixed schedule across varying head counts (8–32 heads), model dimensions, and extrapolation ratios up to 4× on models up to 350M parameters. These additions will provide concrete evidence that the heuristic generalizes without per-model retuning. revision: yes

  2. Referee: [§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.

    Authors: We agree that greater experimental transparency is needed. The revision will add the precise training corpus composition and data splits, the full hyperparameter configuration for the 1.3B model, and an explicit statement that the large-model runs were performed once owing to compute cost. We will also note that smaller-scale ablations (reported in the appendix) were repeated with multiple seeds and exhibited the same qualitative trends. We cannot, however, supply error bars for the 1.3B setting itself. revision: partial

  3. Referee: [§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.

    Authors: We thank the referee for catching this ambiguity. All models—including the sinusoidal, rotary, and learned-position baselines—were trained with a maximum sequence length of 1024 tokens. WikiText-103 evaluation used the test set’s native (sometimes longer) sequences to measure extrapolation, but training length was identical across methods. The revised §4.2 will state this explicitly, removing any possibility of misinterpretation. revision: yes

standing simulated objections not resolved
  • The 1.3B-parameter experiments were run with only a single random seed due to prohibitive computational cost; consequently we cannot supply error bars or quantify sensitivity to initialization for the headline result.

Circularity Check

0 steps flagged

No significant circularity in the empirical evaluation of ALiBi.

full rationale

The paper introduces ALiBi as a position method that adds a fixed linear bias to attention scores and demonstrates its effectiveness through direct empirical comparison: a 1.3B model trained at length 1024 achieves equivalent perplexity on length 2048 to a sinusoidal baseline trained at 2048, with reported efficiency gains. The slope schedule is a fixed, predetermined choice (geometric progression across heads) presented as part of the method definition rather than fitted to the extrapolation results themselves. No equation or claim reduces the reported perplexity values to a parameter or quantity defined by the same experiment, and the central result remains an independent experimental outcome rather than a tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on the empirical effectiveness of a linear attention bias whose slope is a tunable hyperparameter; no new entities are postulated and the background assumptions are standard transformer components.

free parameters (1)
  • linear bias slope
    The rate at which the distance penalty increases must be chosen for each attention head (the paper uses a fixed geometric progression across heads) to achieve the reported extrapolation; these values are not derived from first principles.
axioms (1)
  • domain assumption: Adding a fixed linear penalty to query-key dot products preserves the core attention mechanism and training dynamics of the transformer.
    Invoked when the authors replace positional embeddings with the bias without further architectural changes.

pith-pipeline@v0.9.0 · 5495 in / 1345 out tokens · 57592 ms · 2026-05-13T00:55:55.919030+00:00 · methodology

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Positional Encoding for Neural Vehicle Routing

    cs.AI 2026-05 unverdicted novelty 7.0

    A hierarchical anisometric positional encoding that combines distance-indexed in-route and depot-anchored angular cross-route components improves transformer-based solvers for vehicle routing problems over index-based...

  2. Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases

    cs.LG 2026-05 unverdicted novelty 7.0

    ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.

  3. PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  4. URoPE: Universal Relative Position Embedding across Geometric Spaces

    cs.CV 2026-04 unverdicted novelty 7.0

    URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...

  5. Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

    cs.AI 2026-04 unverdicted novelty 7.0

    Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.

  6. Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

    cs.CL 2026-05 unverdicted novelty 6.0

    Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...

  7. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

  8. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  9. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  10. FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

  11. It Just Takes Two: Scaling Amortized Inference to Large Sets

    cs.LG 2026-05 unverdicted novelty 6.0

    A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of depl...

  12. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.

  13. ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...

  14. The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

    cs.LG 2026-04 unverdicted novelty 6.0

    Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...

  15. LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention

    cs.AI 2026-04 unverdicted novelty 6.0

    LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.

  16. MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

    cs.CL 2026-04 unverdicted novelty 6.0

    MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.

  17. MemGPT: Towards LLMs as Operating Systems

    cs.AI 2023-10 unverdicted novelty 6.0

    MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.

  18. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  19. FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

    cs.LG 2026-05 unverdicted novelty 5.0

    FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...

  20. Decouple and Cache: KV Cache Construction for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.

  21. Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models

    eess.SP 2026-05 unverdicted novelty 5.0

    Adaptive 3D-RoPE adapts rotary positional encoding to wireless channel physics via learnable 3D frequencies and dynamic CSI control, yielding up to 10.7 dB NMSE gains in scale extrapolation and 1 dB in zero-shot tasks.

  22. Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

    cs.CL 2026-04 unverdicted novelty 5.0

    Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.

  23. Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention

    eess.IV 2026-04 unverdicted novelty 5.0

    Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.

  24. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  25. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  26. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 25 Pith papers · 1 internal anchor

  1. [1]

    Adaptive input representations for neural language modeling

    Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018. URL http://arxiv.org/abs/1809.10853

  2. [2]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

  3. [3]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  4. [4]

    Unsupervised cross-lingual representation learning at scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.7...

  5. [5]

    Transformer-XL: Attentive language models beyond a fixed-length context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653...

  6. [6]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapo...

  7. [7]

    Openwebtext corpus

    Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  8. [8]

    Music transformer: Generating music with long-term structure

    Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam M. Shazeer, Andrew M. Dai, M. Hoffman, M. Dinculescu, and D. Eck. Music transformer: Generating music with long-term structure. In ICLR, 2019

  9. [9]

    Compositionality decomposed: How do neural networks generalise?

    Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, April 2020. doi:10.1613/jair.1.11674. URL https://doi.org/10.1613/jair.1.11674

  10. [10]

    Tying word vectors and word classifiers: A loss framework for language modeling

    Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017. URL https://openreview.net/forum?id=r1aPbsFle

  11. [11]

    J. Jumper, Richard Evans, A. Pritzel, Tim Green, Michael Figurnov, O. Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zídek, Anna Potapenko, A. Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, A. Cowie, B. Romera-Paredes, Stanislav Nikolov, Rishub Jain, J. Adler, T. Back, Stig Petersen, D. Reiman, Ellen Clancy, Michal Zielinski, Mart...

  12. [12]

    Generalization through Memorization: Nearest Neighbor Language Models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations (ICLR), 2020

  13. [13]

    Shape: Shifted absolute position embedding for transformers

    Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. Shape: Shifted absolute position embedding for transformers. ArXiv, abs/2109.05644, 2021

  14. [14]

    Andrew Kyle Lampinen, Stephanie C. Y. Chan, Andrea Banino, and Felix Hill. Towards mental time travel: a hierarchical memory for reinforcement learning agents. CoRR, abs/2105.14039, 2021. URL https://arxiv.org/abs/2105.14039

  15. [15]

    Base layers: Simplifying training of large, sparse models

    Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models, 2021

  16. [16]

    Jurassic-1: Technical details and evaluation

    Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021

  17. [17]

    CAPE: encoding relative positions with continuous augmented positional embeddings

    Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve, and Alex Rogozhnikov. CAPE: encoding relative positions with continuous augmented positional embeddings. CoRR, abs/2106.03143, 2021. URL https://arxiv.org/abs/2106.03143

  18. [18]

    Roberta: A robustly optimized bert pretraining approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019

  19. [19]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

  20. [20]

    Tomas Mikolov and G. Zweig. Context dependent recurrent neural network language model. 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234–239, 2012

  21. [21]

    Recurrent neural network based language model

    Tomas Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010

  22. [22]

    Sebastian Nagel. Cc-news. https://commoncrawl.org/2016/10/news-dataset-available/, 2016

  23. [23]

    Do transformer modifications transfer across implementations and applications?

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. Do transformer modifications transfer across implementations and applications?, 2021

  24. [24]

    On the relation between position information and sentence length in neural machine translation

    Masato Neishi and Naoki Yoshinaga. On the relation between position information and sentence length in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 328–338, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/K19-1031. URL https://aclanth...

  25. [25]

    Benjamin Newman, John Hewitt, Percy Liang, and Christopher D. Manning. The eos decision and length extrapolation. In BlackBoxNLP@EMNLP, 2020. URL https://nlp.stanford.edu/pubs/newman2020extrapolation.pdf

  26. [26]

    Rodrigo Nogueira, Zhiying Jiang, and Jimmy J. Li. Investigating the limitations of the transformers with simple arithmetic tasks. ArXiv, abs/2102.13019, 2021

  27. [27]

    Scaling neural machine translation

    Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT), 2018

  28. [28]

    A decomposable attention model for natural language inference

    Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2249–2255, Austin, Texas, November 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1244. URL https://a...

  29. [29]

    Using the output embedding to improve language models

    Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/E17-2025

  30. [30]

    Improving transformer models by reordering their sublayers

    Ofir Press, Noah A. Smith, and Omer Levy. Improving transformer models by reordering their sublayers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2996–3005, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.270. URL https://www.aclweb.org/anthology/2020.acl-main.270

  31. [31]

    Shortformer: Better language modeling using shorter inputs

    Ofir Press, Noah A. Smith, and Mike Lewis. Shortformer: Better language modeling using shorter inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5493–5505, Online, August 2021. Association for Computati...

  32. [32]

    Compressive transformers for long-range sequence modelling

    Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH

  33. [33]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html

  34. [34]

    Analysis of positional encodings for neural machine translation

    Jan Rosendahl, Viet Anh Khoa Tran, Weiyue Wang, and Hermann Ney. Analysis of positional encodings for neural machine translation. In International Workshop on Spoken Language Translation, Hong Kong, China, November 2019

  35. [35]

    Efficient content-based sparse attention with routing transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers, 2020

  36. [36]

    Self-attention with relative position representations

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, June 2018. Association for Computational...

  37. [37]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021

  38. [38]

    A simple method for commonsense reasoning

    Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2018

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:/...

  40. [40]

    GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

    Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

  41. [41]

    The case for translation-invariant self-attention in transformer-based language models

    Ulme Wennberg and Gustav Eje Henter. The case for translation-invariant self-attention in transformer-based language models, 2021

  42. [42]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  43. [43]

    DA-transformer: Distance-aware transformer

    Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. DA-transformer: Distance-aware transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2059–2068, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.166. URL h...

  44. [44]

    Recurrent neural network regularization

    Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization, 2014

  45. [45]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27, 2015