Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Recognition: 3 theorem links
Pith reviewed 2026-05-13 00:55 UTC · model grok-4.3
The pith
Attention with linear biases enables transformer models to extrapolate to input sequences twice as long as seen in training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adding a fixed negative slope bias to query-key attention scores based on token distance, ALiBi lets models train on length 1024 and extrapolate to length 2048 while matching the perplexity of models trained directly on the longer length.
What carries the argument
Attention with Linear Biases (ALiBi): a bias term subtracted from attention scores that grows linearly with the distance between each query and key position.
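To make the mechanism concrete, here is a minimal sketch (not the paper's reference implementation) of a causal single-head attention step with an ALiBi-style linear penalty; the function names and the slope value are illustrative assumptions.

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Bias matrix: 0 on the diagonal, -slope * (query index - key index) for past keys."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]           # i - j for query i, key j
    return -slope * np.maximum(distance, 0)          # penalty grows linearly with distance

def attention_with_alibi(q, k, v, slope):
    """Causal single-head scaled dot-product attention with the linear bias added pre-softmax."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (seq_len, seq_len) query-key scores
    scores = scores + alibi_bias(q.shape[0], slope)  # subtract slope * distance
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # causal mask: no attention to future keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With a slope of 0.5 and four positions, the last query sees biases of 0, -0.5, -1.0, and -1.5 on progressively more distant keys, so the softmax is tilted toward recent tokens without any positional embeddings in the input.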
Load-bearing premise
A single fixed linear bias slope applied to attention scores is sufficient to produce reliable extrapolation across model sizes and sequence lengths without further changes to the model or training.
What would settle it
Train an ALiBi model on length 1024 and evaluate on length 2048; if its perplexity exceeds that of a sinusoidal model trained and tested on length 2048, the extrapolation claim fails.
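That check can be phrased as a short, hedged evaluation sketch. The `nll` helper below is a hypothetical stand-in for whatever summed negative log-likelihood a given framework exposes; only the comparison logic follows the criterion above.

```python
import math

def perplexity(model, tokens, context_len, nll):
    """Perplexity over non-overlapping windows of length context_len.

    nll(model, window) is assumed to return the summed per-token negative
    log-likelihood of the window under the model.
    """
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(tokens) - context_len + 1, context_len):
        window = tokens[start:start + context_len]
        total_nll += nll(model, window)
        total_tokens += len(window)
    return math.exp(total_nll / total_tokens)

# Hypothetical handles for the two models the criterion compares:
# ppl_alibi = perplexity(alibi_model_trained_at_1024, valid_tokens, 2048, nll)
# ppl_sin   = perplexity(sinusoidal_model_trained_at_2048, valid_tokens, 2048, nll)
# The extrapolation claim fails if ppl_alibi exceeds ppl_sin beyond run-to-run noise.
```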
Original abstract
Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Attention with Linear Biases (ALiBi), a position method that adds a fixed linear penalty to query-key attention scores proportional to their distance, rather than injecting positional embeddings into the input. It claims this enables length extrapolation: a 1.3B-parameter model trained on sequences of length 1024 with ALiBi achieves the same perplexity on length-2048 inputs as a sinusoidal-embedding model trained directly on 2048-length sequences, while training 11% faster and using 11% less memory. ALiBi is also reported to outperform several strong position baselines on WikiText-103 due to its recency bias.
Significance. If the empirical claims hold under broader conditions, the result is significant for practical scaling of language models: it offers a lightweight way to decouple training length from inference length, yielding concrete efficiency gains without architectural changes. The approach is simple to implement and the reported speed/memory savings plus benchmark improvements provide a falsifiable, reproducible contribution to the position-embedding literature.
major comments (3)
- [§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n) (for head h of n heads) is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This directly supports the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven. (A sketch of the slope schedule follows the comment lists.)
- [§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.
- [§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.
minor comments (3)
- [Figure 2] Figure 2 (attention visualization): the color scale and axis labels are not defined in the caption, making it hard to interpret the claimed recency bias.
- [§2] Related-work section (§2): the discussion of prior linear-bias or distance-based attention methods omits several recent works on relative position representations that appeared after Vaswani et al. (2017).
- [§3] Notation: the symbol m_h is introduced without an explicit equation number; adding 'Eq. (3)' would improve readability when the slope formula is referenced later.
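For reference, a minimal sketch of the head-specific slope schedule the major and minor comments refer to, assuming the paper's geometric construction in which the slopes for n heads start at 2^(-8/n) and use that same value as the ratio (the exact rule used for head counts that are not powers of two may differ):

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric slope schedule: m_h = 2 ** (-8 * h / n_heads) for h = 1..n_heads."""
    ratio = 2.0 ** (-8.0 / n_heads)
    return [ratio ** h for h in range(1, n_heads + 1)]

# For 8 heads this gives 1/2, 1/4, ..., 1/256; the same fixed rule would be reused
# across model sizes, which is exactly what the major comment asks to stress-test.
slopes = alibi_slopes(8)
assert abs(slopes[0] - 0.5) < 1e-12 and abs(slopes[-1] - 2 ** -8) < 1e-12
```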
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical support and reproducibility of the work.
Point-by-point responses
-
Referee: [§3] §3 (ALiBi definition): the slope schedule m_h = 2^(-8h/n) (for head h of n heads) is presented as a one-time fixed heuristic requiring no per-model retuning, yet no ablation or sensitivity analysis is shown for changes in head count, hidden dimension, or extrapolation ratio (e.g., 1024→4096). This directly supports the central claim that 'a fixed linear bias suffices' and that the method needs 'no further model changes'; without such evidence the simplicity advantage remains unproven.
Authors: We appreciate this observation. The slope schedule was derived from preliminary experiments on smaller models and then applied without modification to all larger models reported in the paper. To directly address the request for sensitivity analysis, the revised manuscript will include new results (in an expanded §3 or appendix) testing the same fixed schedule across varying head counts (8–32 heads), model dimensions, and extrapolation ratios up to 4× on models up to 350M parameters; a sketch of such a grid appears after these responses. These additions will provide concrete evidence that the heuristic generalizes without per-model retuning. revision: yes
-
Referee: [§4] §4 (main 1.3B extrapolation experiment): the headline result (perplexity parity with sinusoidal-2048, 11% faster training, 11% less memory) is given without data-split details, hyperparameter grids, number of random seeds, or error bars. Because the comparison is purely empirical and the slope choice is itself a hyperparameter, these omissions make it impossible to verify whether the reported gains are robust or sensitive to training dynamics.
Authors: We agree that greater experimental transparency is needed. The revision will add the precise training corpus composition and data splits, the full hyperparameter configuration for the 1.3B model, and an explicit statement that the large-model runs were performed once owing to compute cost. We will also note that smaller-scale ablations (reported in the appendix) were repeated with multiple seeds and exhibited the same qualitative trends. We cannot, however, supply error bars for the 1.3B setting itself. revision: partial
-
Referee: [§4.2] §4.2 (WikiText-103 results): the claim that ALiBi 'outperforms multiple strong position methods' is load-bearing for the inductive-bias argument, but the manuscript does not state whether the sinusoidal and other baselines were also trained at 1024 tokens or at the full evaluation length; this ambiguity weakens the cross-method comparison.
Authors: We thank the referee for catching this ambiguity. All models—including the sinusoidal, rotary, and learned-position baselines—were trained with a maximum sequence length of 1024 tokens. WikiText-103 evaluation used the test set’s native (sometimes longer) sequences to measure extrapolation, but training length was identical across methods. The revised §4.2 will state this explicitly, removing any possibility of misinterpretation. revision: yes
- The 1.3B-parameter experiments were run with only a single random seed due to prohibitive computational cost; consequently we cannot supply error bars or quantify sensitivity to initialization for the headline result.
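Per the authors' commitment in the first response above, here is an illustrative enumeration of such a sensitivity grid; the head-count and extrapolation-ratio ranges come from the rebuttal, while the hidden-dimension values and the `train_and_eval` harness are placeholders.

```python
from itertools import product

head_counts = [8, 16, 32]            # rebuttal: "8–32 heads"
hidden_dims = [512, 1024, 2048]      # placeholder model dimensions
extrap_ratios = [1, 2, 4]            # evaluate at train_len * ratio, up to 4x
train_len = 1024

for n_heads, d_model, ratio in product(head_counts, hidden_dims, extrap_ratios):
    eval_len = train_len * ratio
    # Hypothetical harness: train at train_len with the fixed slope schedule for
    # n_heads, then report validation perplexity at eval_len.
    # ppl = train_and_eval(n_heads=n_heads, d_model=d_model,
    #                      train_len=train_len, eval_len=eval_len)
```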
Circularity Check
No significant circularity in the empirical evaluation of ALiBi.
Full rationale
The paper introduces ALiBi as a position method that adds a fixed linear bias to attention scores and demonstrates its effectiveness through direct empirical comparison: a 1.3B model trained at length 1024 achieves equivalent perplexity on length 2048 to a sinusoidal baseline trained at 2048, with reported efficiency gains. The slope schedule is a fixed, predetermined choice (geometric progression across heads) presented as part of the method definition rather than fitted to the extrapolation results themselves. No equation or claim reduces the reported perplexity values to a parameter or quantity defined by the same experiment, and the central result remains an independent experimental outcome rather than a tautology.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear bias slope
axioms (1)
- domain assumption: Adding a fixed linear penalty to query-key dot products preserves the core attention mechanism and training dynamics of the transformer.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcing.eight_tick_forces_D3 (relevance unclear). Linked claim: "We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory."
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (relevance unclear). Linked claim: "ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark."
Forward citations
Cited by 28 Pith papers
-
Rethinking Positional Encoding for Neural Vehicle Routing
A hierarchical anisometric positional encoding that combines distance-indexed in-route and depot-anchored angular cross-route components improves transformer-based solvers for vehicle routing problems over index-based...
-
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
-
PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
URoPE: Universal Relative Position Embedding across Geometric Spaces
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...
-
Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.
-
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...
-
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
-
Remember to Forget: Gated Adaptive Positional Encoding
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
-
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
-
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
-
It Just Takes Two: Scaling Amortized Inference to Large Sets
A mean-pool deep set trained on sets of size at most two produces an encoder that generalizes to arbitrary sizes, decoupling representation learning from posterior modeling and making training cost independent of depl...
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.
-
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
-
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-mat...
-
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.
-
MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation
MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.
-
MemGPT: Towards LLMs as Operating Systems
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...
-
Decouple and Cache: KV Cache Construction for Streaming Video Understanding
DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.
-
Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models
Adaptive 3D-RoPE adapts rotary positional encoding to wireless channel physics via learnable 3D frequencies and dynamic CSI control, yielding up to 10.7 dB NMSE gains in scale extrapolation and 1 dB in zero-shot tasks.
-
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
-
Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention
Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
-
[1]
Adaptive input representations for neural language modeling
Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018. URL http://arxiv.org/abs/1809.10853
-
[2]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020
-
[3]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
-
[4]
Unsupervised cross-lingual representation learning at scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.7...
-
[5]
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978-2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653...
-
[6]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186, Minneapo...
-
[7]
Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019
-
[8]
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam M. Shazeer, Andrew M. Dai, M. Hoffman, M. Dinculescu, and D. Eck. Music transformer: Generating music with long-term structure. In ICLR, 2019
-
[9]
Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757-795, April 2020. doi:10.1613/jair.1.11674. URL https://doi.org/10.1613/jair.1.11674
-
[10]
Tying word vectors and word classifiers: A loss framework for language modeling
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In ICLR, 2017. URL https://openreview.net/forum?id=r1aPbsFle
-
[11]
J. Jumper, Richard Evans, A. Pritzel, Tim Green, Michael Figurnov, O. Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, A. Bridgland, Clemens Meyer, Simon A. A. Kohl, Andy Ballard, A. Cowie, B. Romera-Paredes, Stanislav Nikolov, Rishub Jain, J. Adler, T. Back, Stig Petersen, D. Reiman, Ellen Clancy, Michal Zielinski, Mart...
-
[12]
Generalization through Memorization: Nearest Neighbor Language Models
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through Memorization: Nearest Neighbor Language Models . In International Conference on Learning Representations (ICLR), 2020
-
[13]
Shape: Shifted absolute position embedding for transformers
Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. Shape: Shifted absolute position embedding for transformers. ArXiv, abs/2109.05644, 2021
- [14]
-
[15]
Base layers: Simplifying training of large, sparse models, 2021
Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models, 2021
-
[16]
Jurassic-1: Technical details and evaluation
Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021
-
[17]
CAPE: encoding relative positions with continuous augmented positional embeddings
Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve, and Alex Rogozhnikov. CAPE: encoding relative positions with continuous augmented positional embeddings. CoRR, abs/2106.03143, 2021. URL https://arxiv.org/abs/2106.03143
-
[18]
Roberta: A robustly optimized bert pretraining approach, 2019
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019
-
[19]
Pointer sentinel mixture models, 2016
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016
-
[20]
Tomas Mikolov and G. Zweig. Context dependent recurrent neural network language model. 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234-239, 2012
-
[21]
Tomas Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010
-
[22]
Sebastian Nagel. Cc-news. https://commoncrawl.org/2016/10/news-dataset-available/, 2016
-
[23]
Do transformer modifications transfer across implementations and applications?, 2021
Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. Do transformer modifications transfer across implementations and applications?, 2021
-
[24]
On the relation between position information and sentence length in neural machine translation
Masato Neishi and Naoki Yoshinaga. On the relation between position information and sentence length in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 328-338, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/K19-1031. URL https://aclanth...
-
[25]
Benjamin Newman, John Hewitt, Percy Liang, and Christopher D. Manning. The eos decision and length extrapolation. In BlackBoxNLP@EMNLP, 2020. URL https://nlp.stanford.edu/pubs/newman2020extrapolation.pdf
- [26]
-
[27]
Scaling neural machine translation
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT), 2018
-
[28]
Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2249-2255, Austin, Texas, November 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1244. URL https://a...
-
[29]
Using the output embedding to improve language models
Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157-163, Valencia, Spain, April 2017. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/E17-2025
-
[30]
Ofir Press, Noah A. Smith, and Omer Levy. Improving transformer models by reordering their sublayers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2996-3005, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.270. URL https://www.aclweb.org/anthology/2020.acl-main.270
-
[31]
Ofir Press, Noah A. Smith, and Mike Lewis. Shortformer: Better language modeling using shorter inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5493-5505, Online, August 2021. Association for Computati...
-
[32]
Compressive transformers for long-range sequence modelling
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH
-
[33]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020. URL http://jmlr.org/papers/v21/20-074.html
-
[34]
Analysis of positional encodings for neural machine translation
Jan Rosendahl, Viet Anh Khoa Tran, Weiyue Wang, and Hermann Ney. Analysis of positional encodings for neural machine translation. In International Workshop on Spoken Language Translation, Hong Kong, China, November 2019
-
[35]
Efficient content-based sparse attention with routing transformers, 2020
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers, 2020
-
[36]
Self-attention with relative position representations
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464-468, New Orleans, Louisiana, June 2018. Association for Computational...
-
[37]
Roformer: Enhanced transformer with rotary position embedding, 2021
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021
-
[38]
Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2018
-
[39]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https:/...
-
[40]
GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model
Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021
-
[41]
The case for translation-invariant self-attention in transformer-based language models, 2021
Ulme Wennberg and Gustav Eje Henter. The case for translation-invariant self-attention in transformer-based language models, 2021
-
[42]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...
-
[43]
DA-Transformer: Distance-aware transformer
Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. DA-Transformer: Distance-aware transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2059-2068, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.166. URL h...
-
[44]
Recurrent neural network regularization, 2014
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization, 2014
-
[45]
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19-27, 2015