pith. sign in

hub Canonical reference

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Canonical reference. 71% of citing Pith papers cite this work as background.

66 Pith papers citing it
Background 71% of classified citations
abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.

hub tools

citation-role summary

background 4 method 2 other 1

citation-polarity summary

clear filters

representative citing papers

Forget Attention: Importance-Aware Attention Is All You Need

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

SISA adds an SSM importance term inside the attention score and runs the full operation as one SDPA call on augmented Q/K vectors, reporting better LAMBADA and perfect NIAH at small scale.

Rethinking Positional Encoding for Neural Vehicle Routing

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

A hierarchical anisometric positional encoding that combines distance-indexed in-route and depot-anchored angular cross-route components improves transformer-based solvers for vehicle routing problems over index-based alternatives.

Group Representational Position Encoding

cs.LG · 2025-12-08 · unverdicted · novelty 7.0

GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

Exact Sequence Interpolation with Transformers

cs.LG · 2025-02-04 · conditional · novelty 7.0

Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Chiaroscuro Attention: Spending Compute in the Dark

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

CHIAR-Former routes tokens via spectral entropy to DCT mixing or attention, yielding 35-40% FLOP savings at 400M parameters with modest perplexity increase on WikiText-103.

Pretraining Recurrent Networks without Recurrence

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation cs.LG · 2026-05-06 · unverdicted · none · ref 17 · 2 links · internal anchor

    FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.

  • A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 216 · internal anchor

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.