pith. machine review for the scientific record.

arxiv: 2006.04768 · v3 · submitted 2020-06-08 · 💻 cs.LG · stat.ML

Recognition: 3 theorem links · Lean Theorem

Linformer: Self-Attention with Linear Complexity

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:29 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML
keywords self-attention · transformer · linear complexity · low-rank approximation · efficient NLP · sequence length · memory efficiency

The pith

Self-attention in transformers can be approximated by a low-rank matrix to reduce complexity to linear in sequence length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large transformer models achieve strong results on language tasks but are costly to train and deploy because standard self-attention scales quadratically with sequence length. The paper establishes that the attention matrix can be closely approximated by a low-rank form. Projecting the key and value sequences down to a much smaller, fixed dimension before the dot-product step makes the overall computation linear in sequence length. The resulting Linformer model matches the accuracy of the standard transformer on typical benchmarks while using substantially less memory and time.
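
A minimal sketch of that mechanism in code, assuming single-head inputs and hypothetical projection matrices E and F of shape k × n applied to the keys and values; this illustrates the idea described above, not the authors' implementation.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def projected_attention(Q, K, V, E, F):
        """Linformer-style attention sketch for one head.

        Q, K, V: (n, d) inputs; E, F: (k, n) projections for keys and values.
        The score matrix is (n, k) instead of the full (n, n).
        """
        d = Q.shape[-1]
        K_proj = E @ K                        # (k, d) projected keys
        V_proj = F @ V                        # (k, d) projected values
        scores = Q @ K_proj.T / np.sqrt(d)    # (n, k) score matrix
        P = softmax(scores, axis=-1)          # row-stochastic over k slots
        return P @ V_proj                     # (n, d) output

    # Toy shapes for illustration only; k is the fixed projection dimension.
    n, d, k = 1024, 64, 128
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
    print(projected_attention(Q, K, V, E, F).shape)  # (1024, 64)

With these toy shapes the score matrix holds n·k = 131,072 entries instead of n² = 1,048,576, and the gap widens with n because k stays fixed.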

Core claim

The self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.

What carries the argument

Low-rank projection matrices applied to the keys and values before attention; using them replaces the full n-by-n attention matrix with a much smaller n-by-k one, where k is fixed and far smaller than n.

Load-bearing premise

The low-rank projections, whether learned or fixed, retain enough information from the original attention scores for the model to succeed on the tasks and sequence lengths it will see.

What would settle it

If the Linformer shows a clear accuracy gap compared with the standard transformer on a task that uses sequences several times longer than those seen during training, the low-rank approximation would be shown insufficient.
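
A hedged way to make that check concrete before running a full model: measure how the best rank-k approximation error of a softmax attention matrix behaves as n grows with k held fixed. The NumPy sketch below uses random queries and keys, so it only illustrates the diagnostic, not the behavior of trained attention.

    import numpy as np

    def attention_matrix(n, d, rng):
        """Row-stochastic attention matrix softmax(QK^T / sqrt(d)) for random Q, K."""
        Q = rng.standard_normal((n, d))
        K = rng.standard_normal((n, d))
        S = Q @ K.T / np.sqrt(d)
        S -= S.max(axis=-1, keepdims=True)
        E = np.exp(S)
        return E / E.sum(axis=-1, keepdims=True)

    def rank_k_rel_error(P, k):
        """Relative Frobenius error of the best rank-k approximation (Eckart-Young)."""
        s = np.linalg.svd(P, compute_uv=False)
        return np.sqrt((s[k:] ** 2).sum() / (s ** 2).sum())

    rng = np.random.default_rng(0)
    k, d = 128, 64
    for n in (256, 512, 1024, 2048):
        P = attention_matrix(n, d, rng)
        print(n, round(rank_k_rel_error(P, k), 4))

A relative error that climbs with n would indicate that a fixed k cannot keep pace with the effective rank of attention at longer lengths, which is the failure mode the criterion above is after.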

read the original abstract

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that self-attention in Transformers can be approximated via low-rank projections on the key and value matrices (using fixed or learned E, F matrices of size k x n with k << n), reducing attention complexity from O(n²) to O(n) in time and space. The resulting Linformer model is shown to achieve competitive performance with standard Transformers on GLUE, WikiText-103, and machine translation benchmarks while being more memory- and time-efficient.

Significance. If the empirical claims hold under broader validation, this is a significant contribution to efficient sequence modeling. It offers a practical architectural change that preserves the core attention mechanism while delivering linear scaling, which is valuable for long-context applications. The algebraic correctness of the low-rank rewriting and the competitive numbers on public NLP benchmarks are strengths; the work provides a clear efficiency gain without requiring entirely new attention formulations.

major comments (3)
  1. [§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.
  2. [§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously; such ablations would also directly test the weakest assumption about projection sufficiency.
  3. [§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.
minor comments (2)
  1. [Figure 1] Figure 1 and the surrounding text could include a small diagram explicitly showing the shapes of E and F and how they are applied to K and V.
  2. [§3.2] The complexity analysis in §3.2 would benefit from an explicit step-by-step derivation of the O(n) claim, including the cost of the projections themselves (a sketch of such an accounting follows this list).
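
A sketch of how that accounting could go, using per-head shapes consistent with the summary above (Q, K, V of size n × d; projections E, F of size k × n); this is an illustrative tally, not the paper's own derivation:

    \[
    \underbrace{O(nkd)}_{\bar{K} = EK,\ \bar{V} = FV}
    \;+\; \underbrace{O(nkd)}_{Q\bar{K}^{\top}}
    \;+\; \underbrace{O(nk)}_{\mathrm{softmax}}
    \;+\; \underbrace{O(nkd)}_{P\bar{V}}
    \;=\; O(nkd)
    \]

With k and d held fixed this is linear in n, against O(n²d) for the standard QKᵀ and PV products; the projections themselves contribute only another O(nkd) term rather than changing the asymptotics.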

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We provide point-by-point responses to the major comments below, indicating the revisions we intend to make.

read point-by-point responses
  1. Referee: [§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.

    Authors: We appreciate this observation. While the manuscript does not include a formal error bound, we provide empirical analysis demonstrating that attention matrices exhibit low effective rank, justifying the projection (see the singular value plots in the paper). The performance retention is validated across multiple tasks. In revision, we will add further discussion on how the approximation error scales with k and n based on these observations, though a complete theoretical bound remains an open question for future work. revision: partial

  2. Referee: [§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously and directly tests the weakest assumption about projection sufficiency.

    Authors: We agree that multiple random seeds and ablations would enhance the rigor. Our reported results follow the single-run convention common for such large-scale experiments due to resource constraints. We will rerun key experiments with multiple seeds to report means and standard deviations, and include ablations on the projection dimension k for different n and tasks in the revised manuscript. revision: yes

  3. Referee: [§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.

    Authors: This point highlights an important aspect of generalization. The current experiments adhere to the standard fixed-length settings of the benchmarks. We will extend the evaluation in the revision to include tests with longer sequences and some domain shifts to verify that the learned projections maintain effectiveness when the attention rank increases with n. revision: yes

standing simulated objections not resolved
  • Providing a formal error bound or complete theoretical analysis of the approximation error's dependence on n, k, and effective rank

Circularity Check

0 steps flagged

No significant circularity; architectural proposal with independent empirical validation

full rationale

The Linformer derivation proposes an explicit architectural change—projecting the key and value matrices via learned low-rank matrices E and F of size k x n (k << n)—to approximate the O(n^2) attention matrix with O(n) complexity. This is not obtained by fitting parameters to a target quantity and then renaming the fit as a prediction, nor by self-referential definitions or load-bearing self-citations. The low-rank property is motivated by empirical observation of attention matrices but the method itself is a constructive proposal whose performance is measured on held-out public benchmarks (e.g., GLUE, SQuAD) with standard training protocols. No equation reduces to its own input by construction, and the central claim retains independent content beyond any cited prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that attention matrices are approximately low-rank and on the assumption that a small, fixed projection dimension k (with fixed or learned projection matrices) suffices for downstream tasks. No new physical entities or unproven mathematical axioms are introduced.

free parameters (1)
  • projection dimension k
    Chosen by the authors (typically 128 or 256) and controls the quality-efficiency trade-off; its value is not derived from first principles. A back-of-the-envelope illustration of the trade-off follows this ledger.
axioms (1)
  • domain assumption: The attention matrix admits a useful low-rank approximation for the tasks considered.
    Invoked in Section 3 to justify the projection; no proof is given that this holds for arbitrary sequences or domains.
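
A back-of-the-envelope reading of that trade-off, using a hypothetical Python tally with illustrative shapes rather than numbers from the paper: compare the per-head score-matrix size and matmul cost of standard attention against the projected form for a few values of k.

    def attention_cost(n, d, k=None):
        """Rough matmul FLOPs and score-matrix entries for one attention head."""
        if k is None:
            # standard attention: Q @ K^T and P @ V, each about n*n*d multiply-adds
            return 2 * n * n * d, n * n
        # projected attention: E @ K, F @ V, Q @ K_proj^T, P @ V_proj
        return 4 * n * k * d, n * k

    n, d = 4096, 64
    full_flops, full_entries = attention_cost(n, d)
    for k in (64, 128, 256):
        flops, entries = attention_cost(n, d, k)
        print(f"k={k}: score matrix {full_entries // entries}x smaller, "
              f"matmul FLOPs {full_flops / flops:.1f}x fewer")

Larger k narrows the expressiveness gap to full attention but erodes the memory and compute savings, which is exactly the quality-efficiency knob this ledger flags.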

pith-pipeline@v0.9.0 · 5433 in / 1290 out tokens · 46410 ms · 2026-05-12T00:29:37.213350+00:00 · methodology

discussion (0)

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Convergent Stochastic Training of Attention and Understanding LoRA

    cs.LG 2026-05 unverdicted novelty 8.0

    Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.

  2. Nearly Optimal Attention Coresets

    cs.DS 2026-05 unverdicted novelty 8.0

    ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

  3. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    cs.CL 2023-08 unverdicted novelty 8.0

    LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

  4. ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection

    cs.LG 2026-05 conditional novelty 7.0

    ASAP amortizes Sinkhorn-based doubly-stochastic attention by learning a parametric map from 1D potentials to the Sinkhorn dual and reconstructing the plan via two-sided entropic c-transform, delivering 5.3x faster inf...

  5. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  6. Projection-Free Transformers via Gaussian Kernel Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.

  7. Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

    cs.LG 2026-04 unverdicted novelty 7.0

    Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

  8. Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

    cs.LG 2026-04 unverdicted novelty 7.0

    Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

  9. Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors

    cs.LG 2026-04 unverdicted novelty 7.0

    NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.

  10. Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...

  11. Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

    cs.LG 2026-04 unverdicted novelty 7.0

    Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.

  12. Collapse-Free Prototype Readout Layer for Transformer Encoders

    cs.LG 2026-04 unverdicted novelty 7.0

    DDCL-Attention introduces a collapse-free prototype readout for transformers that decomposes the training loss exactly into reconstruction and diversity terms while providing stability guarantees via singular perturba...

  13. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  14. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  15. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    cs.CV 2024-01 conditional novelty 7.0

    Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.

  16. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    cs.LG 2022-05 accept novelty 7.0

    FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...

  17. Rethinking Attention with Performers

    cs.LG 2020-09 unverdicted novelty 7.0

    Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...

  18. SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

  19. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  20. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  21. Nectar: Neural Estimation of Cached-Token Attention via Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.

  22. Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns

    cs.LG 2026-05 unverdicted novelty 6.0

    A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.

  23. Gated Subspace Inference for Transformer Acceleration

    cs.LG 2026-05 unverdicted novelty 6.0

    Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.

  24. Stochastic Sparse Attention for Memory-Bound Inference

    cs.LG 2026-05 accept novelty 6.0

    SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching bas...

  25. Linear-Time Global Visual Modeling without Explicit Attention

    cs.CV 2026-05 unverdicted novelty 6.0

    Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.

  26. GateMOT: Q-Gated Attention for Dense Object Tracking

    cs.CV 2026-04 unverdicted novelty 6.0

    GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.

  27. ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

    cs.LG 2026-04 unverdicted novelty 6.0

    ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.

  28. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  29. DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...

  30. RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems

    cs.IR 2026-04 unverdicted novelty 6.0

    RankUp raises effective rank of representations in deep MetaFormer recommenders via randomized splitting and multi-embeddings, delivering 2-5% GMV gains in production deployments at Weixin.

  31. On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation

    cs.SE 2026-04 unverdicted novelty 6.0

    Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.

  32. Tracing the Chain: Deep Learning for Stepping-Stone Intrusion Detection

    cs.CR 2026-04 unverdicted novelty 6.0

    ESPRESSO achieves over 0.99 true positive rate at 10^{-3} false positive rate for stepping-stone intrusion detection on synthetic data for SSH, SOCAT, ICMP, DNS and mixed protocols, outperforming DeepCoFFEA while also...

  33. PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

    cs.CV 2026-04 unverdicted novelty 6.0

    PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

  34. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  35. YOLOv12: Attention-Centric Real-Time Object Detectors

    cs.CV 2025-02 unverdicted novelty 6.0

    YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

  36. SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    cs.CV 2024-10 unverdicted novelty 6.0

    Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

  37. MemGPT: Towards LLMs as Operating Systems

    cs.AI 2023-10 unverdicted novelty 6.0

    MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.

  38. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    cs.LG 2023-07 accept novelty 6.0

    FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.

  39. Token Merging: Your ViT But Faster

    cs.CV 2022-10 unverdicted novelty 6.0

    Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.

  40. USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

    cs.CV 2026-05 unverdicted novelty 5.0

    USEMA is a hybrid UNet architecture merging CNNs with scalable Mamba-like attention (SEMA) that achieves better efficiency than transformers and superior segmentation accuracy than pure CNN or Mamba models across medi...

  41. PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay

    cs.LG 2026-05 unverdicted novelty 5.0

    PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupli...

  42. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  43. Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation

    stat.ML 2026-05 unverdicted novelty 5.0

    Nonconvex low-rank matrix estimation procedures are shown to be equivalent to locally strongly convex formulations via a benign regularizer that does not change the algorithm's update rule.

  44. Cascade Token Selection for Transformer Attention Acceleration

    cs.LG 2026-05 unverdicted novelty 5.0

    Cascade token selection inherits and updates a small set of representative tokens across layers using cross-Gram validation, reducing selection cost from O(T²d) to O(Trd) per layer with observed Gram savings of 22-63%...

  45. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  46. Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

    cs.MM 2026-04 unverdicted novelty 5.0

    A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

  47. RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems

    cs.IR 2026-04 unverdicted novelty 5.0

    RankUp enhances representation capacity in deep MetaFormer recommenders via permutation splitting and multi-embeddings, achieving GMV improvements of 2-5% in Weixin production systems.

  48. Sinkhorn doubly stochastic attention rank decay analysis

    cs.LG 2026-04 unverdicted novelty 4.0

    Sinkhorn-normalized doubly stochastic attention preserves rank more effectively than Softmax row-stochastic attention, with both showing doubly exponential rank decay to one with network depth.

  49. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 48 Pith papers · 11 internal anchors

  1. [1]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,

  2. [2]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165,

  3. [3]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174,

  4. [4]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,

  5. [5]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186,

  6. [6]

    Training with Quantization Noise for Extreme Fixed-Point Compression

    Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, and Armand Joulin. Training with quantization noise for extreme fixed-point compression. arXiv preprint arXiv:2004.07320,

  7. [7]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

  8. [8]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,

  9. [9]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

  10. [10]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740,

  11. [11]

    Transformers with Convolutional Context for ASR

    Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer. Transformers with convolutional context for asr. arXiv preprint arXiv:1904.11660,

  12. [12]

    fairseq: A fast, extensible toolkit for sequence modeling

    Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53,

  13. [13]

    Blockwise self-attention for long document understanding

    Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972,

  14. [14]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683,

  15. [15]

    Squad: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392,

  16. [16]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

  17. [17]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642,

  18. [19]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. URL http://arxiv.org/abs/1804.07461.
    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27.

  19. [20]

    Johnson–Lindenstrauss (JL) lemma, in the version of Arriaga & Vempala (2006)