pith. machine review for the scientific record.

arxiv: 2310.01889 · v4 · submitted 2023-10-03 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, Pieter Abbeel

Pith reviewed 2026-05-12 19:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords ring attention · blockwise transformers · long context · distributed attention · self-attention · language modeling · reinforcement learning

The pith

Ring attention distributes sequences across devices to reach lengths proportional to device count without approximations or added overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers face memory limits that restrict sequence length, making it hard to use long-form inputs such as videos or extended action sequences in complex environments. The paper introduces Ring Attention with Blockwise Transformers, which partitions sequences into blocks and circulates key-value blocks around a ring of devices. Blockwise self-attention and feedforward computations run locally on each device while communication overlaps completely with computation. This setup supports sequences up to device-count times longer than previous memory-efficient methods allow. Experiments on language modeling and reinforcement learning confirm the approach reaches millions of tokens while preserving exact attention.

Core claim

We present Ring Attention with Blockwise Transformers, which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads.

What carries the argument

Ring Attention with Blockwise Transformers: partitions the sequence into blocks, computes attention and feedforward locally on each device, and passes key-value blocks around a ring topology so that communication overlaps fully with local computation.
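The mechanism is concrete enough to sketch. Below is a minimal single-process Python illustration, not the paper's JAX implementation: the ring rotation of key-value blocks is simulated by indexing rather than by device-to-device sends, and the streaming-softmax accumulators show why visiting KV blocks one at a time still yields exact attention.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: exact (non-causal) softmax attention over the full sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v, num_devices):
    """Blockwise attention with a simulated ring of KV blocks.

    Each "device" i holds query block i and, over num_devices ring steps,
    sees every KV block exactly once. A streaming (online) softmax keeps
    the result exact without ever materializing the full score matrix.
    """
    qs, ks, vs = (np.split(x, num_devices) for x in (q, k, v))
    out = []
    for i in range(num_devices):
        m = l = acc = None  # running max, normalizer, weighted value sum
        for step in range(num_devices):
            # In the real algorithm this block arrives from the ring
            # neighbor while the previous block is being processed;
            # here the "communication" is just indexing.
            j = (i + step) % num_devices
            s = qs[i] @ ks[j].T / np.sqrt(q.shape[-1])
            m_blk = s.max(axis=-1, keepdims=True)
            if m is None:
                m = m_blk
                p = np.exp(s - m)
                l, acc = p.sum(axis=-1, keepdims=True), p @ vs[j]
            else:
                m_new = np.maximum(m, m_blk)
                scale = np.exp(m - m_new)  # rescale old accumulators
                p = np.exp(s - m_new)
                l = l * scale + p.sum(axis=-1, keepdims=True)
                acc = acc * scale + p @ vs[j]
                m = m_new
        out.append(acc / l)
    return np.concatenate(out)
```

On random inputs this matches `full_attention` to floating-point precision, which is the "exact attention, no approximation" property the pith highlights.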

If this is right

  • Training and inference become feasible for sequences millions of tokens long.
  • Exact attention is preserved without approximations such as sparsity or low-rank methods.
  • No additional communication volume or computation is required beyond standard block operations.
  • Performance gains appear on long-context language modeling and reinforcement learning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other sequence models that rely on attention by applying the same block-ring pattern.
  • Linear scaling with device count suggests that larger clusters would directly yield proportionally longer usable contexts.
  • If overlap remains perfect at scale, hybrid systems combining ring attention with other parallelism techniques could reach even greater lengths.

Load-bearing premise

Blockwise attention and ring communication can be implemented with perfect overlap and no hidden synchronization or memory costs on real hardware and software stacks.

What would settle it

Measure wall-clock time and memory usage when scaling from one device with a baseline-length sequence to N devices with an N-times-longer sequence; any deviation from linear scaling in length or from zero extra overhead would falsify the claim.
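That experiment can be framed with a back-of-envelope cost model. The sketch below uses illustrative formulas, not measurements; `ring_attention_costs` and its constants are hypothetical. It predicts the signature linear scaling should leave: per-device activation memory stays flat as devices and sequence length grow together, while per-device attention FLOPs grow linearly.

```python
def ring_attention_costs(seq_len, num_devices, d_model, bytes_per_elem=2):
    """Back-of-envelope per-device costs for one exact-attention layer.

    Hypothetical cost model, not a measurement. Each device stores its
    query block plus two key-value blocks (the resident one and the one
    arriving from the ring), so activation memory is independent of the
    number of devices; compute still scans the whole sequence, so FLOPs
    per device grow with total length.
    """
    b = seq_len // num_devices                # per-device block length
    kv_block = 2 * b * d_model                # one K block + one V block
    mem_bytes = (b * d_model + 2 * kv_block) * bytes_per_elem
    flops = 4 * b * seq_len * d_model         # QK^T and PV, 2*b*L*d each
    return mem_bytes, flops
```

Doubling devices and sequence length together leaves `mem_bytes` unchanged and doubles `flops`; a real benchmark that deviates from that pattern in wall-clock time or memory would be evidence against the zero-overhead claim.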

read the original abstract

Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Ring Attention with Blockwise Transformers, an algorithmic construction that partitions long input sequences into blocks and distributes them across devices arranged in a ring topology. Blockwise self-attention and feed-forward computations are performed locally while key-value blocks are streamed around the ring; the central claim is that this fully overlaps communication with computation, enabling exact (non-approximate) attention on sequences up to N times longer than single-device limits on N devices, with no additional communication or computation overheads. Experiments on language modeling and reinforcement learning tasks are asserted to demonstrate effectiveness at million-token context sizes and performance gains.

Significance. If the zero-overhead and perfect-overlap claims hold under realistic hardware conditions, the method would provide a practical, exact-attention route to scaling context length linearly with device count. This is a meaningful advance over both memory-bound standard Transformers and approximation-based long-context techniques, with potential impact on long-form language modeling, video understanding, and sequential RL. The construction is parameter-free and does not introduce new learned components.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the manuscript states that 'extensive experiments ... demonstrate the effectiveness' and 'no extra overhead,' yet provides no quantitative baselines, throughput numbers, memory scaling curves, or comparisons against prior memory-efficient attention implementations (e.g., FlashAttention or standard ring-allreduce attention). Without these data the central 'no additional overheads' claim cannot be assessed.
  2. [Method] Method (blockwise formulation): the claim that ring KV communication is 'fully overlapping' the blockwise attention and FFN compute is presented as an identity, but no analysis or bounds are given on when local compute time exceeds ring latency for given model dimension, block size, or interconnect bandwidth. If the assumption fails, residual synchronization or idle time would scale with device count and violate the zero-overhead guarantee.
minor comments (1)
  1. [Notation / Method] Clarify in the notation section how block size is chosen relative to total sequence length and device count, and whether it must be uniform across devices.
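A first-order version of the bound the referee asks for in major comment 2 is easy to state. Assuming per-ring-step compute of roughly 4·b²·d FLOPs (QKᵀ plus PV against one key-value block) and a per-step transfer of 2·b·d elements, overlap holds when compute time covers transfer time, and the model dimension d cancels. The helper below is a hypothetical sketch of that arithmetic, not an analysis taken from the paper.

```python
def min_overlap_block_size(peak_flops, link_bandwidth, bytes_per_elem=2):
    """Smallest block size b for which ring KV transfer can hide behind
    blockwise attention compute, under a deliberately crude model:

      compute per ring step  ~ 4 * b^2 * d FLOPs   (QK^T and PV)
      transfer per ring step ~ 2 * b * d * bytes_per_elem bytes

    Overlap needs 4*b^2*d / peak_flops >= 2*b*d*bytes / link_bandwidth,
    i.e. b >= peak_flops * bytes_per_elem / (2 * link_bandwidth).
    """
    return peak_flops * bytes_per_elem / (2 * link_bandwidth)
```

For a hypothetical accelerator at 300 TFLOP/s with a 100 GB/s interconnect, the bound gives b ≥ 3000 tokens per device; slower links push the minimum block size up, which is exactly where residual idle time would surface if the overlap assumption fails.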

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify how to better substantiate our claims. We respond to each major point below and will revise the manuscript to address the identified gaps.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the manuscript states that 'extensive experiments ... demonstrate the effectiveness' and 'no extra overhead,' yet provides no quantitative baselines, throughput numbers, memory scaling curves, or comparisons against prior memory-efficient attention implementations (e.g., FlashAttention or standard ring-allreduce attention). Without these data the central 'no additional overheads' claim cannot be assessed.

    Authors: We agree that the current manuscript does not provide sufficient quantitative evidence to fully support the 'no extra overhead' claim. The experiments primarily demonstrate scaling to million-token contexts and task-level improvements. In the revised version we will expand the Experiments section with: (i) throughput and latency measurements comparing Ring Attention against FlashAttention and standard ring-allreduce attention, (ii) memory-usage scaling curves across device counts and sequence lengths, and (iii) explicit overhead measurements for the ring communication phase. These additions will allow direct empirical assessment of the zero-overhead assertion. revision: yes

  2. Referee: [Method] Method (blockwise formulation): the claim that ring KV communication is 'fully overlapping' the blockwise attention and FFN compute is presented as an identity, but no analysis or bounds are given on when local compute time exceeds ring latency for given model dimension, block size, or interconnect bandwidth. If the assumption fails, residual synchronization or idle time would scale with device count and violate the zero-overhead guarantee.

    Authors: The blockwise formulation pipelines KV-block communication around the ring concurrently with local attention and FFN computation on the received block. We acknowledge that the manuscript presents this overlap as holding by construction without supplying explicit bounds or analysis on the required compute-to-communication ratio. In the revision we will add a dedicated subsection in the Method section that derives the conditions (in terms of model dimension d, block size b, and interconnect bandwidth) under which communication latency is fully hidden. We will also discuss the scaling implications when the assumption does not hold and quantify potential idle time. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction with independent design claims

full rationale

The paper presents Ring Attention as a direct algorithmic construction that splits sequences into blocks, computes attention and FFN blockwise, and pipelines ring communication of KV blocks to overlap with local compute. No equations, predictions, or results are shown to reduce by construction to fitted inputs, self-referential definitions, or unverified self-citations. The central claim of device-count scaling without added overheads follows from the explicit blockwise formulation and overlap assumption rather than any tautological renaming or parameter fitting. The derivation chain is self-contained as a systems-level design choice.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new free parameters, axioms, or invented entities; it relies on standard distributed systems assumptions about communication latency and hardware topology.

pith-pipeline@v0.9.0 · 5452 in / 965 out tokens · 43188 ms · 2026-05-12T19:23:21.107377+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  2. SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

    cs.CL 2026-05 unverdicted novelty 7.0

    SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.

  3. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  4. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  5. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  6. ShardTensor: Domain Parallelism for Scientific Machine Learning

    cs.DC 2026-05 unverdicted novelty 6.0

    ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

  7. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

  8. FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

  9. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  10. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  11. ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  12. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  13. ChipLight: Cross-Layer Optimization of Chiplet Design with Optical Interconnects for LLM Training

    cs.AR 2026-04 unverdicted novelty 6.0

    ChipLight is a multi-objective optimization framework that co-designs chiplet hardware, training parallelism, and optical networks to improve efficiency in distributed LLM training clusters.

  14. Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

  15. Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and ...

  16. CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

    cs.DC 2026-04 unverdicted novelty 6.0

    CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.

  17. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  18. DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

    cs.AR 2026-04 conditional novelty 6.0

    DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...

  19. GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.

  20. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  21. MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

  22. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  23. An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 5.0

    Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

  24. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  25. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  26. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  27. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 27 Pith papers · 4 internal anchors

  1. [1]

    Introducing claude, 2023

    Anthropic. Introducing claude, 2023. URL https://www.anthropic.com/index/introducing-claude

  2. [2]

    Parallel computing: Architectures, algorithms, and applications, volume 15

    Christian Bischof. Parallel computing: Architectures, algorithms, and applications, volume 15. IOS Press, 2008

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  4. [4]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021

  5. [5]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  6. [6]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org, 2023

  7. [7]

    Transformations to parallel codes for communication-computation overlap

    Anthony Danalis, Ki-Yong Kim, Lori Pollock, and Martin Swany. Transformations to parallel codes for communication-computation overlap. In SC’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, pages 58–58. IEEE, 2005

  8. [8]

    Mpi-aware compiler optimizations for improving communication-computation overlap

    Anthony Danalis, Lori Pollock, Martin Swany, and John Cavazos. Mpi-aware compiler optimizations for improving communication-computation overlap. In Proceedings of the 23rd international conference on Supercomputing, pages 316–325, 2009

  9. [9]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  10. [10]

    Large scale distributed deep networks

    Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. Advances in neural information processing systems, 25, 2012

  11. [11]

    Fully Sharded Data Parallel: faster AI training with fewer GPUs

    Facebook. Fully Sharded Data Parallel: faster AI training with fewer GPUs. https://engineering.fb.com/2021/07/15/open-source/fsdp/, 2023

  12. [12]

    Openllama: An open reproduction of llama

    Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama

  13. [13]

    Koala: A dialogue model for academic research

    Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April, 1, 2023

  14. [14]

    Bringing hpc techniques to deep learning

    Andrew Gibiansky. Bringing hpc techniques to deep learning. Baidu Research, Tech. Rep., 2017

  15. [15]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  16. [16]

    Building a fault tolerant mpi application: A ring communication example

    Joshua Hursey and Richard L Graham. Building a fault tolerant mpi application: A ring communication example. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pages 1549–1556. IEEE, 2011

  17. [17]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  18. [18]

    Reducing activation recomputation in large transformer models, 2022

    Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198, 2022

  19. [19]

    Urlb: Unsupervised reinforcement learning benchmark

    Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. Urlb: Unsupervised reinforcement learning benchmark. arXiv preprint arXiv:2110.15191, 2021

  20. [20]

    How long can open-source llms truly promise on context length?

    Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length?, June 2023. URL https://lmsys.org/blog/2023-06-29-longchat

  21. [21]

    Sequence parallelism: Long sequence training from system perspective

    Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, Toronto, Canada, July 2023

  22. [22]

    Sequence Parallelism: Long Sequence Training from System Perspective

    Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.134. URL https://aclanthology.org/2023.acl-long.134

  23. [23]

    Emergent agentic transformer from chain of hindsight experience

    Hao Liu and Pieter Abbeel. Emergent agentic transformer from chain of hindsight experience. International Conference on Machine Learning, 2023

  24. [24]

    Blockwise parallel transformer for large context models

    Hao Liu and Pieter Abbeel. Blockwise parallel transformer for large context models. Advances in neural information processing systems, 2023

  25. [25]

    Online normalizer calculation for softmax

    Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018

  26. [26]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms

    MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

  27. [27]

    URL https://www.mosaicml.com/blog/mpt-7b

  28. [28]

    Do transformer modifications transfer across implementations and applications?

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972, 2021

  29. [29]

    Pipedream: Generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

  30. [30]

    Memory-efficient pipeline-parallel dnn training

    Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021

  31. [31]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  32. [32]

    Self-attention does not need O(n²) memory

    Markus N Rabe and Charles Staats. Self-attention does not need O(n²) memory. arXiv preprint arXiv:2112.05682, 2021

  33. [33]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  34. [34]

    Chatgpt: Optimizing language models for dialogue

    J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng, J. F. C. Uribe, L. Fedus, L. Metz, M. Pokorny, R. G. Lopes, S. Zhao, A. Vijayvergiya, E. Sigler, A. Perelman, C. Voss, M. Heaton, J. Parish, D. Cummings, R. Nayak, V. Balcom, D. Schnurr, T. Kaftan, C. Hallacy, N. Turley, N. Deutsch, and V. Goel. Chatgpt: Optimizing language models for dialogue, 2022

  35. [35]

    URL https://openai.com/blog/chatgpt

  36. [36]

    Horovod: fast and easy distributed deep learning in tensorflow

    Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799, 2018

  37. [37]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  38. [38]

    Scaling laws vs model architectures: How does inductive bias influence scaling?

    Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551, 2022

  39. [39]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  40. [40]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  41. [41]

    Overlap communication with dependent computation via decomposition in large deep learning models

    Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, et al. Overlap communication with dependent computation via decomposition in large deep learning models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

  42. [42]

    Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning

    Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022