pith. machine review for the scientific record.

arxiv: 2310.01889 · v4 · submitted 2023-10-03 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, Pieter Abbeel

Pith reviewed 2026-05-12 19:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords ring attention · blockwise transformers · long context · distributed attention · self-attention · language modeling · reinforcement learning

The pith

Ring attention distributes sequences across devices to reach lengths proportional to device count without approximations or added overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers face memory limits that restrict sequence length, making it hard to use long-form inputs such as videos or extended action sequences in complex environments. The paper introduces Ring Attention with Blockwise Transformers, which partitions sequences into blocks and circulates key-value blocks around a ring of devices. Blockwise self-attention and feedforward computations run locally on each device while communication overlaps completely with computation. This setup supports sequences up to device-count times longer than previous memory-efficient methods allow. Experiments on language modeling and reinforcement learning confirm the approach reaches millions of tokens while preserving exact attention.

Core claim

We present Ring Attention with Blockwise Transformers, which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads.

What carries the argument

Ring Attention with Blockwise Transformers: partitions the sequence into blocks, computes attention and feedforward locally on each device, and passes key-value blocks around a ring topology so that communication overlaps fully with local computation.
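The mechanism is concrete enough to sketch. Below is a minimal single-process Python illustration, not the paper's JAX implementation: the ring rotation of key-value blocks is simulated by indexing rather than by device-to-device sends, and the streaming-softmax accumulators show why visiting KV blocks one at a time still yields exact attention.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: exact (non-causal) softmax attention over the full sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v, num_devices):
    """Blockwise attention with a simulated ring of KV blocks.

    Each "device" i holds query block i and, over num_devices ring steps,
    sees every KV block exactly once. A streaming (online) softmax keeps
    the result exact without ever materializing the full score matrix.
    """
    qs, ks, vs = (np.split(x, num_devices) for x in (q, k, v))
    out = []
    for i in range(num_devices):
        m = l = acc = None  # running max, normalizer, weighted value sum
        for step in range(num_devices):
            # In the real algorithm this block arrives from the ring
            # neighbor while the previous block is being processed;
            # here the "communication" is just indexing.
            j = (i + step) % num_devices
            s = qs[i] @ ks[j].T / np.sqrt(q.shape[-1])
            m_blk = s.max(axis=-1, keepdims=True)
            if m is None:
                m = m_blk
                p = np.exp(s - m)
                l, acc = p.sum(axis=-1, keepdims=True), p @ vs[j]
            else:
                m_new = np.maximum(m, m_blk)
                scale = np.exp(m - m_new)  # rescale old accumulators
                p = np.exp(s - m_new)
                l = l * scale + p.sum(axis=-1, keepdims=True)
                acc = acc * scale + p @ vs[j]
                m = m_new
        out.append(acc / l)
    return np.concatenate(out)
```

On random inputs this matches `full_attention` to floating-point precision, which is the "exact attention, no approximation" property the pith highlights.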

If this is right

  • Training and inference become feasible for sequences millions of tokens long.
  • Exact attention is preserved without approximations such as sparsity or low-rank methods.
  • No additional communication volume or computation is required beyond standard block operations.
  • Performance gains appear on long-context language modeling and reinforcement learning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other sequence models that rely on attention by applying the same block-ring pattern.
  • Linear scaling with device count suggests that larger clusters would directly yield proportionally longer usable contexts.
  • If overlap remains perfect at scale, hybrid systems combining ring attention with other parallelism techniques could reach even greater lengths.

Load-bearing premise

Blockwise attention and ring communication can be implemented with perfect overlap and no hidden synchronization or memory costs on real hardware and software stacks.

What would settle it

Measure wall-clock time and memory usage when scaling from one device with a baseline-length sequence to N devices with an N-times-longer sequence; any deviation from linear scaling in length or from zero extra overhead would falsify the claim.
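That experiment can be framed with a back-of-envelope cost model. The sketch below uses illustrative formulas, not measurements; `ring_attention_costs` and its constants are hypothetical. It predicts the signature linear scaling should leave: per-device activation memory stays flat as devices and sequence length grow together, while per-device attention FLOPs grow linearly.

```python
def ring_attention_costs(seq_len, num_devices, d_model, bytes_per_elem=2):
    """Back-of-envelope per-device costs for one exact-attention layer.

    Hypothetical cost model, not a measurement. Each device stores its
    query block plus two key-value blocks (the resident one and the one
    arriving from the ring), so activation memory is independent of the
    number of devices; compute still scans the whole sequence, so FLOPs
    per device grow with total length.
    """
    b = seq_len // num_devices                # per-device block length
    kv_block = 2 * b * d_model                # one K block + one V block
    mem_bytes = (b * d_model + 2 * kv_block) * bytes_per_elem
    flops = 4 * b * seq_len * d_model         # QK^T and PV, 2*b*L*d each
    return mem_bytes, flops
```

Doubling devices and sequence length together leaves `mem_bytes` unchanged and doubles `flops`; a real benchmark that deviates from that pattern in wall-clock time or memory would be evidence against the zero-overhead claim.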

read the original abstract

Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Ring Attention with Blockwise Transformers, an algorithmic construction that partitions long input sequences into blocks and distributes them across devices arranged in a ring topology. Blockwise self-attention and feed-forward computations are performed locally while key-value blocks are streamed around the ring; the central claim is that this fully overlaps communication with computation, enabling exact (non-approximate) attention on sequences up to N times longer than single-device limits on N devices, with no additional communication or computation overheads. Experiments on language modeling and reinforcement learning tasks are asserted to demonstrate effectiveness at million-token context sizes and performance gains.

Significance. If the zero-overhead and perfect-overlap claims hold under realistic hardware conditions, the method would provide a practical, exact-attention route to scaling context length linearly with device count. This is a meaningful advance over both memory-bound standard Transformers and approximation-based long-context techniques, with potential impact on long-form language modeling, video understanding, and sequential RL. The construction is parameter-free and does not introduce new learned components.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the manuscript states that 'extensive experiments ... demonstrate the effectiveness' and 'no extra overhead,' yet provides no quantitative baselines, throughput numbers, memory scaling curves, or comparisons against prior memory-efficient attention implementations (e.g., FlashAttention or standard ring-allreduce attention). Without these data the central 'no additional overheads' claim cannot be assessed.
  2. [Method] Method (blockwise formulation): the claim that ring KV communication is 'fully overlapping' the blockwise attention and FFN compute is presented as an identity, but no analysis or bounds are given on when local compute time exceeds ring latency for given model dimension, block size, or interconnect bandwidth. If the assumption fails, residual synchronization or idle time would scale with device count and violate the zero-overhead guarantee.
minor comments (1)
  1. [Notation / Method] Clarify in the notation section how block size is chosen relative to total sequence length and device count, and whether it must be uniform across devices.
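A first-order version of the bound the referee asks for in major comment 2 is easy to state. Assuming per-ring-step compute of roughly 4·b²·d FLOPs (QKᵀ plus PV against one key-value block) and a per-step transfer of 2·b·d elements, overlap holds when compute time covers transfer time, and the model dimension d cancels. The helper below is a hypothetical sketch of that arithmetic, not an analysis taken from the paper.

```python
def min_overlap_block_size(peak_flops, link_bandwidth, bytes_per_elem=2):
    """Smallest block size b for which ring KV transfer can hide behind
    blockwise attention compute, under a deliberately crude model:

      compute per ring step  ~ 4 * b^2 * d FLOPs   (QK^T and PV)
      transfer per ring step ~ 2 * b * d * bytes_per_elem bytes

    Overlap needs 4*b^2*d / peak_flops >= 2*b*d*bytes / link_bandwidth,
    i.e. b >= peak_flops * bytes_per_elem / (2 * link_bandwidth).
    """
    return peak_flops * bytes_per_elem / (2 * link_bandwidth)
```

For a hypothetical accelerator at 300 TFLOP/s with a 100 GB/s interconnect, the bound gives b ≥ 3000 tokens per device; slower links push the minimum block size up, which is exactly where residual idle time would surface if the overlap assumption fails.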

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify how to better substantiate our claims. We respond to each major point below and will revise the manuscript to address the identified gaps.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the manuscript states that 'extensive experiments ... demonstrate the effectiveness' and 'no extra overhead,' yet provides no quantitative baselines, throughput numbers, memory scaling curves, or comparisons against prior memory-efficient attention implementations (e.g., FlashAttention or standard ring-allreduce attention). Without these data the central 'no additional overheads' claim cannot be assessed.

    Authors: We agree that the current manuscript does not provide sufficient quantitative evidence to fully support the 'no extra overhead' claim. The experiments primarily demonstrate scaling to million-token contexts and task-level improvements. In the revised version we will expand the Experiments section with: (i) throughput and latency measurements comparing Ring Attention against FlashAttention and standard ring-allreduce attention, (ii) memory-usage scaling curves across device counts and sequence lengths, and (iii) explicit overhead measurements for the ring communication phase. These additions will allow direct empirical assessment of the zero-overhead assertion. revision: yes

  2. Referee: [Method] Method (blockwise formulation): the claim that ring KV communication is 'fully overlapping' the blockwise attention and FFN compute is presented as an identity, but no analysis or bounds are given on when local compute time exceeds ring latency for given model dimension, block size, or interconnect bandwidth. If the assumption fails, residual synchronization or idle time would scale with device count and violate the zero-overhead guarantee.

    Authors: The blockwise formulation pipelines KV-block communication around the ring concurrently with local attention and FFN computation on the received block. We acknowledge that the manuscript presents this overlap as holding by construction without supplying explicit bounds or analysis on the required compute-to-communication ratio. In the revision we will add a dedicated subsection in the Method section that derives the conditions (in terms of model dimension d, block size b, and interconnect bandwidth) under which communication latency is fully hidden. We will also discuss the scaling implications when the assumption does not hold and quantify potential idle time. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction with independent design claims

full rationale

The paper presents Ring Attention as a direct algorithmic construction that splits sequences into blocks, computes attention and FFN blockwise, and pipelines ring communication of KV blocks to overlap with local compute. No equations, predictions, or results are shown to reduce by construction to fitted inputs, self-referential definitions, or unverified self-citations. The central claim of device-count scaling without added overheads follows from the explicit blockwise formulation and overlap assumption rather than any tautological renaming or parameter fitting. The derivation chain is self-contained as a systems-level design choice.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new free parameters, axioms, or invented entities; it relies on standard distributed systems assumptions about communication latency and hardware topology.

pith-pipeline@v0.9.0 · 5452 in / 965 out tokens · 43188 ms · 2026-05-12T19:23:21.107377+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  2. SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

    cs.CL 2026-05 unverdicted novelty 7.0

    SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.

  3. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  4. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  5. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  6. ShardTensor: Domain Parallelism for Scientific Machine Learning

    cs.DC 2026-05 unverdicted novelty 6.0

    ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

  7. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

  8. FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

  9. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  10. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  11. ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  12. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  13. ChipLight: Cross-Layer Optimization of Chiplet Design with Optical Interconnects for LLM Training

    cs.AR 2026-04 unverdicted novelty 6.0

    ChipLight is a multi-objective optimization framework that co-designs chiplet hardware, training parallelism, and optical networks to improve efficiency in distributed LLM training clusters.

  14. Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

  15. Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and ...

  16. CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

    cs.DC 2026-04 unverdicted novelty 6.0

    CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.

  17. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  18. DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

    cs.AR 2026-04 conditional novelty 6.0

    DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...

  19. GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.

  20. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  21. MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

  22. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  23. An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 5.0

    Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

  24. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  25. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  26. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  27. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 27 Pith papers · 4 internal anchors

  1. [1]

    Introducing claude, 2023

    Anthropic. Introducing claude, 2023. URL https://www.anthropic.com/index/introducing-claude

  2. [2]

    Parallel computing: Architectures, algorithms, and applications, volume 15

    Christian Bischof. Parallel computing: Architectures, algorithms, and applications, volume 15. IOS Press, 2008

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  4. [4]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021

  5. [5]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  6. [6]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org, 2023

  7. [7]

    Transformations to parallel codes for communication-computation overlap

    Anthony Danalis, Ki-Yong Kim, Lori Pollock, and Martin Swany. Transformations to parallel codes for communication-computation overlap. In SC’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, pages 58–58. IEEE, 2005

  8. [8]

    Mpi-aware compiler optimizations for improving communication-computation overlap

    Anthony Danalis, Lori Pollock, Martin Swany, and John Cavazos. Mpi-aware compiler optimizations for improving communication-computation overlap. In Proceedings of the 23rd international conference on Supercomputing, pages 316–325, 2009

  9. [9]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  10. [10]

    Large scale distributed deep networks

    Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. Advances in neural information processing systems, 25, 2012

  11. [11]

    Fully Sharded Data Parallel: faster AI training with fewer GPUs

    Facebook. Fully Sharded Data Parallel: faster AI training with fewer GPUs. https://engineering.fb.com/2021/07/15/open-source/fsdp/, 2023

  12. [12]

    Openllama: An open reproduction of llama

    Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama

  13. [13]

    Koala: A dialogue model for academic research

    Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April, 1, 2023

  14. [14]

    Bringing hpc techniques to deep learning

    Andrew Gibiansky. Bringing hpc techniques to deep learning. Baidu Research, Tech. Rep., 2017

  15. [15]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  16. [16]

    Building a fault tolerant mpi application: A ring communication example

    Joshua Hursey and Richard L Graham. Building a fault tolerant mpi application: A ring communication example. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pages 1549–1556. IEEE, 2011

  17. [17]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  18. [18]

    Reducing activation recomputation in large transformer models, 2022

    Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198, 2022

  19. [19]

    Urlb: Unsupervised reinforcement learning benchmark

    Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. Urlb: Unsupervised reinforcement learning benchmark. arXiv preprint arXiv:2110.15191, 2021

  20. [20]

    How long can open-source llms truly promise on context length?

    Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length?, June 2023. URL https://lmsys.org/blog/2023-06-29-longchat

  21. [21]

    Sequence parallelism: Long sequence training from system perspective

    Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, Toronto, Canada, July 2023

  22. [22]

    Sequence Parallelism: Long Sequence Training from System Perspective

    Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.134. URL https://aclanthology.org/2023.acl-long.134

  23. [23]

    Emergent agentic transformer from chain of hindsight experience

    Hao Liu and Pieter Abbeel. Emergent agentic transformer from chain of hindsight experience. International Conference on Machine Learning, 2023

  24. [24]

    Blockwise parallel transformer for large context models

    Hao Liu and Pieter Abbeel. Blockwise parallel transformer for large context models. Advances in neural information processing systems, 2023

  25. [25]

    Online normalizer calculation for softmax

    Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018

  26. [26]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms

    MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

  27. [27]

    URL https://www.mosaicml.com/blog/mpt-7b

  28. [28]

    Do transformer modifications transfer across implementations and applications?

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972, 2021

  29. [29]

    Pipedream: Generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

  30. [30]

    Memory-efficient pipeline-parallel dnn training

    Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021

  31. [31]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  32. [32]

    Self-attention does not need O(n²) memory

    Markus N Rabe and Charles Staats. Self-attention does not need O(n²) memory. arXiv preprint arXiv:2112.05682, 2021

  33. [33]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  34. [34]

    Chatgpt: Optimizing language models for dialogue

    J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng, J. F. C. Uribe, L. Fedus, L. Metz, M. Pokorny, R. G. Lopes, S. Zhao, A. Vijayvergiya, E. Sigler, A. Perelman, C. Voss, M. Heaton, J. Parish, D. Cummings, R. Nayak, V. Balcom, D. Schnurr, T. Kaftan, C. Hallacy, N. Turley, N. Deutsch, and V. Goel. Chatgpt: Optimizing language models for dialogue, 2022

  35. [35]

    URL https://openai.com/blog/chatgpt

  36. [36]

    Horovod: fast and easy distributed deep learning in tensorflow

    Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799, 2018

  37. [37]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  38. [38]

    Scaling laws vs model architectures: How does inductive bias influence scaling?

    Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551, 2022

  39. [39]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  40. [40]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  41. [41]

    Overlap communication with dependent computation via decomposition in large deep learning models

    Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, et al. Overlap communication with dependent computation via decomposition in large deep learning models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

  42. [42]

    Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning

    Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022