Recognition: 2 theorem links · Lean Theorem
Ring Attention with Blockwise Transformers for Near-Infinite Context
Pith reviewed 2026-05-12 19:23 UTC · model grok-4.3
The pith
Ring attention distributes sequences across devices to reach lengths proportional to device count without approximations or added overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Ring Attention with Blockwise Transformers, which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads.
What carries the argument
Ring Attention with Blockwise Transformers: partitions the sequence into blocks, computes attention and feedforward locally on each device, and passes key-value blocks around a ring topology so that communication overlaps fully with local computation.
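To make the block-ring pattern concrete, below is a minimal single-process sketch in JAX (the paper's own implementation is in JAX, but this is not it): N devices are simulated as N blocks of one array, key-value blocks are rotated with jnp.roll in place of real ring communication (a distributed version would use jax.lax.ppermute and overlap it with the block matmuls), and the standard online-softmax accumulators keep per-block memory independent of total sequence length. The function name ring_attention_sim and all shapes are illustrative assumptions, and no causal mask is applied.

import jax
import jax.numpy as jnp

def ring_attention_sim(q, k, v):
    # q, k, v: [num_blocks, block_len, d_head]; returns exact (unmasked) attention.
    num_blocks, block_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    # Streaming-softmax accumulators, one set per query block ("per device").
    out = jnp.zeros_like(q)                                   # unnormalized weighted sum of V
    row_sum = jnp.zeros((num_blocks, block_len, 1))           # running softmax denominator
    row_max = jnp.full((num_blocks, block_len, 1), -jnp.inf)  # running max for stability

    def ring_step(carry, _):
        out, row_sum, row_max, k_blk, v_blk = carry
        scores = jnp.einsum("bqd,bkd->bqk", q, k_blk) * scale         # local block attention
        new_max = jnp.maximum(row_max, scores.max(-1, keepdims=True))
        corr = jnp.exp(row_max - new_max)                             # rescale old accumulators
        p = jnp.exp(scores - new_max)
        out = out * corr + jnp.einsum("bqk,bkd->bqd", p, v_blk)
        row_sum = row_sum * corr + p.sum(-1, keepdims=True)
        # Pass KV blocks to the next position in the ring (stand-in for ppermute).
        k_blk = jnp.roll(k_blk, shift=1, axis=0)
        v_blk = jnp.roll(v_blk, shift=1, axis=0)
        return (out, row_sum, new_max, k_blk, v_blk), None

    (out, row_sum, _, _, _), _ = jax.lax.scan(
        ring_step, (out, row_sum, row_max, k, v), None, length=num_blocks)
    return out / row_sum

# Sanity check against full (quadratic-memory) attention on a toy size.
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (4, 8, 16))   # 4 "devices", 8 tokens each, head dim 16
k = jax.random.normal(kk, (4, 8, 16))
v = jax.random.normal(kv, (4, 8, 16))
ref = jax.nn.softmax(q.reshape(32, 16) @ k.reshape(32, 16).T / jnp.sqrt(16.0)) @ v.reshape(32, 16)
print(jnp.allclose(ring_attention_sim(q, k, v).reshape(32, 16), ref, atol=1e-4))

Because the rotation only changes the order in which KV blocks are visited and the online softmax is exact, the result matches full attention; what the sketch cannot show is the overlap of communication with compute, which is the part the reviewed claim rests on.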
If this is right
- Training and inference become feasible for sequences millions of tokens long.
- Exact attention is preserved without approximations such as sparsity or low-rank methods.
- No additional communication volume or computation is required beyond standard block operations.
- Performance gains appear on long-context language modeling and reinforcement learning tasks.
Where Pith is reading between the lines
- The method could extend to other sequence models that rely on attention by applying the same block-ring pattern.
- Linear scaling with device count suggests that larger clusters would directly yield proportionally longer usable contexts.
- If overlap remains perfect at scale, hybrid systems combining ring attention with other parallelism techniques could reach even greater lengths.
Load-bearing premise
Blockwise attention and ring communication can be implemented with perfect overlap and no hidden synchronization or memory costs on real hardware and software stacks.
What would settle it
Measure wall-clock time and memory usage when scaling from one device with a baseline-length sequence to N devices with an N-times-longer sequence; any deviation from linear scaling in length or from zero extra overhead would falsify the claim.
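One way to make that measurement precise (a sketch under the assumption that exact attention's quadratic cost is the only term that grows and all other per-token work is linear; the notation is mine, not the paper's): write T(N, L) for wall-clock per step on N devices at sequence length L, M(N, L) for peak per-device memory, and split the single-device baseline at length s into an attention part T_att(1, s) and everything else T_rest(1, s). With the sequence scaled to N·s, each device still holds s tokens but attends to all N·s keys over the N ring steps, so per-device attention work grows by a factor of N while per-device memory should stay essentially flat:

\begin{align*}
  &T(N, Ns) \;\approx\; N\,T_{\mathrm{att}}(1, s) \;+\; T_{\mathrm{rest}}(1, s),
  \qquad
  M(N, Ns) \;\approx\; M(1, s) + O(\text{one KV block}) \\[4pt]
  &\varepsilon(N) \;=\; \frac{T(N, Ns)}{N\,T_{\mathrm{att}}(1, s) + T_{\mathrm{rest}}(1, s)} \;-\; 1
\end{align*}

The zero-overhead claim predicts epsilon(N) near zero at every device count; epsilon that grows with N, or per-device memory that grows with N, would indicate imperfect overlap or hidden synchronization and memory costs.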
Original abstract
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Ring Attention with Blockwise Transformers, an algorithmic construction that partitions long input sequences into blocks and distributes them across devices arranged in a ring topology. Blockwise self-attention and feed-forward computations are performed locally while key-value blocks are streamed around the ring; the central claim is that this fully overlaps communication with computation, enabling exact (non-approximate) attention on sequences up to N times longer than single-device limits on N devices, with no additional communication or computation overheads. Experiments on language modeling and reinforcement learning tasks are asserted to demonstrate effectiveness at million-token context sizes and performance gains.
Significance. If the zero-overhead and perfect-overlap claims hold under realistic hardware conditions, the method would provide a practical, exact-attention route to scaling context length linearly with device count. This is a meaningful advance over both memory-bound standard Transformers and approximation-based long-context techniques, with potential impact on long-form language modeling, video understanding, and sequential RL. The construction is parameter-free and does not introduce new learned components.
Major comments (2)
- [Abstract / Experiments] The manuscript states that 'extensive experiments ... demonstrate the effectiveness' and claims 'no extra overhead', yet provides no quantitative baselines, throughput numbers, memory-scaling curves, or comparisons against prior memory-efficient attention implementations (e.g., FlashAttention or standard ring-allreduce attention). Without these data the central 'no additional overheads' claim cannot be assessed.
- [Method] Blockwise formulation: the claim that ring KV communication fully overlaps with blockwise attention and FFN compute is presented as holding by construction, but no analysis or bounds are given for when local compute time exceeds ring communication latency at a given model dimension, block size, or interconnect bandwidth. If that assumption fails, residual synchronization or idle time would grow with device count and violate the zero-overhead guarantee.
Minor comments (1)
- [Notation / Method] Clarify in the notation section how block size is chosen relative to total sequence length and device count, and whether it must be uniform across devices.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify how to better substantiate our claims. We respond to each major point below and will revise the manuscript to address the identified gaps.
Point-by-point responses
- Referee: [Abstract / Experiments] The manuscript states that 'extensive experiments ... demonstrate the effectiveness' and claims 'no extra overhead', yet provides no quantitative baselines, throughput numbers, memory-scaling curves, or comparisons against prior memory-efficient attention implementations (e.g., FlashAttention or standard ring-allreduce attention). Without these data the central 'no additional overheads' claim cannot be assessed.
Authors: We agree that the current manuscript does not provide sufficient quantitative evidence to fully support the 'no extra overhead' claim. The experiments primarily demonstrate scaling to million-token contexts and task-level improvements. In the revised version we will expand the Experiments section with: (i) throughput and latency measurements comparing Ring Attention against FlashAttention and standard ring-allreduce attention, (ii) memory-usage scaling curves across device counts and sequence lengths, and (iii) explicit overhead measurements for the ring communication phase. These additions will allow direct empirical assessment of the zero-overhead assertion. revision: yes
- Referee: [Method] Blockwise formulation: the claim that ring KV communication fully overlaps with blockwise attention and FFN compute is presented as holding by construction, but no analysis or bounds are given for when local compute time exceeds ring communication latency at a given model dimension, block size, or interconnect bandwidth. If that assumption fails, residual synchronization or idle time would grow with device count and violate the zero-overhead guarantee.
Authors: The blockwise formulation pipelines KV-block communication around the ring concurrently with local attention and FFN computation on the received block. We acknowledge that the manuscript presents this overlap as holding by construction without supplying explicit bounds or analysis on the required compute-to-communication ratio. In the revision we will add a dedicated subsection in the Method section that derives the conditions (in terms of model dimension d, block size b, and interconnect bandwidth) under which communication latency is fully hidden. We will also discuss the scaling implications when the assumption does not hold and quantify potential idle time. revision: yes
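For orientation, a back-of-the-envelope version of the condition this response promises (a sketch under simplifying assumptions, not the manuscript's derivation): write c for tokens per block per device, d for the model (head) dimension, F for sustained device compute in FLOP/s, B for ring link bandwidth in bytes/s, and e for bytes per element, and ignore the FFN compute that also overlaps (it only loosens the bound).

\begin{align*}
  t_{\mathrm{comp}} &\approx \frac{4\,c^{2}\,d}{F}
    && \text{one } QK^{\top} \text{ and one } PV \text{ block matmul per ring step} \\
  t_{\mathrm{comm}} &\approx \frac{2\,c\,d\,e}{B}
    && \text{send and receive one } K \text{ and one } V \text{ block} \\
  t_{\mathrm{comp}} \;\ge\; t_{\mathrm{comm}}
    &\;\Longleftrightarrow\; c \;\ge\; \frac{F\,e}{2\,B}
    && \text{below this block size, ring latency leaks into the critical path}
\end{align*}

This matches the qualitative statement in the rebuttal: overlap holds only when the block size is large enough relative to the device's compute-to-bandwidth ratio, and the promised subsection would presumably refine it with FFN compute, causal masking, and latency terms.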
Circularity Check
No circularity: algorithmic construction with independent design claims
Full rationale
The paper presents Ring Attention as a direct algorithmic construction that splits sequences into blocks, computes attention and FFN blockwise, and pipelines ring communication of KV blocks to overlap with local compute. No equations, predictions, or results are shown to reduce by construction to fitted inputs, self-referential definitions, or unverified self-citations. The central claim of device-count scaling without added overheads follows from the explicit blockwise formulation and overlap assumption rather than any tautological renaming or parameter fitting. The derivation chain is self-contained as a systems-level design choice.
Forward citations
Cited by 27 Pith papers
- Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
- SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
- Internalized Reasoning for Long-Context Visual Document Understanding
A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
- Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
- FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
- ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
- ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
- Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
- ChipLight: Cross-Layer Optimization of Chiplet Design with Optical Interconnects for LLM Training
ChipLight is a multi-objective optimization framework that co-designs chiplet hardware, training parallelism, and optical networks to improve efficiency in distributed LLM training clusters.
- Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
- Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and ...
- CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
- LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
- DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...
- GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.
- MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
- Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
- An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.