HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models

Bingyi Jing; Cong Lin; Jiaxing Zhang; Junyu Lu; Songxin Zhang; Zejian Xie; Zhuoyang Song

arxiv: 2606.30460 · v2 · pith:L46JMFBTnew · submitted 2026-06-29 · 💻 cs.LG · cs.DC

Songxin Zhang, Zejian Xie, Zhuoyang Song, Cong Lin, Junyu Lu, Jiaxing Zhang, Bingyi Jing

Pith reviewed 2026-07-01 06:55 UTC · model grok-4.3

classification 💻 cs.LG cs.DC

keywords sequence parallelism · hybrid-context sequences · causal attention · packed sequences · large language models · JIT compilation · hierarchical parallelism · generative models

The pith

A sequence-aware parallelism algorithm correctly computes causal attention on hybrid-context packed sequences across device groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to combine the strengths of existing sequence parallelism methods while fixing their inability to handle hybrid-context packed sequences without cross-contamination in attention. It introduces a Sequence-Aware Parallelism (SAP) algorithm that uses JIT compilation to optimize, at the NCCL level, the tensor transmission and partial attention computation spread across multiple device groups. The algorithm is then embedded in a Hierarchical Sequence-Aware Parallelism (HSAP) framework with explicit memory and communication management. A sympathetic reader would care because packing sequences is a standard efficiency technique in large model training, and prior parallelism approaches either break correctness or limit scaling when contexts are mixed.
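To make the transmission argument concrete, here is a minimal sketch of why knowing the document boundaries can shrink the set of key/value shards a device has to receive. It is not the paper's algorithm: the function name `required_kv_shards`, the even contiguous sharding, and the document lengths are all assumptions made for illustration.

```python
from itertools import accumulate

def required_kv_shards(doc_lens, num_shards):
    """For each query shard, list the key/value shards it needs.

    Assumes the packed sequence is cut into equal contiguous shards,
    one per device (a hypothetical layout, not taken from the paper).
    """
    ids = [d for d, n in enumerate(doc_lens) for _ in range(n)]
    total = len(ids)
    assert total % num_shards == 0, "sketch assumes an even split"
    size = total // num_shards
    starts = [0] + list(accumulate(doc_lens))  # start offset of each document
    plan = {}
    for q in range(num_shards):
        # The earliest key any query in this shard may see is the start
        # of the document that the shard's first token belongs to.
        first_doc = ids[q * size]
        first_kv = starts[first_doc] // size
        plan[q] = list(range(first_kv, q + 1))
    return plan

plan = required_kv_shards([4, 4, 8], num_shards=4)
print(plan)  # {0: [0], 1: [1], 2: [2], 3: [2, 3]}
```

With plain causal attention over one long document, the four query shards would need 1 + 2 + 3 + 4 = 10 shard pairs; with the three packed documents above, only 5 pairs carry any unmasked attention. A plan of this kind is the sort of input a compile step could turn into a communication schedule.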

Core claim

The authors propose an efficient Sequence-Aware Parallelism algorithm that addresses intensive tensor transmission and partial attention computation across device groups by leveraging JIT compilation to optimize the communication strategy at the NCCL level, then integrate existing paradigms into a Hierarchical Sequence-Aware Parallelism framework that manages memory and communication overhead to support correct causal attention on hybrid-context sequences.

What carries the argument

The Sequence-Aware Parallelism algorithm with JIT-optimized NCCL communication, which enables correct partial attention computation on hybrid-context sequences across device groups inside the hierarchical framework.
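The paper's own partial-attention equations are not reproduced on this page. For orientation only, the usual way to combine attention computed over disjoint key blocks is the online-softmax rescaling used by FlashAttention-style kernels; whether SAP uses exactly this form is an assumption. For a single query, let $s_j$ be the masked score against key $j$ (with $s_j = -\infty$ where the mask forbids attention), $v_j$ the value vector, and $B_1, B_2$ two key blocks held by different device groups:

```latex
\begin{aligned}
m_b &= \max_{j \in B_b} s_j, &
\ell_b &= \sum_{j \in B_b} e^{s_j - m_b}, &
o_b &= \frac{1}{\ell_b} \sum_{j \in B_b} e^{s_j - m_b}\, v_j, \qquad b \in \{1, 2\}, \\[4pt]
m &= \max(m_1, m_2), &
\ell &= e^{m_1 - m}\,\ell_1 + e^{m_2 - m}\,\ell_2, &
o &= \frac{e^{m_1 - m}\,\ell_1\, o_1 + e^{m_2 - m}\,\ell_2\, o_2}{\ell}.
\end{aligned}
```

The merged $o$ equals softmax attention over $B_1 \cup B_2$. A block that is fully masked for the query has no finite score and is simply skipped, which is what makes the document mask compatible with block-wise computation.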

If this is right

Higher degrees of sequence-length parallelism become possible during pretraining and fine-tuning without sacrificing attention correctness.
Packed sequences mixing multiple contexts can be processed in parallel while preserving the causal mask integrity.
Communication volume between device groups is reduced through NCCL-level JIT optimizations.
The hierarchical framework allows combining multiple existing parallelism paradigms with controlled memory overhead.
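The "causal mask integrity" mentioned above can be illustrated with a small sketch. It builds the mask for a packed sequence in which a query may attend only to earlier-or-equal positions of its own document. The helper name `packed_causal_mask` and the document lengths are illustrative assumptions, not the paper's code.

```python
def packed_causal_mask(doc_lens):
    """mask[i][j] is True when query i may attend to key j.

    Two conditions must both hold: j is not in the future (causal),
    and i and j belong to the same packed document (no contamination).
    """
    ids = [d for d, n in enumerate(doc_lens) for _ in range(n)]
    total = len(ids)
    return [[j <= i and ids[i] == ids[j] for j in range(total)]
            for i in range(total)]

mask = packed_causal_mask([2, 3])  # two documents of length 2 and 3
for row in mask:
    print("".join("x" if ok else "." for ok in row))
# x....
# xx...
# ..x..
# ..xx.
# ..xxx
```

The mask is block-diagonal with a lower-triangular block per document. A plain causal mask would also fill the lower-left rectangle, which is exactly the cross-contamination the paper sets out to avoid.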

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same communication optimization pattern could be applied to other collective operations beyond attention in distributed training.
The approach might reduce the need for padding in mixed-length datasets, lowering overall memory use during fine-tuning.
If the JIT strategy generalizes, it could support dynamic sequence packing schedules that change per training step.
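The padding point can be checked with simple arithmetic. The sketch below compares the tokens processed when every sample is padded to a fixed length against the tokens processed after packing. The sample lengths are made up, and the greedy first-fit-decreasing heuristic is a generic choice, not something taken from the paper.

```python
def padded_tokens(lengths, max_len):
    """Tokens processed when every sample is padded to max_len."""
    return len(lengths) * max_len

def first_fit_pack(lengths, max_len):
    """Greedy first-fit-decreasing packing into rows of capacity max_len."""
    bins = []
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:  # no existing row had room
            bins.append([n])
    return bins

lengths = [7, 3, 5, 1, 8, 2, 6]  # made-up sample lengths
bins = first_fit_pack(lengths, max_len=8)
print(bins)                       # [[8], [7, 1], [6, 2], [5, 3]]
print(padded_tokens(lengths, 8))  # 56
print(len(bins) * 8)              # 32
```

In this toy case packing removes all padding (32 real tokens in 32 slots instead of 56), but every packed row now mixes documents, so the savings depend on the attention mask being handled correctly.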

Load-bearing premise

The sequence-aware algorithm with JIT compilation can correctly compute causal attention on hybrid-context sequences across device groups without introducing errors or prohibitive communication overhead.

What would settle it

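Run the proposed parallel implementation and a single-device reference on identical hybrid-context packed input sequences and check whether the attention output tensors agree to within floating-point tolerance; the reordered reductions of a parallel run make bit-exact equality too strict a test.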
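A toy version of that check, assuming nothing about the paper's implementation, might look like the following. The "shards" are simulated by a loop on one machine (no devices, no NCCL), partial results are merged with online-softmax rescaling, and the outputs are compared with a tolerance because the summation order differs from the reference. All names and sizes are illustrative.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reference_attention(q, k, v, ids):
    """Single-pass attention with a per-document causal mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j in range(i + 1) if ids[j] == ids[i]]
        s = [dot(qi, k[j]) * scale for j in allowed]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(wj * v[j][c] for wj, j in zip(w, allowed)) / z
                    for c in range(len(v[0]))])
    return out

def sharded_attention(q, k, v, ids, num_shards):
    """Same mask, but keys/values are visited one shard at a time and
    the partial results are merged with online-softmax rescaling."""
    n, dv = len(q), len(v[0])
    assert n % num_shards == 0, "sketch assumes an even split"
    size = n // num_shards
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        m, z, acc = -math.inf, 0.0, [0.0] * dv
        for s in range(num_shards):
            js = [j for j in range(s * size, (s + 1) * size)
                  if j <= i and ids[j] == ids[i]]
            if not js:
                continue  # shard is fully masked for this query
            sc = [dot(qi, k[j]) * scale for j in js]
            m_new = max(m, max(sc))
            alpha = math.exp(m - m_new)  # rescales the running totals
            w = [math.exp(x - m_new) for x in sc]
            z = z * alpha + sum(w)
            acc = [a * alpha + sum(wj * v[j][c] for wj, j in zip(w, js))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / z for a in acc])
    return out

random.seed(0)
doc_lens = [3, 5, 4]  # three packed documents, 12 tokens in total
ids = [d for d, n in enumerate(doc_lens) for _ in range(n)]
n, dim = len(ids), 4
q, k, v = ([[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
           for _ in range(3))

ref = reference_attention(q, k, v, ids)
par = sharded_attention(q, k, v, ids, num_shards=4)
max_err = max(abs(a - b) for r, p in zip(ref, par) for a, b in zip(r, p))
print("max abs difference:", max_err)
```

A real test would replace `sharded_attention` with the distributed implementation and sweep document layouts in which boundaries fall inside, on, and across shard edges.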

Figures

Figures reproduced from arXiv: 2606.30460 by Bingyi Jing, Cong Lin, Jiaxing Zhang, Junyu Lu, Songxin Zhang, Zejian Xie, Zhuoyang Song.

**Figure 2.** SAP’s just-in-time compile-execute architecture. According to the structure of hybrid-context, attention is …

**Figure 3.** Compilation Algorithms for Computationally Efficient Communication Strategies.

**Figure 4.** The hierarchical network hardware topology.

**Figure 5.** Evaluation Megatron vs ColAL-SP vs Ulysses

Original abstract

In this paper, we aim to combine the advantages of existing sequence parallelism paradigms and overcomes their drawbacks, the most serious of which is the incapability to correctly compute causal attention on the hybrid-context packed sequences, in a stronger sequence parallelism framework. The practical technique of packing sequences for efficiently pretraining and fine-tuning large language models causes cross-contamination problem in attention computation, which can be effectively solved when no parallelism in the sequence length dimension is taken. However, in sequence parallelism, existing approaches either ignore the scenario of hybrid-context sequences or conversely sacrifice and limit parallelism degree for supporting the scenario. To this end, we innovatively propose an efficient Sequence-Aware Parallelism algorithm to conquer the obstacles of intensive tensor transmission and partial attention computation across multiple device groups. Our algorithm utilizes JIT (Just-In-Time) compilation to optimize the communication strategy of all device groups in NCCL level. Further, we integrate existing sequence parallelism paradigms into a Hierarchical Sequence-Aware Parallelism framework which benefits from our sequence-aware algorithm. We additionally elaborate on the memory and communication overhead management of the hierarchical framework to optimize its performance. Through multiple experiments, we demonstrate that our proposed approach outperform other state-of-the-arts sequence parallelism approches in multiple metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HSAP claims to fix causal attention for packed hybrid-context sequences under sequence parallelism via a sequence-aware algorithm and hierarchical framework, but the abstract supplies no mechanism, equations, or verification for how causality is preserved across device groups.

The main takeaway is that this paper identifies a practical issue in LLM pretraining—packing sequences of different contexts breaks causal attention under sequence parallelism—and proposes HSAP as a fix through a sequence-aware algorithm plus JIT-optimized communication at the NCCL level, wrapped in a hierarchical integration of existing paradigms.

It does a reasonable job naming the trade-offs in prior work: either ignore hybrid contexts or cap the degree of parallelism. The focus on managing memory and communication overhead in the hierarchical setup shows some attention to real deployment constraints.

The soft spot is the absence of any concrete mechanism. No equations, pseudocode, or mask-construction details explain how partial attention is computed without cross-contamination or causality violations when sequences span device groups. The abstract asserts that the sequence-aware approach can “conquer the obstacles,” but nothing shows the actual computation or verification steps. The experiments are said to outperform state-of-the-art methods on multiple metrics, yet no setup, baselines, or error analysis appear, so those claims cannot be assessed.

This is aimed at engineers and researchers working on distributed training systems for large models. A reader already deep in sequence parallelism might pick up the high-level framing as a prompt for their own thinking, but the missing technical substance limits how far the ideas can travel.

It deserves peer review so referees can check whether the full manuscript supplies the missing verification and reproducible results; the underlying problem is real enough to warrant that step.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes HSAP, a Hierarchical Sequence-aware Parallelism framework for hybrid-context generative models. It introduces a Sequence-Aware Parallelism algorithm that uses JIT compilation to optimize communication in NCCL for correctly computing causal attention on packed hybrid-context sequences across device groups, addressing cross-contamination issues in sequence parallelism. The framework integrates existing paradigms hierarchically, manages memory and communication overheads, and claims superior performance over state-of-the-art approaches in multiple experiments and metrics.

Significance. If the correctness of the causal attention computation is verified and the performance gains hold, this work could significantly advance sequence parallelism techniques for efficient pretraining and fine-tuning of large language models by allowing higher parallelism degrees without sacrificing correctness on packed sequences. The use of JIT compilation at the NCCL level for sequence-aware communication strategies is a potentially novel optimization if demonstrated.

major comments (2)

Abstract: the claim that the Sequence-Aware Parallelism algorithm 'conquers the obstacles of ... partial attention computation across multiple device groups' is unsupported by any equations, pseudocode, mask construction, or partial-attention formula demonstrating how causality is preserved when attention spans device groups on hybrid-context sequences.
Abstract: the assertion that 'Through multiple experiments, we demonstrate that our proposed approach outperform other state-of-the-arts sequence parallelism approches in multiple metrics' supplies no experimental setup, baselines, datasets, quantitative results, or error analysis, so the central outperformance claim cannot be assessed.

minor comments (2)

Abstract: grammatical issues include 'outperform' (should be 'outperforms'), 'approches' (should be 'approaches'), and 'state-of-the-arts' (should be 'state-of-the-art').
Abstract: the description of the hierarchical integration and overhead management is high-level and would benefit from additional concrete details on implementation even if not load-bearing for the main claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below. The abstract is intentionally concise, but we agree it can be strengthened with explicit references to the supporting material in the body of the paper.

Referee: Abstract: the claim that the Sequence-Aware Parallelism algorithm 'conquers the obstacles of ... partial attention computation across multiple device groups' is unsupported by any equations, pseudocode, mask construction, or partial-attention formula demonstrating how causality is preserved when attention spans device groups on hybrid-context sequences.

Authors: The abstract summarizes the contribution at a high level. The full description of the Sequence-Aware Parallelism algorithm, including the equations for partial attention, the mask construction for preserving causality across device groups, the pseudocode, and the JIT-optimized NCCL communication strategy, appears in Section 3. We will revise the abstract to include a brief parenthetical reference to Section 3 so that the claim is explicitly tied to the supporting technical material.

revision: yes

Referee: Abstract: the assertion that 'Through multiple experiments, we demonstrate that our proposed approach outperform other state-of-the-arts sequence parallelism approches in multiple metrics' supplies no experimental setup, baselines, datasets, quantitative results, or error analysis, so the central outperformance claim cannot be assessed.

Authors: The abstract again serves as a summary. Complete experimental details (setup, baselines including the compared sequence-parallelism methods, datasets, quantitative tables with metrics and speedups, and error analysis) are presented in Section 5. We will revise the abstract to incorporate one or two key quantitative results (e.g., relative throughput or memory savings) together with a reference to Section 5, making the outperformance claim directly assessable from the abstract.

revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic proposal with no derivation chain or fitted quantities

The paper presents HSAP as a new algorithmic framework for sequence parallelism, describing its design, JIT-based communication optimization, hierarchical integration, and experimental outperformance. No equations, parameters fitted to data subsets, or self-citation chains appear in the provided text that reduce any claimed result to its own inputs by construction. The central claims concern the correctness and efficiency of the proposed method rather than a mathematical derivation, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, derivations, or implementation details from which to extract free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5766 in / 1073 out tokens · 39074 ms · 2026-07-01T06:55:42.870371+00:00 · methodology


[1] [2]

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, PeterJ. , year=. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , journal=

[2] [3]

Online normalizer calculation for softmax

Milakov, Maxim and Gimelshein, Natalia , year=. Online normalizer calculation for softmax. , journal=

[3] [4]

2023 , month=

Structured Packing in LLM Training Improves Long Context Utilization , author=. 2023 , month=

2023

[4] [5]

LLaMA: Open and Efficient Foundation Language Models

LLaMA: Open and Efficient Foundation Language Models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [6]

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%...

[6] [7]

Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems.

[7] [8]

Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems.

[8] [10]

How Long Can Context Length of Open-Source LLMs truly Promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

[9] [11]

MosaicML NLP Team. 2023.

[10] [13]

YaRN: Efficient Context Window Extension of Large Language Models. arXiv preprint arXiv:2309.00071.

[11] [14]

Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.

[12] [15]

Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13] [16]

Combiner: Full attention transformer with sparse computation cost. Advances in Neural Information Processing Systems.

[14] [17]

Efficient Content-Based Sparse Attention with Routing Transformers. Transactions of the Association for Computational Linguistics.

[15] [19]

Ring Attention with Blockwise Transformers for Near-Infinite Context. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

[16] [21]

Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems.

[17] [23]

Colossal-ai: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing.

[18] [24]

LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023).

[19] [25]

Ultra-Long Sequence Distributed Transformer. arXiv preprint arXiv:2311.02382.

[20] [26]

Reformer: The Efficient Transformer. In International Conference on Learning Representations.

[21] [27]

Linformer: Self-Attention with Linear Complexity. arXiv preprint arXiv:2006.04768.

[22] [28]

Luna: Linear Unified Nested Attention. Advances in Neural Information Processing Systems.

[23] [29]

Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems.

[24] [30]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations.

[25] [31]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

[26] [32]

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Zidek, Anna Potapenko, .., Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with... doi:10.1038/s41586-021-03819-2

[27] [33]

Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces.

[28] [34]

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kira.. GV, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. RWKV: Reinventing RNNs for the Transformer Era.

[29] [35]

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.

[30] [36]

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.

[31] [37]

Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. In-Context Pretraining: Language Modeling Beyond Document Boundaries.

[32] [38]

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. doi:10.1126/science.ade2574

[33] [39]

Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, and Ross Goroshin. Block-State Transformers.

[34] [40]

01-ai. 01-ai/Yi: A series of large language models trained from scratch by developers @01-ai.

[35] [41]

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA model.

[36] [42]

Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers.

[37] [43]

Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance.

[38] [44]

Harm De Vries.

[39] [45]

World Model on Million-Length Video And Language With Blockwise RingAttention. 2024.

[40] [46]

Video generation models as world simulators. 2024.

[41] [47]

GPT-4 Technical Report. 2023.

[42] [48]

Gemini: A Family of Highly Capable Multimodal Models. 2023.

[43] [50]

Together Computer.

[44] [51]

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. 2023.

[45] [52]

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. 2024. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators

[46] [53]

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 1877–1901

[47] [54]

Together Computer. 2023. Redpajama: an open dataset for training large language models. https://github.com/togethercomputer/RedPajama-Data

[48] [55]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359

[49] [56]

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. Preprint, arXiv:2305.14233. https://arxiv.org/abs/2305.14233

[50] [57]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Millican..na, Tim Green, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals. 2023. Gemini: A family of highly capable multimodal models. ArXiv:2312.... https://doi.org/10.48550/arXiv.2312.11805

[51] [58]

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509


[52] [59]

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5


[53] [60]

Mario Michael Krell, Matej Kosec, Sergio P. Perez, and Andrew Fitzgibbon. 2022. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance. arXiv:2107.02027 [cs, math]. https://doi.org/10.48550/arXiv.2107.02027

[54] [61]

Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023a. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

[55] [62]

Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Joseph E Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023b. Lightseq: Sequence level parallelism for distributed training of long context transformers. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023)

[56] [63]

Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023c. Colossal-ai: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing, pages 766–775

[57] [64]

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. 2023d. Sequence parallelism: Long sequence training from system perspective. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, Toronto, Canada. Associati... https://doi.org/10.18653/v1/2023.acl-long.134

[58] [65]

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. 2024. World model on million-length video and language with blockwise ringattention. Preprint, arXiv:2402.08268. https://arxiv.org/abs/2402.08268

[59] [66]

Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023a. Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

[60] [67]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023b. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172

[61] [68]

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. 2024. Fineweb-edu. https://doi.org/10.57967/hf/2497

[62] [69]

Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax. arXiv: Performance

[63] [70]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, ..rvin Anadkat, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2023. Gpt-4 technical report. ArXiv:2303.08774 [cs]. https://doi.org/10.48550/arXiv.2303.08774

[64] [71]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv: Learning

[65] [72]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed. https://doi.org/10.1145/3394486.3406703

[66] [73]

Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, and Mike Lewis. 2023. In-context pretraining: Language modeling beyond document boundaries. arXiv:2310.10638 [cs]. https://doi.org/10.48550/arXiv.2310.10638

[67] [74]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053


[68] [75]

Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, Henryk Michalewski, Łukasz Kuciński, and Piotr Miłoś. 2023. Structured packing in llm training improves long context utilization

[69] [76]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010

[70] [77]

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. 2023. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039
