HSAP: A Hierarchical Sequence-aware Parallelism for Hybrid-Context Generative Models
Pith reviewed 2026-07-01 06:55 UTC · model grok-4.3
The pith
A sequence-aware parallelism algorithm correctly computes causal attention on hybrid-context packed sequences across device groups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose an efficient Sequence-Aware Parallelism algorithm that addresses intensive tensor transmission and partial attention computation across device groups by leveraging JIT compilation to optimize the communication strategy at the NCCL level, then integrate existing paradigms into a Hierarchical Sequence-Aware Parallelism framework that manages memory and communication overhead to support correct causal attention on hybrid-context sequences.
What carries the argument
The Sequence-Aware Parallelism algorithm with JIT-optimized NCCL communication, which enables correct partial attention computation on hybrid-context sequences across device groups inside the hierarchical framework.
If this is right
- Higher degrees of sequence-length parallelism become possible during pretraining and fine-tuning without sacrificing attention correctness.
- Packed sequences mixing multiple contexts can be processed in parallel while preserving the causal mask integrity.
- Communication volume between device groups is reduced through NCCL-level JIT optimizations.
- The hierarchical framework allows combining multiple existing parallelism paradigms with controlled memory overhead.
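The "causal mask integrity" mentioned in the list above can be made concrete with a small sketch. This is not the paper's mask construction, which the abstract does not show. It is a minimal NumPy illustration of the usual rule for packed sequences: a query may attend to a key only if both tokens belong to the same packed document and the key is not later than the query. The function name `packed_causal_mask` and the example lengths are hypothetical.

```python
import numpy as np

def packed_causal_mask(doc_lengths):
    """Boolean mask for a packed sequence: True where query i may attend to key j."""
    # doc_ids[t] is the index of the document that token t belongs to.
    doc_ids = np.repeat(np.arange(len(doc_lengths)), doc_lengths)
    positions = np.arange(doc_ids.size)
    same_doc = doc_ids[:, None] == doc_ids[None, :]    # blocks cross-contamination
    causal = positions[None, :] <= positions[:, None]  # key not later than query
    return same_doc & causal

# Hypothetical packed row holding two documents of lengths 3 and 2.
mask = packed_causal_mask([3, 2])
print(mask.astype(int))
```

The result is block-diagonal and lower-triangular within each block, so the first token of the second document cannot see any token of the first.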
Where Pith is reading between the lines
- The same communication optimization pattern could be applied to other collective operations beyond attention in distributed training.
- The approach might reduce the need for padding in mixed-length datasets, lowering overall memory use during fine-tuning.
- If the JIT strategy generalizes, it could support dynamic sequence packing schedules that change per training step.
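To give a rough sense of the padding point in the list above, the following sketch counts how many token slots are materialized when a batch is padded to its longest sequence versus when sequences are packed into fixed-capacity rows. It is an illustration under stated assumptions, not anything taken from the paper: the lengths, batch size, and capacity are made up, and first-fit-decreasing is just one simple packing heuristic that assumes every sequence fits in one row.

```python
def padded_tokens(lengths, batch_size):
    """Token slots used when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

def packed_tokens(lengths, capacity):
    """Token slots used by first-fit-decreasing packing into fixed-capacity rows."""
    free_space = []  # remaining free slots in each open row
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(free_space):
            if length <= free:
                free_space[i] -= length
                break
        else:
            # No open row has room, so start a new one.
            free_space.append(capacity - length)
    return len(free_space) * capacity

# Hypothetical sequence lengths; each is assumed to be <= capacity.
lengths = [512, 64, 300, 128, 900, 32, 700, 256]
real = sum(lengths)
padded = padded_tokens(lengths, batch_size=4)
packed = packed_tokens(lengths, capacity=1024)
print(real, padded, packed)
```

Packing only pays off if attention respects document boundaries inside each row, which is the correctness condition the paper's claim depends on.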
Load-bearing premise
The sequence-aware algorithm with JIT compilation can correctly compute causal attention on hybrid-context sequences across device groups without introducing errors or prohibitive communication overhead.
What would settle it
Run the proposed parallel implementation and a single-device reference on identical hybrid-context packed input sequences and check whether the attention output tensors match exactly.
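The check described above can be sketched on one machine. The code below is not HSAP and involves no NCCL or JIT compilation. It compares a plain masked-attention reference against a simulation in which keys and values are split into chunks (standing in for device groups) and the partial results are merged with a running maximum and normalizer, in the style of online softmax. All function names, shapes, and the chunk count are assumptions made for illustration. Because the merge reorders floating-point operations, the comparison uses a small tolerance rather than bitwise equality.

```python
import numpy as np

def reference_attention(q, k, v, doc_ids):
    """Single-device attention with a document-aware causal mask."""
    n, d = q.shape
    pos = np.arange(n)
    allowed = (doc_ids[:, None] == doc_ids[None, :]) & (pos[None, :] <= pos[:, None])
    scores = np.where(allowed, q @ k.T / np.sqrt(d), -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=1, keepdims=True)) @ v

def chunked_attention(q, k, v, doc_ids, num_chunks):
    """Processes key/value chunks one at a time and merges the partial results."""
    n, d = q.shape
    pos = np.arange(n)
    running_max = np.full(n, -np.inf)
    normalizer = np.zeros(n)
    out = np.zeros_like(v)
    for idx in np.array_split(pos, num_chunks):
        allowed = (doc_ids[:, None] == doc_ids[idx][None, :]) & (idx[None, :] <= pos[:, None])
        scores = np.where(allowed, q @ k[idx].T / np.sqrt(d), -np.inf)
        new_max = np.maximum(running_max, scores.max(axis=1))
        # A query with no visible key so far still has max = -inf; use 0 to avoid inf - inf.
        safe_max = np.where(np.isfinite(new_max), new_max, 0.0)
        scale = np.exp(running_max - safe_max)  # rescales earlier partial sums
        weights = np.exp(scores - safe_max[:, None])
        normalizer = normalizer * scale + weights.sum(axis=1)
        out = out * scale[:, None] + weights @ v[idx]
        running_max = new_max
    return out / normalizer[:, None]

# Hypothetical packed input: three documents of lengths 5, 7, and 4.
rng = np.random.default_rng(0)
doc_ids = np.repeat(np.arange(3), [5, 7, 4])
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
ref = reference_attention(q, k, v, doc_ids)
par = chunked_attention(q, k, v, doc_ids, num_chunks=4)
assert np.allclose(ref, par, atol=1e-9)
```

A real test of the paper's claim would replace `chunked_attention` with the distributed implementation and run it on inputs where documents straddle device-group boundaries, as the chunk boundaries do here.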
Abstract
In this paper, we aim to combine the advantages of existing sequence parallelism paradigms and overcome their drawbacks, the most serious of which is the inability to correctly compute causal attention on hybrid-context packed sequences, in a stronger sequence parallelism framework. The practical technique of packing sequences for efficiently pretraining and fine-tuning large language models causes a cross-contamination problem in attention computation, which can be effectively solved when no parallelism is applied along the sequence-length dimension. However, existing sequence parallelism approaches either ignore the scenario of hybrid-context sequences or limit the degree of parallelism in order to support it. To this end, we propose an efficient Sequence-Aware Parallelism algorithm to conquer the obstacles of intensive tensor transmission and partial attention computation across multiple device groups. Our algorithm utilizes JIT (Just-In-Time) compilation to optimize the communication strategy of all device groups at the NCCL level. Further, we integrate existing sequence parallelism paradigms into a Hierarchical Sequence-Aware Parallelism framework which benefits from our sequence-aware algorithm. We additionally elaborate on the memory and communication overhead management of the hierarchical framework to optimize its performance. Through multiple experiments, we demonstrate that our proposed approach outperform other state-of-the-arts sequence parallelism approches in multiple metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HSAP, a Hierarchical Sequence-aware Parallelism framework for hybrid-context generative models. It introduces a Sequence-Aware Parallelism algorithm that uses JIT compilation to optimize communication in NCCL for correctly computing causal attention on packed hybrid-context sequences across device groups, addressing cross-contamination issues in sequence parallelism. The framework integrates existing paradigms hierarchically, manages memory and communication overheads, and claims superior performance over state-of-the-art approaches in multiple experiments and metrics.
Significance. If the correctness of causal attention computation is verified and the performance gains hold, this work could significantly advance sequence parallelism techniques for efficient pretraining and fine-tuning of large language models by allowing higher parallelism degrees without sacrificing correctness on packed sequences. The use of JIT at NCCL level for sequence-aware strategies is a potentially novel optimization if demonstrated.
major comments (2)
- Abstract: the claim that the Sequence-Aware Parallelism algorithm 'conquers the obstacles of ... partial attention computation across multiple device groups' is unsupported by any equations, pseudocode, mask construction, or partial-attention formula demonstrating how causality is preserved when attention spans device groups on hybrid-context sequences.
- Abstract: the assertion that 'Through multiple experiments, we demonstrate that our proposed approach outperform other state-of-the-arts sequence parallelism approches in multiple metrics' supplies no experimental setup, baselines, datasets, quantitative results, or error analysis, so the central outperformance claim cannot be assessed.
minor comments (2)
- Abstract: grammatical issues include 'outperform' (should be 'outperforms'), 'approches' (should be 'approaches'), and 'state-of-the-arts' (should be 'state-of-the-art').
- Abstract: the description of the hierarchical integration and overhead management is high-level and would benefit from additional concrete details on implementation even if not load-bearing for the main claim.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. We address each major comment below. The abstract is intentionally concise, but we agree it can be strengthened with explicit references to the supporting material in the body of the paper.
- Referee: Abstract: the claim that the Sequence-Aware Parallelism algorithm 'conquers the obstacles of ... partial attention computation across multiple device groups' is unsupported by any equations, pseudocode, mask construction, or partial-attention formula demonstrating how causality is preserved when attention spans device groups on hybrid-context sequences.
Authors: The abstract summarizes the contribution at a high level. The full description of the Sequence-Aware Parallelism algorithm, including the equations for partial attention, the mask construction for preserving causality across device groups, the pseudocode, and the JIT-optimized NCCL communication strategy, appears in Section 3. We will revise the abstract to include a brief parenthetical reference to Section 3 so that the claim is explicitly tied to the supporting technical material.
Revision: yes
- Referee: Abstract: the assertion that 'Through multiple experiments, we demonstrate that our proposed approach outperform other state-of-the-arts sequence parallelism approches in multiple metrics' supplies no experimental setup, baselines, datasets, quantitative results, or error analysis, so the central outperformance claim cannot be assessed.
Authors: The abstract again serves as a summary. Complete experimental details (setup, baselines including the compared sequence-parallelism methods, datasets, quantitative tables with metrics and speedups, and error analysis) are presented in Section 5. We will revise the abstract to incorporate one or two key quantitative results (e.g., relative throughput or memory savings) together with a reference to Section 5, making the outperformance claim directly assessable from the abstract.
Revision: yes
Circularity Check
No circularity: algorithmic proposal with no derivation chain or fitted quantities
The paper presents HSAP as a new algorithmic framework for sequence parallelism, describing its design, JIT-based communication optimization, hierarchical integration, and experimental outperformance. No equations, parameters fitted to data subsets, or self-citation chains appear in the provided text that reduce any claimed result to its own inputs by construction. The central claims concern the correctness and efficiency of the proposed method rather than a mathematical derivation, making the work self-contained against external benchmarks.