Loongtrain: Efficient training of long-sequence llms with head-context parallelism

Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li · 2024 · arXiv 2406.18485

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

HCMS: Head-Chunked Multi-Stream Pipeline for Communication-Computation Overlap in Long-Sequence Parallel Attention

cs.DC · 2026-07-02 · unverdicted · novelty 7.0

HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training

cs.DC · 2026-06-07 · unverdicted · novelty 6.0

FlashCP introduces Whole-Doc sharding, sharding-aware KV communication, and a heuristic for mixed sharding plans, claiming up to 1.63x speedup over prior CP methods for LLM training.

Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain

cs.DC · 2026-05-27 · unverdicted · novelty 6.0

Entrain reduces microbatch workload variability by up to 10.6x and improves multimodal LLM training throughput by 1.4x via static model parallelism and deferred hierarchical microbatch assignment.

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

cs.DC · 2026-05-08 · unverdicted · novelty 6.0

HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.

CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

cs.DC · 2026-04-16 · unverdicted · novelty 6.0

CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

cs.CL · 2025-10-21 · conditional · novelty 6.0

MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.

MAGI-1: Autoregressive Video Generation at Scale

cs.CV · 2025-05-19 · unverdicted · novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

cs.DC · 2026-01-28

citing papers explorer

Showing 9 of 9 citing papers.

HCMS: Head-Chunked Multi-Stream Pipeline for Communication-Computation Overlap in Long-Sequence Parallel Attention cs.DC · 2026-07-02 · unverdicted · none · ref 37
HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 21
LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training cs.DC · 2026-06-07 · unverdicted · none · ref 7
FlashCP introduces Whole-Doc sharding, sharding-aware KV communication, and a heuristic for mixed sharding plans, claiming up to 1.63x speedup over prior CP methods for LLM training.
Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain cs.DC · 2026-05-27 · unverdicted · none · ref 13
Entrain reduces microbatch workload variability by up to 10.6x and improves multimodal LLM training throughput by 1.4x via static model parallelism and deferred hierarchical microbatch assignment.
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware cs.DC · 2026-05-08 · unverdicted · none · ref 13
HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism cs.DC · 2026-04-16 · unverdicted · none · ref 57
CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training cs.CL · 2025-10-21 · conditional · none · ref 30
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
MAGI-1: Autoregressive Video Generation at Scale cs.CV · 2025-05-19 · unverdicted · none · ref 13
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap cs.DC · 2026-01-28 · unreviewed · ref 12

Loongtrain: Efficient training of long-sequence llms with head-context parallelism

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer