LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
9 Pith papers cite this work.
citing papers explorer
- HCMS: Head-Chunked Multi-Stream Pipeline for Communication-Computation Overlap in Long-Sequence Parallel Attention
  HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
- LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
  LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.
- FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training
  FlashCP introduces Whole-Doc sharding, sharding-aware KV communication, and a heuristic for mixed sharding plans, claiming up to 1.63x speedup over prior CP methods for LLM training.
- Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain
  Entrain reduces microbatch workload variability by up to 10.6x and improves multimodal LLM training throughput by 1.4x via static model parallelism and deferred hierarchical microbatch assignment.
- HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
  HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
- CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
  CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
- MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
  MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap