HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
hub
Striped attention: Faster ring attention for causal transformers
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
A CPU-GPU hybrid design with stream-loading prefill, expert parallelism, and disaggregation achieves cloud SLOs for local MoE inference on dual-socket CPUs and consumer GPUs.
HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
Arachne orchestrates cascades for distributed T2V training and reports up to 65% lower iteration time with improving gains at larger scales compared to static bucketing approaches.
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
A decentralized prefix-cache-aware routing scheme for P2P LLM serving improves simulated latency under low-delay skewed workloads but is limited by network latency and hotspots.
Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.
citing papers explorer
-
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
-
InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training
InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.