Striped attention: Faster ring attention for causal transformers
14 Pith papers cite this work. Polarity classification is still indexing.
roles: background 4
polarities: background 4
representative citing papers
citing papers explorer
- HCMS: Head-Chunked Multi-Stream Pipeline for Communication-Computation Overlap in Long-Sequence Parallel Attention
  HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
  FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design
  A CPU-GPU hybrid design with stream-loading prefill, expert parallelism, and disaggregation achieves cloud SLOs for local MoE inference on dual-socket CPUs and consumer GPUs.
- HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
  HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
- CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
  CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
- MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
  MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
- InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training
  InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.
- Gated Linear Attention Transformers with Hardware-Efficient Training
  Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
- Arachne: Orchestrating Cascades for Efficient Text-to-Video Model Training
  Arachne orchestrates cascades for distributed T2V training and reports up to 65% lower iteration time with improving gains at larger scales compared to static bucketing approaches.
- Kaczmarz Linear Attention
  Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
- Towards Distributed Inference of LLMs on a P2P Network
  A decentralized prefix-cache-aware routing scheme for P2P LLM serving improves simulated latency under low-delay skewed workloads but is limited by network latency and hotspots.
- World Model on Million-Length Video And Language With Blockwise RingAttention
  Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
- Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
  The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.