pith. sign in

hub Mixed citations

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Mixed citation behavior. Most common role is background (56%).

47 Pith papers citing it
Background 56% of classified citations
abstract

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size and pipeline parallelism for model depth or layers. These widely studied forms of parallelism are not targeted or optimized for long sequence Transformer models. Given practical application needs for long sequence LLM, renewed attentions are being drawn to sequence parallelism. However, existing works in sequence parallelism are constrained by memory-communication inefficiency, limiting their scalability to long sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence length. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that whereas other methods incur communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and compute devices are increased proportionally. Furthermore, experimental evaluations show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.

hub tools

citation-role summary

background 10 method 8

citation-polarity summary

representative citing papers

LVSA: Training-Free Sparse Attention for Long Video Diffusion

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

LVSA is a training-free block-sparse attention technique combining structured windows with rotating global anchors that reduces inference compute 2.98-3.33x on video diffusion models at extended horizons while remaining quality-neutral or positive.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.

Geometric Context Transformer for Streaming 3D Reconstruction

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.

LPM 1.0: Video-based Character Performance Model

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.

Kling-Omni Technical Report

cs.CV · 2025-12-18 · unverdicted · novelty 6.0

Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.

citing papers explorer

Showing 47 of 47 citing papers.