MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. arXiv preprint arXiv:2402.15627 (2024)
3 Pith papers cite this work.
Citing papers explorer
- MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
  MegaScale-Omni delivers 1.27x to 7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing (a toy balancing sketch follows this list).
- HybridFlow: A Flexible and Efficient RLHF Framework
  HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems (see the orchestration sketch below).
- TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
  TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87x throughput gains on GPT and Qwen models with near-lossless accuracy (see the FP8 round-trip sketch below).
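
To make the "adaptive workload balancing" in the MegaScale-Omni summary concrete, here is a minimal sketch of one standard approach to the problem it names: variable-cost multimodal samples are assigned to encoder ranks so that no rank becomes a straggler. This is a generic longest-processing-time heuristic, not the paper's algorithm; all names are invented for illustration.

```python
# Hypothetical illustration only: a greedy longest-processing-time (LPT)
# assignment of variable-cost multimodal samples to encoder ranks.
# Function and variable names are invented for this sketch.
import heapq

def balance_samples(sample_costs: list[float], num_encoder_ranks: int) -> list[list[int]]:
    """Assign sample indices to encoder ranks, keeping per-rank cost even."""
    # Min-heap of (accumulated_cost, rank): always feed the least-loaded rank.
    heap = [(0.0, rank) for rank in range(num_encoder_ranks)]
    heapq.heapify(heap)
    assignment: list[list[int]] = [[] for _ in range(num_encoder_ranks)]
    # Place the most expensive samples first (classic LPT heuristic).
    for idx in sorted(range(len(sample_costs)), key=lambda i: -sample_costs[i]):
        cost, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (cost + sample_costs[idx], rank))
    return assignment

# Per-sample cost could be, e.g., the number of vision tokens to encode.
print(balance_samples([512, 64, 2048, 128, 1024, 256, 896, 768], num_encoder_ranks=4))
```

Greedy LPT keeps the maximum per-rank load close to optimal, which matters when dynamic multimodal batches mix short text-only samples with long video samples.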
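The HybridFlow summary's "single- and multi-controller paradigms" can likewise be illustrated with a toy loop: one driver script expresses the whole RLHF dataflow, while each call fans out to a model's own group of parallel workers. Every class and method name below is a placeholder, not the HybridFlow/verl API.

```python
# Toy single-controller RLHF step; WorkerGroup stands in for a set of SPMD
# workers hosting one model. All names are placeholders for this sketch.
from dataclasses import dataclass

@dataclass
class Batch:
    prompts: list[str]
    responses: list[str] | None = None
    rewards: list[float] | None = None

class WorkerGroup:
    def __init__(self, name: str):
        self.name = name

    def generate(self, batch: Batch) -> Batch:
        # Rollout phase: would dispatch to an inference engine on the group.
        batch.responses = [p + " <response>" for p in batch.prompts]
        return batch

    def score(self, batch: Batch) -> Batch:
        # Reward phase: would run the reward model forward pass.
        batch.rewards = [float(len(r)) for r in batch.responses]
        return batch

    def update(self, batch: Batch) -> None:
        # Training phase: per the summary, a 3D-HybridEngine reshards actor
        # weights between generation and training layouts before this step.
        print(f"[{self.name}] update on {len(batch.prompts)} samples")

def rlhf_step(actor: WorkerGroup, reward: WorkerGroup, batch: Batch) -> None:
    # The single controller states the dataflow in a few lines, while each
    # call runs under that model's own multi-controller (SPMD) runtime.
    actor.update(reward.score(actor.generate(batch)))

rlhf_step(WorkerGroup("actor"), WorkerGroup("reward"), Batch(prompts=["q1", "q2"]))
```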
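Finally, the basic building block behind the TACO summary, FP8 compression of intermediate tensors, can be shown as a quantize/dequantize round trip. This sketch assumes a PyTorch version with float8 dtypes (2.1+) and uses a generic per-tensor abs-max scale; TACO's adaptive scheme and fused kernels are not reproduced here.

```python
# Per-tensor abs-max scaled FP8 (e4m3) quantize/dequantize round trip.
# The scale rule is a common generic choice, not necessarily TACO's.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def fp8_compress(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize with a per-tensor scale derived from the tensor's abs-max."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(4, 8) * 3.0
q, s = fp8_compress(x)  # in TP training, q is what would go on the wire
print("max abs error:", (x - fp8_decompress(q, s)).abs().max().item())
```

In a real tensor-parallel setup the compressed tensor, not the full-precision one, would travel through the cross-rank collective, halving communication volume versus BF16; verifying the round-trip error, as above, is how "near-lossless" claims are typically sanity-checked.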