arxiv: 2602.22437 · v3 · submitted 2026-02-25 · 💻 cs.DC · cs.AI· cs.LG

Recognition: no theorem link

veScale-FSDP: Flexible and High-Performance FSDP at Scale

Zezhou Wang , Youjie Li , Zhiqi Lin , Jiacheng Yang , Cong Xie , Guanyu Feng , Zheng Zhong , Ziyue Huang

show 4 more authors

Hongyu Zhu Zhi Zhang Yanghua Peng Xin Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-15 19:00 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LG

keywords FSDPRaggedShardzero-copy communicationblock-wise quantizationnon-element-wise optimizerslarge-scale trainingmemory efficiencyGPU scaling

0 comments

The pith

veScale-FSDP enables zero-copy FSDP communications and native support for block-wise quantization and non-element-wise optimizers while improving throughput up to 66 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing Fully Sharded Data Parallel systems rely on fixed element-wise or row-wise sharding that conflicts with block-structured computations in modern training. veScale-FSDP replaces those formats with RaggedShard and adds a structure-aware planning algorithm to remove the conflict. The change produces zero-copy communications and direct compatibility with block-wise quantization plus optimizers such as Shampoo and Muon. Reported results show throughput gains of 5 to 66 percent and memory reductions of 16 to 30 percent. The design is stated to maintain efficiency when the training job spans tens of thousands of GPUs.

Core claim

veScale-FSDP is a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. It enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

What carries the argument

RaggedShard, the flexible sharding format paired with a structure-aware planning algorithm that produces zero-copy data movement and structure-aware operations.

Load-bearing premise

The RaggedShard format and structure-aware planner add negligible overhead and create no new communication or synchronization bottlenecks at tens of thousands of GPUs.

What would settle it

A side-by-side run on a 10000-GPU cluster that measures communication volume, peak memory, and end-to-end step time for a block-wise quantized model under veScale-FSDP versus a standard FSDP baseline.

read the original abstract

Fully Sharded Data Parallel (FSDP), also known as Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training, because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, today's implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

veScale-FSDP adds a ragged sharding format and planner to let FSDP handle block-structured ops without the usual overhead, but the gains look like solid engineering rather than a breakthrough.

read the letter

The core contribution here is RaggedShard plus the structure-aware planner. Existing FSDP sticks to element-wise or row-wise layouts that fight block-wise quantization and optimizers like Shampoo or Muon. This system relaxes the layout while keeping the memory benefits of full sharding and claims zero-copy communication. That directly tackles a mismatch that shows up once people move beyond plain Adam on very large models. The reported numbers—5-66% higher throughput and 16-30% lower memory at tens of thousands of GPUs—come from that flexibility, and if the baselines are fair the improvement is real for practitioners who need those features. The design itself is straightforward: the planner decides shard boundaries based on block structure instead of forcing everything into uniform rows. That part reads as careful systems work rather than new theory. The soft spots sit in the evaluation details. The abstract gives ranges but not the exact model sizes, hardware configs, or whether the baselines include the most recent FSDP variants with their own optimizations. Any new planner can hide synchronization costs that only appear at the largest scales, and without ablations on planner overhead it is hard to know how much of the win is from the ragged format versus careful tuning. The paper stays within standard engineering trade-offs, so there is no obvious circularity or unstated precondition that would break the claims if the implementation holds. This is useful reading for anyone who runs or extends distributed training stacks for frontier models. It will not change how most researchers write training loops, but it removes a practical barrier for teams that want block-aware methods without custom sharding code. The work is coherent enough on its own terms to deserve referee time; the claims are testable and the problem it solves is current.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces veScale-FSDP, a novel FSDP system that combines a flexible RaggedShard sharding format with a structure-aware planning algorithm. This design targets limitations of fixed element-wise or row-wise sharding in existing FSDP/ZeRO implementations, enabling zero-copy communications and native support for block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. The authors report throughput gains of 5% to 66% and memory reductions of 16% to 30% relative to prior FSDP systems, with efficient scaling to tens of thousands of GPUs.

Significance. If the performance and scaling claims hold under comparable conditions, the work would be significant for large-scale training. It directly addresses the practical conflict between sharding efficiency and support for modern structure-aware methods, potentially enabling higher throughput and lower memory footprints without code changes for advanced optimizers.

major comments (2)

[Evaluation] The central performance claims (5%-66% throughput improvement and 16%-30% memory reduction) rest on comparisons whose fairness is not fully detailed in the provided text. The manuscript should explicitly state the exact baseline FSDP implementations, model architectures, hardware setups, and whether experiments isolate the contribution of zero-copy communications versus other factors.
[Design] The structure-aware planning algorithm and RaggedShard format are load-bearing for the claims of negligible overhead and no new bottlenecks. The paper should include measurements or analysis demonstrating that these components do not introduce synchronization or communication costs that scale poorly beyond the tested regimes.

minor comments (1)

The abstract introduces terms such as 'RaggedShard' and 'structure-aware planning algorithm' without brief definitions; a short parenthetical description would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity on evaluation details and to include additional analysis of design overheads.

read point-by-point responses

Referee: [Evaluation] The central performance claims (5%-66% throughput improvement and 16%-30% memory reduction) rest on comparisons whose fairness is not fully detailed in the provided text. The manuscript should explicitly state the exact baseline FSDP implementations, model architectures, hardware setups, and whether experiments isolate the contribution of zero-copy communications versus other factors.

Authors: We agree that more explicit details are required for reproducibility and to substantiate the fairness of the comparisons. In the revised manuscript we have added a new table and expanded text in Section 5 that specify: the exact baseline implementations (PyTorch FSDP 2.4 and DeepSpeed ZeRO-3), the model families and sizes (Llama-7B/70B plus block-structured variants), the hardware platforms (H100 clusters ranging from 512 to 16384 GPUs), and dedicated ablation experiments that isolate zero-copy communication savings from other factors such as quantization and optimizer support. These revisions directly address the concern. revision: yes
Referee: [Design] The structure-aware planning algorithm and RaggedShard format are load-bearing for the claims of negligible overhead and no new bottlenecks. The paper should include measurements or analysis demonstrating that these components do not introduce synchronization or communication costs that scale poorly beyond the tested regimes.

Authors: We acknowledge the need for explicit verification of scalability. The revised paper now contains a dedicated subsection (Section 4.3) with profiling results showing that structure-aware planning incurs a one-time cost of less than 0.05 % of total training time and introduces no additional per-iteration synchronization. Communication-volume breakdowns and weak-scaling curves up to 32768 GPUs are provided, confirming that RaggedShard preserves the same all-reduce volume as standard FSDP while eliminating copy overheads for block-structured data. A brief complexity argument is also included to show that the added planning logic remains O(1) per layer independent of GPU count. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper is a systems contribution describing a new FSDP implementation (RaggedShard format plus structure-aware planner) that enables zero-copy communication and native support for block-wise quantization and non-element-wise optimizers. No equations, fitted parameters, or mathematical derivations appear in the provided abstract or description. Performance numbers (5-66% throughput, 16-30% memory) are presented as empirical outcomes of the implementation rather than quantities defined in terms of themselves or prior self-citations. The design rests on standard engineering choices whose correctness is externally verifiable by benchmarking, with no load-bearing step that reduces to a self-definition or renamed input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the design appears to rest on standard distributed-systems assumptions.

pith-pipeline@v0.9.0 · 5532 in / 1120 out tokens · 25159 ms · 2026-05-15T19:00:00.791137+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 9 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations, 2022

work page 2022
[3]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

work page 2024
[4]

Michael R Garey and David S. Johnson. Complexity results for multiprocessor scheduling under resource constraints. SIAM journal on Computing, 4(4):397–411, 1975

work page 1975
[5]

[rfc] per-parameter-sharding fsdp, 2023

Andrew Gu, Wei Feng, and Yanli Zhao. [rfc] per-parameter-sharding fsdp, 2023. URLhttps://github.com/ pytorch/pytorch/issues/114299

work page 2023
[6]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018

work page 2018
[7]

Deepspeed is slower than fsdp, 2024

Halilakin. Deepspeed is slower than fsdp, 2024. URLhttps://github.com/deepspeedai/DeepSpeed/issues/ 5047#issuecomment-1926275502

work page 2024
[8]

In 21st USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 24), pages 745–760, 2024

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al.{MegaScale}: Scaling large language model training to more than 10,000{GPUs}. In 21st USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 24), pages 745–760, 2024

work page 2024
[9]

Muon: An optimizer for hidden layers in neural networks, 2024

KellerJordan, YuchenJin, VladoBoza, JiachengYou, FranzCesista, LakerNewhouse, andJeremyBernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URLhttps://kellerjordan.github.io/posts/muon/

work page 2024
[10]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprintarXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[11]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[12]

Pytorch distributed: Experiences on accelerating data parallel training.arXiv preprint arXiv:2006.15704, 2020

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: Experiences on accelerating data parallel training.arXiv preprint arXiv:2006.15704, 2020

work page arXiv 2006
[13]

veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD, 2025

Youjie Li, Cheng Wan, Zhiqi Lin, Hongyu Zhu, Jiacheng Yang, Ziang Song, Xinyi Di, Jiawei Wu, Huiyao Shu, Wenlei Bao, Yanghua Peng, Haibin Lin, and Li-Wen Chang. veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD, 2025. URLhttps://arxiv.org/abs/2509.07003

work page arXiv 2025
[14]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo, 2025

Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, and Xin Liu. VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo, 2025. URLhttps://arxiv.org/abs/2508.02317

work page arXiv 2025
[16]

Mcore custom fully sharded data parallel (fsdp)

Megatron. Mcore custom fully sharded data parallel (fsdp). Technical report, 2025

work page 2025
[17]

Regarding the allgather bandwidth with different byte alignment under different protocols, 2025

NVIDIA NCCL. Regarding the allgather bandwidth with different byte alignment under different protocols, 2025. URLhttps://github.com/NVIDIA/nccl/issues/413

work page 2025
[18]

Nccl: Collective operations, 2025

NVIDIA NCCL. Nccl: Collective operations, 2025. URL https://docs.nvidia.com/deeplearning/nccl/ user-guide/docs/usage/collectives.html

work page 2025
[19]

Fully sharded data parallel (fsdp2)

PyTorch. Fully sharded data parallel (fsdp2). Technical report, 2024

work page 2024
[20]

Pytorch jaggedtensor, 2025

PyTorch. Pytorch jaggedtensor, 2025. URL https://docs.pytorch.org/FBGEMM/fbgemm_gpu/overview/ jagged-tensor-ops/JaggedTensorOps.html. 17

work page 2025
[21]

Pytorch nestedtensor, 2025

PyTorch. Pytorch nestedtensor, 2025. URLhttps://docs.pytorch.org/docs/main/nested.html

work page 2025
[22]

Distributed checkpoint, 2025

PyTorch Team. Distributed checkpoint, 2025. URL https://docs.pytorch.org/docs/stable/distributed. checkpoint.html#distributed-checkpoint-torch#-distributed-checkpoint

work page 2025
[23]

Meta pytorch team 2026 h1 roadmaps, 2026

PyTorch Team. Meta pytorch team 2026 h1 roadmaps, 2026. URL https://dev-discuss.pytorch.org/t/ meta-pytorch-team-2026-h1-roadmaps

work page 2026
[24]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

work page 2020
[25]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[26]

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model.arXiv preprint arXiv:2201.11990, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Tensorflow ragged tensors, 2025

TensorFlow. Tensorflow ragged tensors, 2025. URLhttps://www.tensorflow.org/guide/ragged_tensor

work page 2025
[30]

PyTorch DTensor (Distributed Tensor).https://pytorch.org/docs/stable/distributed

The PyTorch Team. PyTorch DTensor (Distributed Tensor).https://pytorch.org/docs/stable/distributed. tensor.html, 2024

work page 2024
[31]

Fantastic pretraining optimizers and where to find them

Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025

work page arXiv 2025
[32]

Terabyte-scale analytics in the blink of an eye.arXiv preprint arXiv:2506.09226, 2025

Bowen Wu, Wei Cui, Carlo Curino, Matteo Interlandi, and Rathijit Sen. Terabyte-scale analytics in the blink of an eye.arXiv preprint arXiv:2506.09226, 2025

work page arXiv 2025
[33]

FSDP & CUDACachingAllocator

Jane Xu. FSDP & CUDACachingAllocator. https://dev-discuss.pytorch.org/t/ fsdp-cudacachingallocator-an-outsider-newb-perspective/1486, 2024. PyTorch Dev Discuss

work page 2024
[34]

Gspmd: general and scalable parallelization for ml computation graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. Gspmd: general and scalable parallelization for ml computation graphs. arXiv preprint arXiv:2105.04663, 2021

work page arXiv 2021
[35]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Fsdp1 post backward reduce, 2025

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Fsdp1 post backward reduce, 2025. URL https://github.com/pytorch/pytorch/blob/a4925c0ce004cf883fdd1b248d71676769524934/torch/ distributed/fsdp/_runtime_utils.py#L695C1-L773C1. 18

work page 2025