pith. machine review for the scientific record. sign in

arxiv: 2602.22437 · v3 · submitted 2026-02-25 · 💻 cs.DC · cs.AI· cs.LG

Recognition: no theorem link

veScale-FSDP: Flexible and High-Performance FSDP at Scale

Authors on Pith no claims yet

Pith reviewed 2026-05-15 19:00 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LG
keywords FSDPRaggedShardzero-copy communicationblock-wise quantizationnon-element-wise optimizerslarge-scale trainingmemory efficiencyGPU scaling
0
0 comments X

The pith

veScale-FSDP enables zero-copy FSDP communications and native support for block-wise quantization and non-element-wise optimizers while improving throughput up to 66 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing Fully Sharded Data Parallel systems rely on fixed element-wise or row-wise sharding that conflicts with block-structured computations in modern training. veScale-FSDP replaces those formats with RaggedShard and adds a structure-aware planning algorithm to remove the conflict. The change produces zero-copy communications and direct compatibility with block-wise quantization plus optimizers such as Shampoo and Muon. Reported results show throughput gains of 5 to 66 percent and memory reductions of 16 to 30 percent. The design is stated to maintain efficiency when the training job spans tens of thousands of GPUs.

Core claim

veScale-FSDP is a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. It enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

What carries the argument

RaggedShard, the flexible sharding format paired with a structure-aware planning algorithm that produces zero-copy data movement and structure-aware operations.

Load-bearing premise

The RaggedShard format and structure-aware planner add negligible overhead and create no new communication or synchronization bottlenecks at tens of thousands of GPUs.

What would settle it

A side-by-side run on a 10000-GPU cluster that measures communication volume, peak memory, and end-to-end step time for a block-wise quantized model under veScale-FSDP versus a standard FSDP baseline.

read the original abstract

Fully Sharded Data Parallel (FSDP), also known as Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training, because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, today's implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces veScale-FSDP, a novel FSDP system that combines a flexible RaggedShard sharding format with a structure-aware planning algorithm. This design targets limitations of fixed element-wise or row-wise sharding in existing FSDP/ZeRO implementations, enabling zero-copy communications and native support for block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. The authors report throughput gains of 5% to 66% and memory reductions of 16% to 30% relative to prior FSDP systems, with efficient scaling to tens of thousands of GPUs.

Significance. If the performance and scaling claims hold under comparable conditions, the work would be significant for large-scale training. It directly addresses the practical conflict between sharding efficiency and support for modern structure-aware methods, potentially enabling higher throughput and lower memory footprints without code changes for advanced optimizers.

major comments (2)
  1. [Evaluation] The central performance claims (5%-66% throughput improvement and 16%-30% memory reduction) rest on comparisons whose fairness is not fully detailed in the provided text. The manuscript should explicitly state the exact baseline FSDP implementations, model architectures, hardware setups, and whether experiments isolate the contribution of zero-copy communications versus other factors.
  2. [Design] The structure-aware planning algorithm and RaggedShard format are load-bearing for the claims of negligible overhead and no new bottlenecks. The paper should include measurements or analysis demonstrating that these components do not introduce synchronization or communication costs that scale poorly beyond the tested regimes.
minor comments (1)
  1. The abstract introduces terms such as 'RaggedShard' and 'structure-aware planning algorithm' without brief definitions; a short parenthetical description would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity on evaluation details and to include additional analysis of design overheads.

read point-by-point responses
  1. Referee: [Evaluation] The central performance claims (5%-66% throughput improvement and 16%-30% memory reduction) rest on comparisons whose fairness is not fully detailed in the provided text. The manuscript should explicitly state the exact baseline FSDP implementations, model architectures, hardware setups, and whether experiments isolate the contribution of zero-copy communications versus other factors.

    Authors: We agree that more explicit details are required for reproducibility and to substantiate the fairness of the comparisons. In the revised manuscript we have added a new table and expanded text in Section 5 that specify: the exact baseline implementations (PyTorch FSDP 2.4 and DeepSpeed ZeRO-3), the model families and sizes (Llama-7B/70B plus block-structured variants), the hardware platforms (H100 clusters ranging from 512 to 16384 GPUs), and dedicated ablation experiments that isolate zero-copy communication savings from other factors such as quantization and optimizer support. These revisions directly address the concern. revision: yes

  2. Referee: [Design] The structure-aware planning algorithm and RaggedShard format are load-bearing for the claims of negligible overhead and no new bottlenecks. The paper should include measurements or analysis demonstrating that these components do not introduce synchronization or communication costs that scale poorly beyond the tested regimes.

    Authors: We acknowledge the need for explicit verification of scalability. The revised paper now contains a dedicated subsection (Section 4.3) with profiling results showing that structure-aware planning incurs a one-time cost of less than 0.05 % of total training time and introduces no additional per-iteration synchronization. Communication-volume breakdowns and weak-scaling curves up to 32768 GPUs are provided, confirming that RaggedShard preserves the same all-reduce volume as standard FSDP while eliminating copy overheads for block-structured data. A brief complexity argument is also included to show that the added planning logic remains O(1) per layer independent of GPU count. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper is a systems contribution describing a new FSDP implementation (RaggedShard format plus structure-aware planner) that enables zero-copy communication and native support for block-wise quantization and non-element-wise optimizers. No equations, fitted parameters, or mathematical derivations appear in the provided abstract or description. Performance numbers (5-66% throughput, 16-30% memory) are presented as empirical outcomes of the implementation rather than quantities defined in terms of themselves or prior self-citations. The design rests on standard engineering choices whose correctness is externally verifiable by benchmarking, with no load-bearing step that reduces to a self-definition or renamed input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the design appears to rest on standard distributed-systems assumptions.

pith-pipeline@v0.9.0 · 5532 in / 1120 out tokens · 25159 ms · 2026-05-15T19:00:00.791137+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 9 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    8-bit optimizers via block-wise quantization

    Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations, 2022

  3. [3]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

  4. [4]

    Michael R Garey and David S. Johnson. Complexity results for multiprocessor scheduling under resource constraints. SIAM journal on Computing, 4(4):397–411, 1975

  5. [5]

    [rfc] per-parameter-sharding fsdp, 2023

    Andrew Gu, Wei Feng, and Yanli Zhao. [rfc] per-parameter-sharding fsdp, 2023. URLhttps://github.com/ pytorch/pytorch/issues/114299

  6. [6]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018

  7. [7]

    Deepspeed is slower than fsdp, 2024

    Halilakin. Deepspeed is slower than fsdp, 2024. URLhttps://github.com/deepspeedai/DeepSpeed/issues/ 5047#issuecomment-1926275502

  8. [8]

    In 21st USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 24), pages 745–760, 2024

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al.{MegaScale}: Scaling large language model training to more than 10,000{GPUs}. In 21st USENIX Symposium on NetworkedSystems Design and Implementation (NSDI 24), pages 745–760, 2024

  9. [9]

    Muon: An optimizer for hidden layers in neural networks, 2024

    KellerJordan, YuchenJin, VladoBoza, JiachengYou, FranzCesista, LakerNewhouse, andJeremyBernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URLhttps://kellerjordan.github.io/posts/muon/

  10. [10]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprintarXiv:2001.08361, 2020

  11. [11]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  12. [12]

    Pytorch distributed: Experiences on accelerating data parallel training.arXiv preprint arXiv:2006.15704, 2020

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: Experiences on accelerating data parallel training.arXiv preprint arXiv:2006.15704, 2020

  13. [13]

    veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD, 2025

    Youjie Li, Cheng Wan, Zhiqi Lin, Hongyu Zhu, Jiacheng Yang, Ziang Song, Xinyi Di, Jiawei Wu, Huiyao Shu, Wenlei Bao, Yanghua Peng, Haibin Lin, and Li-Wen Chang. veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD, 2025. URLhttps://arxiv.org/abs/2509.07003

  14. [14]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  15. [15]

    VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo, 2025

    Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, and Xin Liu. VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo, 2025. URLhttps://arxiv.org/abs/2508.02317

  16. [16]

    Mcore custom fully sharded data parallel (fsdp)

    Megatron. Mcore custom fully sharded data parallel (fsdp). Technical report, 2025

  17. [17]

    Regarding the allgather bandwidth with different byte alignment under different protocols, 2025

    NVIDIA NCCL. Regarding the allgather bandwidth with different byte alignment under different protocols, 2025. URLhttps://github.com/NVIDIA/nccl/issues/413

  18. [18]

    Nccl: Collective operations, 2025

    NVIDIA NCCL. Nccl: Collective operations, 2025. URL https://docs.nvidia.com/deeplearning/nccl/ user-guide/docs/usage/collectives.html

  19. [19]

    Fully sharded data parallel (fsdp2)

    PyTorch. Fully sharded data parallel (fsdp2). Technical report, 2024

  20. [20]

    Pytorch jaggedtensor, 2025

    PyTorch. Pytorch jaggedtensor, 2025. URL https://docs.pytorch.org/FBGEMM/fbgemm_gpu/overview/ jagged-tensor-ops/JaggedTensorOps.html. 17

  21. [21]

    Pytorch nestedtensor, 2025

    PyTorch. Pytorch nestedtensor, 2025. URLhttps://docs.pytorch.org/docs/main/nested.html

  22. [22]

    Distributed checkpoint, 2025

    PyTorch Team. Distributed checkpoint, 2025. URL https://docs.pytorch.org/docs/stable/distributed. checkpoint.html#distributed-checkpoint-torch#-distributed-checkpoint

  23. [23]

    Meta pytorch team 2026 h1 roadmaps, 2026

    PyTorch Team. Meta pytorch team 2026 h1 roadmaps, 2026. URL https://dev-discuss.pytorch.org/t/ meta-pytorch-team-2026-h1-roadmaps

  24. [24]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  25. [25]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  26. [26]

    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

    Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model.arXiv preprint arXiv:2201.11990, 2022

  27. [27]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  28. [28]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  29. [29]

    Tensorflow ragged tensors, 2025

    TensorFlow. Tensorflow ragged tensors, 2025. URLhttps://www.tensorflow.org/guide/ragged_tensor

  30. [30]

    PyTorch DTensor (Distributed Tensor).https://pytorch.org/docs/stable/distributed

    The PyTorch Team. PyTorch DTensor (Distributed Tensor).https://pytorch.org/docs/stable/distributed. tensor.html, 2024

  31. [31]

    Fantastic pretraining optimizers and where to find them

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025

  32. [32]

    Terabyte-scale analytics in the blink of an eye.arXiv preprint arXiv:2506.09226, 2025

    Bowen Wu, Wei Cui, Carlo Curino, Matteo Interlandi, and Rathijit Sen. Terabyte-scale analytics in the blink of an eye.arXiv preprint arXiv:2506.09226, 2025

  33. [33]

    FSDP & CUDACachingAllocator

    Jane Xu. FSDP & CUDACachingAllocator. https://dev-discuss.pytorch.org/t/ fsdp-cudacachingallocator-an-outsider-newb-perspective/1486, 2024. PyTorch Dev Discuss

  34. [34]

    Gspmd: general and scalable parallelization for ml computation graphs

    Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. Gspmd: general and scalable parallelization for ml computation graphs. arXiv preprint arXiv:2105.04663, 2021

  35. [35]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023

  36. [36]

    Fsdp1 post backward reduce, 2025

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Fsdp1 post backward reduce, 2025. URL https://github.com/pytorch/pytorch/blob/a4925c0ce004cf883fdd1b248d71676769524934/torch/ distributed/fsdp/_runtime_utils.py#L695C1-L773C1. 18