arxiv: 2606.03209 · v1 · pith:PH6ZSKSZnew · submitted 2026-06-02 · 💻 cs.LG

DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data

Pith reviewed 2026-06-28 11:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords: decentralized fine-tuning, block-wise Adam, full-parameter fine-tuning, non-IID data, LLM adaptation, resource efficiency, consensus signals, federated optimization

The pith

DECA partitions LLM parameters into blocks for sequential Adam updates to enable efficient decentralized full-parameter fine-tuning on non-IID data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that full-parameter fine-tuning of large language models can be made practical in decentralized environments where data across clients follows non-IID distributions. It does so by splitting parameters into disjoint blocks and running Adam optimization sequentially on each block rather than all at once. Fresh local gradient statistics are combined with discrepancy signals derived from client consensus to keep the process stable. If this holds, collaborative adaptation of billion-parameter models becomes feasible without a central server or the performance limits of parameter-efficient methods.

Core claim

DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals, yielding fast convergence, strong downstream performance, and significant resource efficiency on non-IID data.
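The block-wise idea in the claim above can be illustrated with a minimal single-client sketch in plain Python. It visits disjoint index blocks in sequence and runs standard Adam steps only on the active block. Everything here is an assumption made for illustration: the toy quadratic objective, the function names `loss` and `blockwise_adam`, the hyperparameters, and in particular the choice to reset the moment buffers on each block visit. The page does not state DECA's actual moment-update rule, and the decentralized consensus part is omitted entirely.

```python
import math

def loss(x, target):
    """Toy objective 0.5 * ||x - target||^2 (illustrative stand-in for a training loss)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, target))

def blockwise_adam(x, target, blocks, sweeps=60, inner_steps=5,
                   lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Update x in place, one disjoint block of coordinates at a time."""
    for _ in range(sweeps):
        for idx in blocks:
            # Assumption: moment buffers exist only for the active block and
            # start from zero on every visit, so state for other blocks is never held.
            m = [0.0] * len(idx)
            v = [0.0] * len(idx)
            for t in range(1, inner_steps + 1):
                for k, i in enumerate(idx):
                    g = x[i] - target[i]              # gradient of the toy loss
                    m[k] = b1 * m[k] + (1 - b1) * g   # first-moment estimate
                    v[k] = b2 * v[k] + (1 - b2) * g * g  # second-moment estimate
                    m_hat = m[k] / (1 - b1 ** t)      # bias correction
                    v_hat = v[k] / (1 - b2 ** t)
                    x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

target = [1.0, -2.0, 3.0, 0.5]
x = blockwise_adam([0.0] * 4, target, blocks=[[0, 1], [2, 3]])
print(loss([0.0] * 4, target), loss(x, target))
```

Only the coordinates listed in the active block change during a visit, which is what lets the optimizer state stay proportional to one block rather than to the whole model.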

What carries the argument

Sequential block-wise Adam optimization using first- and second-order moment estimates from local gradients and consensus discrepancy signals.
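The page does not say how the consensus discrepancy signals are computed. As a hedged sketch of one common construction in decentralized optimization, the Python below mixes per-client values with a doubly stochastic matrix on a ring and takes each client's discrepancy to be its value minus its neighborhood average. The ring topology, the uniform 1/3 weights, the scalar per-client values, and the names `ring_weights` and `gossip_round` are all assumptions for illustration, not the paper's protocol.

```python
def ring_weights(n):
    """Mixing matrix for a ring: weight 1/3 on self and on each of the two neighbours."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] += 1.0 / 3.0
    return w

def gossip_round(values, w):
    """One mixing round. Returns (mixed values, per-client discrepancy)."""
    n = len(values)
    mixed = [sum(w[i][j] * values[j] for j in range(n)) for i in range(n)]
    # Assumed definition: discrepancy = local value minus neighbourhood average.
    discrepancy = [values[i] - mixed[i] for i in range(n)]
    return mixed, discrepancy

values = [0.0, 3.0, 6.0, 9.0]        # one scalar per client, e.g. a non-IID local statistic
w = ring_weights(len(values))
mixed, disc = gossip_round(values, w)
for _ in range(50):                   # repeated mixing drives clients toward the global mean
    mixed, _ = gossip_round(mixed, w)
print(mixed, disc)
```

Because the matrix is doubly stochastic, each round preserves the network-wide average while shrinking disagreement between clients. How such a signal would enter the block-wise moment estimates is left open here, since the source does not specify it.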

If this is right

  • Cuts memory and compute demands per client for models with billions of parameters.
  • Reduces vulnerability to client drift through the added consensus signals.
  • Retains the downstream task gains that come from updating every parameter rather than a subset.
  • Supplies theoretical convergence analysis for the resulting decentralized process.
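The memory point in the first bullet can be made concrete with a back-of-envelope calculation. The sketch below counts only Adam's two moment buffers and assumes equal-sized blocks, 4-byte values, and a nominal 8-billion parameter count. These are illustrative assumptions, `adam_state_bytes` is a hypothetical helper, and the figures ignore weights, gradients, and activations, so they are not the paper's measured memory usage.

```python
def adam_state_bytes(num_params, num_blocks=1, bytes_per_value=4):
    """Bytes for Adam's first and second moments when only one of
    `num_blocks` equal-sized blocks holds optimizer state at a time."""
    active = -(-num_params // num_blocks)   # ceiling division: size of the largest block
    return 2 * active * bytes_per_value     # two buffers (m and v) per active parameter

GIB = 1024 ** 3
nominal_params = 8_000_000_000              # assumed round figure for an 8B-parameter model
for d in (1, 2, 4, 8):
    print(d, "blocks:", round(adam_state_bytes(nominal_params, d) / GIB, 1), "GiB")
```

Under these assumptions the optimizer-state footprint scales inversely with the number of blocks, which is the mechanism the bullet appeals to.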

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same block partitioning might apply to other first-order optimizers in decentralized settings.
  • It suggests full-parameter methods could become viable in more constrained federated scenarios than previously assumed.
  • Sequential block handling may interact differently with attention layers versus feed-forward layers in practice.

Load-bearing premise

That updating parameters one block at a time preserves the adaptation capacity of simultaneous full-parameter updates without introducing bias or instability from the update order or non-IID client data.

What would settle it

A direct comparison experiment on a fixed non-IID data split where DECA's final accuracy or convergence speed falls substantially below that of a full decentralized Adam baseline or a centralized full-parameter run.

Figures

Figures reproduced from arXiv: 2606.03209 by Feng Li, Jun Luo, Kai Han, Kai Wang, Shaowei Li, Yunsheng Yuan, Zheng Zhang, Zhongyuan Sun.

Figure 1: Training loss of different algorithms on TFNS dataset using Llama-3.1-8B model.
Figure 3: Training loss of different algorithms on NWGI dataset using Llama-3.1-8B model.
Figure 5: Training loss of different algorithms on NWGI dataset under ring topology.
Figure 7: Training loss of different algorithms with 12 clients on TFNS dataset.
Figure 9: Training loss of DECA under different
Figure 11: Forward and backward latency under different granularities using Llama-3.1-8B. We measure the average latency per single forward and backward pass over the entire training phase.
Original abstract

Fine-tuning large language models (LLMs) in privacy-sensitive and resource-constrained environments remains challenging. Since training data are often distributed across multiple clients, decentralized fine-tuning offers a natural paradigm for collaborative adaptation without a central server. However, enabling full-parameter fine-tuning (FPFT) in this decentralized setting is difficult: FPFT provides strong adaptation capacity but incurs prohibitive resource consumption for billion-scale models. Existing decentralized LLM fine-tuning methods therefore mainly rely on parameter-efficient updates, which improve efficiency but may restrict downstream performance. Moreover, client data are typically non-IID, making decentralized optimization more vulnerable to client drift and unstable convergence. To address these challenges, we propose DECA, a resource-efficient decentralized FPFT framework for LLMs on non-IID data. DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals. We provide rigorous theoretical analysis and extensive experiments, showing that DECA achieves fast convergence, strong downstream performance, and significant resource efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes DECA, a decentralized framework for full-parameter fine-tuning of LLMs on non-IID client data. Parameters are partitioned into disjoint blocks for sequential block-wise Adam updates; first- and second-order moment estimates are formed from fresh local gradient statistics together with consensus-derived discrepancy signals. The paper asserts that this yields resource-efficient FPFT while preserving adaptation capacity, supported by a rigorous theoretical analysis and extensive experiments that demonstrate fast convergence, strong downstream performance, and significant resource savings relative to existing decentralized methods.

Significance. If the central claims hold, DECA would constitute a meaningful advance in decentralized LLM fine-tuning by enabling full-parameter updates at lower per-client resource cost than standard Adam while mitigating client drift on non-IID distributions. This could narrow the performance gap between parameter-efficient and full-parameter decentralized approaches in privacy-sensitive settings.

minor comments (2)
  1. The abstract states that 'rigorous theoretical analysis' is provided, yet no key assumptions, convergence rates, or bounds are sketched; this omission makes it impossible to evaluate whether the analysis directly supports the claimed stability on non-IID data.
  2. The description of 'consensus-derived discrepancy signals' is introduced without reference to the precise consensus protocol or how the signals are computed from local and global statistics; clarification of this mechanism would aid reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and positive summary of DECA. The report lists no specific major comments, so we provide no point-by-point responses below. We remain available to address any additional questions or clarifications the referee may have.

Circularity Check

0 steps flagged

No circularity; derivation self-contained

full rationale

The provided abstract and description introduce DECA as a block-wise Adam method using local gradient statistics and consensus signals, with a claimed theoretical analysis. No equations, derivations, or self-citations are exhibited that reduce any prediction or result to a fitted input or self-definition by construction. The central claims rest on the described partitioning and moment estimates without visible reduction to the method's own outputs. This is the normal case of a self-contained proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; all such elements would require the full manuscript.

pith-pipeline@v0.9.1-grok · 5760 in / 1063 out tokens · 19356 ms · 2026-06-28T11:22:09.130131+00:00 · methodology

