DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data

Feng Li; Jun Luo; Kai Han; Kai Wang; Shaowei Li; Yunsheng Yuan; Zheng Zhang; Zhongyuan Sun

arxiv: 2606.03209 · v1 · pith:PH6ZSKSZnew · submitted 2026-06-02 · 💻 cs.LG

Yunsheng Yuan, Shaowei Li, Kai Wang, Zhongyuan Sun, Zheng Zhang, Kai Han, Jun Luo, Feng Li

Pith reviewed 2026-06-28 11:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords decentralized fine-tuning · block-wise Adam · full-parameter fine-tuning · non-IID data · LLM adaptation · resource efficiency · consensus signals · federated optimization

The pith

DECA partitions LLM parameters into blocks for sequential Adam updates to enable efficient decentralized full-parameter fine-tuning on non-IID data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that full-parameter fine-tuning of large language models can be made practical in decentralized environments where data across clients follows non-IID distributions. It does so by splitting parameters into disjoint blocks and running Adam optimization sequentially on each block rather than all at once. Fresh local gradient statistics are combined with discrepancy signals derived from client consensus to keep the process stable. If this holds, collaborative adaptation of billion-parameter models becomes feasible without a central server or the performance limits of parameter-efficient methods.

Core claim

DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals, yielding fast convergence, strong downstream performance, and significant resource efficiency on non-IID data.

What carries the argument

Sequential block-wise Adam optimization using first- and second-order moment estimates from local gradients and consensus discrepancy signals.
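
To make the mechanism concrete, here is a minimal single-client sketch of cycling Adam over disjoint parameter blocks. It is an illustration under stated assumptions, not the paper's algorithm: the toy quadratic objective, the hyperparameters, and the choice to reset the moment buffers whenever a new block becomes active are all assumptions, and the consensus-derived discrepancy signals are left out entirely.

```python
import math

def adam_block_step(block_params, block_grads, m, v, step,
                    lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one bias-corrected Adam update, in place, to one block."""
    for i, g in enumerate(block_grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # first-moment estimate
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        block_params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

def blockwise_adam(x, grad_fn, blocks, sweeps=50, inner_steps=5):
    """Visit the blocks in a fixed cyclic order; only the active block has Adam state."""
    x = list(x)
    for _ in range(sweeps):
        for block in blocks:
            # Assumption of this sketch: moments restart when a block becomes active.
            m = [0.0] * len(block)
            v = [0.0] * len(block)
            for step in range(1, inner_steps + 1):
                # A real implementation would backpropagate only into the active block;
                # the toy gradient is cheap, so it is computed in full and sliced.
                full_grad = grad_fn(x)
                grads = [full_grad[i] for i in block]
                params = [x[i] for i in block]
                adam_block_step(params, grads, m, v, step)
                for i, value in zip(block, params):
                    x[i] = value
    return x

# Hypothetical toy objective: f(x) = sum_i c_i * (x_i - t_i)^2
targets = [1.0, -2.0, 3.0, 0.5]
curvature = [1.0, 4.0, 0.5, 2.0]

def loss(x):
    return sum(c * (xi - t) ** 2 for c, xi, t in zip(curvature, x, targets))

def grad(x):
    return [2 * c * (xi - t) for c, xi, t in zip(curvature, x, targets)]

x0 = [0.0, 0.0, 0.0, 0.0]
x_final = blockwise_adam(x0, grad, blocks=[[0, 1], [2, 3]])
print(loss(x0), loss(x_final))
```

The point of the sketch is the memory pattern: the buffers `m` and `v` are sized to the active block rather than to the whole parameter vector.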

If this is right

Cuts memory and compute demands per client for models with billions of parameters.
Reduces vulnerability to client drift through the added consensus signals.
Retains the downstream task gains that come from updating every parameter rather than a subset.
Supplies theoretical convergence analysis for the resulting decentralized process.
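
As a rough back-of-envelope check on the memory point, the arithmetic below estimates the size of Adam's two moment buffers when only one block holds optimizer state at a time. The round 8-billion parameter count, the fp32 moment precision, and the 32 equal-sized blocks are assumptions chosen for illustration. The estimate ignores weights, gradients, and activations, and it is not a measurement from the paper.

```python
def adam_state_gib(num_params, bytes_per_value=4, num_blocks=1):
    """GiB needed for Adam's first and second moments when only one of
    `num_blocks` equal-sized blocks keeps optimizer state at a time."""
    active_params = num_params / num_blocks
    return 2 * active_params * bytes_per_value / 2**30

# Assumed model size: 8e9 parameters, fp32 moments.
full_state = adam_state_gib(8_000_000_000)                    # all parameters at once
blocked_state = adam_state_gib(8_000_000_000, num_blocks=32)  # one block active

print(f"full Adam state:   {full_state:.2f} GiB")
print(f"one of 32 blocks:  {blocked_state:.2f} GiB")
```

Under these assumptions the moment buffers shrink in direct proportion to the number of blocks, which is the kind of saving the first bullet above refers to.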

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same block partitioning might apply to other first-order optimizers in decentralized settings.
It suggests full-parameter methods could become viable in more constrained federated scenarios than previously assumed.
Sequential block handling may interact differently with attention layers versus feed-forward layers in practice.
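
One way to probe the layer-type question is to control how parameters are grouped into blocks. The sketch below groups parameter names by transformer layer index. The naming pattern (`layers.<n>.`) and the example names are hypothetical, and the paper's actual partitioning rule is not described on this page.

```python
import re
from collections import defaultdict

def partition_by_layer(param_names, layers_per_block=1):
    """Group parameter names into disjoint blocks by layer index.

    Names without a layer index (embeddings, final norm) share one extra block.
    """
    blocks = defaultdict(list)
    for name in param_names:
        match = re.search(r"layers\.(\d+)\.", name)
        if match:
            key = ("layer", int(match.group(1)) // layers_per_block)
        else:
            key = ("other", 0)
        blocks[key].append(name)
    return [blocks[key] for key in sorted(blocks)]

# Hypothetical parameter names, for illustration only.
names = [
    "embed.weight",
    "layers.0.attn.q_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.1.attn.q_proj.weight",
    "layers.1.mlp.up_proj.weight",
    "norm.weight",
]
layer_blocks = partition_by_layer(names)
for block in layer_blocks:
    print(block)
```

Changing the grouping key from the layer index to the module type (attention versus feed-forward) would give the alternative partition that the last bullet hints at.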

Load-bearing premise

That updating parameters one block at a time preserves the adaptation capacity of simultaneous full-parameter updates without introducing bias or instability from the update order or non-IID client data.

What would settle it

A direct comparison experiment on a fixed non-IID data split where DECA's final accuracy or convergence speed falls substantially below that of a full decentralized Adam baseline or a centralized full-parameter run.
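
Such a comparison needs a data split that stays fixed across methods. The sketch below shows one common way to build a reproducible label-skewed split: shuffle with a fixed seed, sort by label, and cut into contiguous shards. This is a generic technique offered as an assumption; the page does not say how the paper constructs its non-IID partitions.

```python
import random

def label_skew_split(labels, num_clients, seed=0):
    """Return one list of sample indices per client, skewed by label.

    Shuffling first and then applying a stable sort by label keeps a random
    order inside each label group while making shards label-homogeneous.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    indices.sort(key=lambda i: labels[i])         # list.sort is stable
    shard_size = -(-len(indices) // num_clients)  # ceiling division
    return [indices[c * shard_size:(c + 1) * shard_size]
            for c in range(num_clients)]

# Hypothetical three-class dataset with six samples per class.
labels = [0] * 6 + [1] * 6 + [2] * 6
split = label_skew_split(labels, num_clients=3, seed=0)
for client, shard in enumerate(split):
    print(client, sorted({labels[i] for i in shard}))
```

Reusing the same seed for every method under test keeps the split identical, so differences in final accuracy can be attributed to the optimizer rather than to the partition.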

Figures

Figures reproduced from arXiv: 2606.03209 by Feng Li, Jun Luo, Kai Han, Kai Wang, Shaowei Li, Yunsheng Yuan, Zheng Zhang, Zhongyuan Sun.

**Figure 1.** Training loss of different algorithms on the TFNS dataset using the Llama-3.1-8B model. Axes: update steps (x), loss (y). Series: Dec-LoRA, Dec-Adapter, DeCAF, DECA.

**Figure 3.** Training loss of different algorithms on the NWGI dataset using the Llama-3.1-8B model. Axes: update steps (x), loss (y). Series: w/o BMA, w/ trivial BMA, DECA.

**Figure 5.** Training loss of different algorithms on the NWGI dataset under ring topology. Axes: update steps (x), loss (y). Series: ER, Ring, Bipartite.

**Figure 7.** Training loss of different algorithms with 12 clients on the TFNS dataset. Axes: update steps (x), loss (y). Series: 8 clients, 12 clients, 16 clients.

**Figure 9.** Training loss of DECA under different

**Figure 11.** Forward and backward latency under different granularities using Llama-3.1-8B. We measure the average latency per single forward and backward pass over the entire training phase.

Original abstract

Fine-tuning large language models (LLMs) in privacy-sensitive and resource-constrained environments remains challenging. Since training data are often distributed across multiple clients, decentralized fine-tuning offers a natural paradigm for collaborative adaptation without a central server. However, enabling full-parameter fine-tuning (FPFT) in this decentralized setting is difficult: FPFT provides strong adaptation capacity but incurs prohibitive resource consumption for billion-scale models. Existing decentralized LLM fine-tuning methods therefore mainly rely on parameter-efficient updates, which improve efficiency but may restrict downstream performance. Moreover, client data are typically non-IID, making decentralized optimization more vulnerable to client drift and unstable convergence. To address these challenges, we propose DECA, a resource-efficient decentralized FPFT framework for LLMs on non-IID data. DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals. We provide rigorous theoretical analysis and extensive experiments, showing that DECA achieves fast convergence, strong downstream performance, and significant resource efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DECA splits parameters into blocks for sequential Adam updates in decentralized LLM fine-tuning and adds local moments plus consensus signals to handle non-IID data, but the value hinges on unshown experiments and theory.

DECA splits LLM parameters into disjoint blocks and runs Adam sequentially on them in a decentralized setup, keeping fresh local first- and second-order moments while using consensus-derived signals to counter client drift on non-IID data. This specific mix for full-parameter decentralized training appears new relative to earlier block-wise or federated Adam work.

The paper does a clear job stating the practical problem: full-parameter fine-tuning gives better adaptation than parameter-efficient methods but costs too much in distributed environments, and non-IID data makes convergence unstable. The framing is direct and the motivation is easy to follow.

The soft spots sit in the central assumption that sequential block updates will preserve full adaptation power without adding bias or instability when client distributions differ. That claim is load-bearing, yet the abstract gives no equations or proof sketches to evaluate it. The experiments are described as extensive and the theory as rigorous, but without the actual baselines, error bars, or convergence plots it is impossible to judge whether the reported speed and resource gains are real or setup-dependent. Minor issues include the lack of visible comparison to recent decentralized optimizers that already use some form of block handling.

This work is mainly for researchers focused on distributed and federated optimization for large models. A reader already working on privacy-preserving LLM training would find the method worth examining if the numbers hold up.

It should go to peer review so the theory and experiments can be checked in detail; the problem is real and the proposal is concrete enough to test.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes DECA, a decentralized framework for full-parameter fine-tuning of LLMs on non-IID client data. Parameters are partitioned into disjoint blocks for sequential block-wise Adam updates; first- and second-order moment estimates are formed from fresh local gradient statistics together with consensus-derived discrepancy signals. The paper asserts that this yields resource-efficient FPFT while preserving adaptation capacity, supported by a rigorous theoretical analysis and extensive experiments that demonstrate fast convergence, strong downstream performance, and significant resource savings relative to existing decentralized methods.

Significance. If the central claims hold, DECA would constitute a meaningful advance in decentralized LLM fine-tuning by enabling full-parameter updates at lower per-client resource cost than standard Adam while mitigating client drift on non-IID distributions. This could narrow the performance gap between parameter-efficient and full-parameter decentralized approaches in privacy-sensitive settings.

minor comments (2)

The abstract states that 'rigorous theoretical analysis' is provided, yet no key assumptions, convergence rates, or bounds are sketched; this omission makes it impossible to evaluate whether the analysis directly supports the claimed stability on non-IID data.
The description of 'consensus-derived discrepancy signals' is introduced without reference to the precise consensus protocol or how the signals are computed from local and global statistics; clarification of this mechanism would aid reproducibility.
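
For orientation on the second comment, the sketch below shows a generic gossip-averaging round on a ring and one hypothetical form a discrepancy signal could take: each client's value minus the weighted average over its neighbourhood. Both the uniform 1/3 mixing weights and this definition of the discrepancy are assumptions made for illustration. The manuscript's actual protocol and signal are exactly what the comment asks the authors to specify.

```python
def ring_weights(n):
    """Doubly stochastic mixing matrix for a ring: each client averages
    itself and its two neighbours with weight 1/3 each."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] += 1.0 / 3.0
    return w

def gossip_round(values, w):
    """One consensus step: every client replaces its value by a weighted
    average over its neighbourhood."""
    n = len(values)
    return [sum(w[i][j] * values[j] for j in range(n)) for i in range(n)]

def discrepancy(values, w):
    """Hypothetical signal: local value minus the neighbourhood average."""
    mixed = gossip_round(values, w)
    return [x - y for x, y in zip(values, mixed)]

w = ring_weights(4)
values = [0.0, 4.0, 8.0, 12.0]   # one scalar per client, standing in for a parameter
initial_gap = discrepancy(values, w)

consensus = list(values)
for _ in range(30):
    consensus = gossip_round(consensus, w)
final_gap = discrepancy(consensus, w)

print(initial_gap)
print(consensus)
```

Because the mixing matrix is doubly stochastic, the client average is preserved at every round, and the discrepancy shrinks toward zero as the clients approach consensus.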

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and positive summary of DECA. The report lists no specific major comments, so we provide no point-by-point responses below. We remain available to address any additional questions or clarifications the referee may have.

Circularity Check

0 steps flagged

No circularity; derivation self-contained

The provided abstract and description introduce DECA as a block-wise Adam method using local gradient statistics and consensus signals, with a claimed theoretical analysis. No equations, derivations, or self-citations are exhibited that reduce any prediction or result to a fitted input or self-definition by construction. The central claims rest on the described partitioning and moment estimates without visible reduction to the method's own outputs. This is the normal case of a self-contained proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; all such elements would require the full manuscript.

pith-pipeline@v0.9.1-grok · 5760 in / 1063 out tokens · 19356 ms · 2026-06-28T11:22:09.130131+00:00 · methodology

Reference graph

Works this paper leans on

74 extracted references · 12 canonical work pages · 7 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. InProc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP), pages 7319–7328, 2021

2021
[3]

Aketi, A

S. Aketi, A. Hashemi, and K. Roy. Global Update Tracking: A Decentralized Learning Algorithm for Heterogeneous Data. InProc. of the 37th NeurIPS, 2024

2024
[4]

Greedy Layerwise Learning Can Scale To ImageNet

Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy Layerwise Learning Can Scale To ImageNet. InProc. of the 36th International Conference on Machine Learning (ICML), pages 583–593, 2019

2019
[5]

Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy Layer-Wise Training of Deep Networks. InProc. of the 20th Annual Conference on Neural Information Processing Systems (NIPS), pages 153–160, 2006

2006
[6]

Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah

Stephen P. Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized Gossip Algorithms.IEEE Transactions on Information Theory, 52(6):2508–2530, 2006

2006
[7]

On the Importance and Applicability of Pre-Training for Federated Learning

Hong-You Chen, Cheng-Hao Tu, Ziwei Li, and Han-Wei Shen an Wei-Lun Chao. On the Importance and Applicability of Pre-Training for Federated Learning. InProc. of the 11th International Conference on Learning Representations (ICLR), 2023

2023
[8]

Shuaijun Chen, Omid Tavallaie, Niousha Nazemi, and Albert Y . Zomaya. RBLA: Rank-Based- LoRA-Aggregation for Fine-Tuning Heterogeneous Models in FLaaS. InProc. of the 31st International Conference on Web Service (ICWS), pages 47–62, 2024

2024
[9]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sublinear Memory Cost.arXiv preprint arXiv:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

Chiang, Z

W. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023

2023
[11]

Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models

Yae Jee Cho, Luyang Liu, Zheng Xu, Aldi Fahrezi, and Gauri Joshi. Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models. InProc. of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 12903–12913, 2024

2024
[12]

Epidemic Learning: Boosting Decentralized Learning with Randomized Communication

Martijn de V os, Sadegh Farhadkhani, Rachid Guerraoui, Anne-Marie Kermarrec, Rafael Pires, and Rishi Sharma. Epidemic Learning: Boosting Decentralized Learning with Randomized Communication. InProc. of the 36th NeurIPS, 2023

2023
[13]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. InProc. of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

2023
[14]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, and A. Fan. The LLaMA 3 Herd of Models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Cross-Gradient Aggregation for Decentralized Learning from Non-IID Data

Yasaman Esfandiari, Sin Yong Tan, Zhanhong Jiang, Aditya Balu, Ethan Herron, Chinmay Hegde, and Soumik Sarkar. Cross-Gradient Aggregation for Decentralized Learning from Non-IID Data. InProc. of the 38th International Conference on Machine Learning (ICML), pages 3036–3046, 2021

2021
[16]

Decentralized low-rank fine- tuning of large language models

Sajjad Ghiasvand, Mahnoosh Alizadeh, and Ramtin Pedarsani. Decentralized low-rank fine- tuning of large language models. InProceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 334–345, 2025. 10

2025
[17]

Robust Decentralized Learning With Local Updates and Gradient Tracking.IEEE Transactions on Networking, 33(4):2036–2048, 2025

Sajjad Ghiasvand, Amirhossein Reisizadeh, Mahnoosh Alizadeh, and Ramtin Pedarsani. Robust Decentralized Learning With Local Updates and Gradient Tracking.IEEE Transactions on Networking, 33(4):2036–2048, 2025

2036
[18]

Selective Aggregation for Low-Rank Adaptation in Federated Learning

Pengxin Guo, Shuang Zeng, Yanran Wang, Huijie Fan, Feifei Wang, and Liangqiong Qu. Selective Aggregation for Low-Rank Adaptation in Federated Learning. InProc. of the 13th International Conference on Learning Representations (ICLR), 2025

2025
[19]

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP. InProc. of the 36th International Conference on Machine Learning (ICML), pages 2790–2799, 2019

2019
[20]

E. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low- Rank Adaptation of Large Language Models. InProc. of the 10th International Conference on Learning Representations (ICLR), 2022

2022
[21]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. InProc. of the 3rd International Conference on Learning Representations (ICLR), 2015

2015
[22]

NOLA: Networks as Linear Combination of Low Rank Random Basis.arXiv preprint arXiv:2310.02556, 2023

Soroush Abbasi Koohpayegani, KL Navaneet, Parsa Nooralinejad, Soheil Kolouri, and Hamed Pirsiavash. NOLA: Networks as Linear Combination of Low Rank Random Basis.arXiv preprint arXiv:2310.02556, 2023

work page arXiv 2023
[23]

Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. VeRA: Vector-based Random Matrix Adaptation. InProc. of the 12th International Conference on Learning Representations (ICLR), 2024

2024
[24]

The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. InProc. of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3045–3059, 2021

2021
[25]

C. Li, G. Li, and P. Varshney. Decentralized Federated Learning via Mutual Knowledge Transfer. IEEE Internet of Things Journal, 9(2):1136–1147, 2021

2021
[26]

Measuring the Intrinsic Dimension of Objective Landscapes

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the Intrinsic Dimension of Objective Landscapes. InProc. of the 6th International Conference on Learning Representations (ICLR), 2018

2018
[27]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP), pages 4582–4597, 2021

2021
[28]

ReLoRA: High-Rank Training Through Low-Rank Updates

Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. ReLoRA: High-Rank Training Through Low-Rank Updates. InProc. of the 12th International Conference on Learning Representations (ICLR), 2024

2024
[29]

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. InProc. of the 31st Annual Conference on Neural Information Processing Systems (NIPS), pages 5330–5340, 2017

2017
[30]

Stich, and Martin Jaggi

Tao Lin, Sai Praneeth Karimireddy, Sebastian U. Stich, and Martin Jaggi. Quasi-global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data. InProc. of the 38th International Conference on Machine Learning (ICML), pages 6654–6665, 2021

2021
[31]

HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy

Yongkang Liu, Yiqun Zhang, Qian Li, Tong Liu, Shi Feng, Daling Wang, Yifei Zhang, and Hinrich Schütze. HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy. InProc. of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 18266–18287, 2024

2024
[32]

Lu and L

Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1):615–642, 2015

2015
[33]

Q. Luo, H. Yu, and X. Li. BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models. InProc. of the 38th Annual Conference on Neural Information Processing Systems (NIPS), pages 24926–24958, 2024. 11

2024
[34]

Full Parameter Fine-tuning for Large Language Models with Limited Resources

Kai Lv, Yuqing Yang, Tengxiao Liu, Qipeng Guo, and Xipeng Qiu. Full Parameter Fine-tuning for Large Language Models with Limited Resources. InProc. of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 8187–8198, 2024

2024
[35]

Lee, Danqi Chen, and Sanjeev Arora

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-Tuning Language Models With Just Forward Passes. InProc. of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 53038–53075, 2023

2023
[36]

Communication-Efficient Learning of Deep Networks from Decentralized Data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. InProc. of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017

2017
[37]

John Nguyen, Jianyu Wang, Kshitiz Malik, Maziar Sanjabi, and Michael G. Rabbat. Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning. InProc. of the 11th International Conference on Learning Representations (ICLR), 2023

2023
[38]

Empirical Analysis of The Strengths and Weaknesses of PEFT Techniques for LLMs.arXiv preprint arXiv:2304.14999, 2023

George Pu, Anirudh Jain, Jihan Yin, and Russell Kaplan. Empirical Analysis of The Strengths and Weaknesses of PEFT Techniques for LLMs.arXiv preprint arXiv:2304.14999, 2023

work page arXiv 2023
[39]

Pu and A

S. Pu and A. Nedi. Distributed Stochastic Gradient Tracking Methods.Mathematical Program- ming, 187(1):409–457, 2021

2021
[40]

FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning

Jiaxing Qi, Zhongzhi Luan, Shaohan Huang, Carol Fung, Hailong Yang, and Depei Qian. FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning. arXiv preprint arXiv:2406.07925, 2024

work page arXiv 2024
[41]

Federated full-parameter tuning of billion-sized language models with communication cost under 18 kilobytes

Zhen Qin, Daoyuan Chen, Bingchen Qian, Bolin Ding, Yaliang Li, and Shuiguang Deng. Federated full-parameter tuning of billion-sized language models with communication cost under 18 kilobytes. InProc. of the 41st International Conference on Machine Learning (ICML), 2024

2024
[42]

Liangqiong Qu, Yuyin Zhou, Paul Pu Liang, Yingda Xia, Feifei Wang, Ehsan Adeli, Fei-Fei Li, and Daniel L. Rubin. Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning. InProc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10051–10061, 2022

2022
[43]

ZeRO-Offload: Democratizing Billion-Scale Model Training

Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale Model Training. InProc. of the 2021 USENIX Annual Technical Conference (ATC), pages 551–564, 2021

2021
[44]

Waite, Shreyan Ganguly, Aditya Balu, Chinmay Hegde, and Soumik Sarkar

Nastaran Saadati, Zhanhong Jiang, Joshua R. Waite, Shreyan Ganguly, Aditya Balu, Chinmay Hegde, and Soumik Sarkar. DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models.Neural Networks, 2026

2026
[45]

Scaman, F

K. Scaman, F. Bach, S. Bubeck, Y . Lee, and L. Massoulié. Optimal Algorithms for Smooth and Strongly Convex Distributed Optimization in Networks. InProc. of the 34th International Conference on Machine Learning (ICML), pages 3027–3036, 2017

2017
[46]

Scaman, F

K. Scaman, F. Bach, S. Bubeck, Y . Lee, and L. Massoulié. Optimal Algorithms for Non-smooth Distributed Optimization in Networks. InProc. of the 32nd NIPS, page 2745–2754, 2018

2018
[47]

Y . Shi, L. Shen, K. Wei, Y . Sun, B. Yuan, X. Wang, and D. Tao. Improving the Model Consistency of Decentralized Federated Learning. InProc. of the 40th ICML, volume 202, pages 31269–31291, 2023

2023
[48]

Ferret: Federated full-parameter tuning at scale for large language models

Yao Shu, Wenyang Hu, See-Kiong Ng, Bryan Kian Hsiang Low, and Fei Richard Yu. Ferret: Federated full-parameter tuning at scale for large language models. InProc. of the 42nd International Conference on Machine Learning (ICML), 2025

2025
[49]

Improving LoRA in Privacy-preserving Federated Learning

Youbang Sun, Zitao Li, Yaliang Li, and Bolin Ding. Improving LoRA in Privacy-preserving Federated Learning. InProc. of the 12th International Conference on Learning Representations (ICLR), 2024

2024
[50]

Takezawa, H

Y . Takezawa, H. Bao, K. Niwa, R. Sato, and M. Yamada. Momentum Tracking: Momentum Acceleration for Decentralized Deep Learning on Heterogeneous Data.Trans. on Machine Learning Research, 2023, 2023. 12

2023
[51]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Convergence of A Block Coordinate Descent Method for Nondifferentiable Minimization.Journal of Optimization Theory and Applications, 109(3):475–494, 2001

Paul Tseng. Convergence of A Block Coordinate Descent Method for Nondifferentiable Minimization.Journal of Optimization Theory and Applications, 109(3):475–494, 2001

2001
[53]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polo- sukhin. Attention Is All You Need. InProc. of the 30th Annual Conference on Neural Information Processing Systems (NIPS), pages 5998–6008, 2017

2017
[54]

ROSS: RObust decentralized Stochastic learning based on Shapley values.IEEE Transactions on Networking, 34:2911–2926, 2026

Lina Wang, Yunsheng Yuan, Feng Li, and Lingjie Duan. ROSS: RObust decentralized Stochastic learning based on Shapley values.IEEE Transactions on Networking, 34:2911–2926, 2026

2026
[55]

PDSL: Privacy-Preserved Decen- tralized Stochastic Learning with Heterogeneous Data Distribution

Lina Wang, Yunsheng Yuan, Chunxiao Wang, and Feng Li. PDSL: Privacy-Preserved Decen- tralized Stochastic Learning with Heterogeneous Data Distribution. InProc. of the 45th IEEE International Conference on Distributed Computing Systems (ICDCS), pages 736–746, 2025

2025
[56]

FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations

Ziyao Wang, Zheyu Shen, Yexiao He, Guoheng Sun, Hongyi Wang, Lingjuan Lyu, and Ang Li. FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations. InProc. of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[57]

Flexora: Flexible Low-Rank Adaptation for Large Language Models

Chenxing Wei, Yao Shu, Ying Tiffany He, and Fei Yu. Flexora: Flexible Low-Rank Adaptation for Large Language Models. InProc. of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 14643–14682, 2025

2025
[58]

Lawrie, and Benjamin Van Durme

Orion Weller, Marc Marone, Vladimir Braverman, Dawn J. Lawrie, and Benjamin Van Durme. Pretrained Models for Multilingual Federated Learning. InProc. of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 1413–1421, 2022

2022
[59]

Coordinate Descent Algorithms.Mathematical Programming, 151(1):3–34, 2015

Stephen J Wright. Coordinate Descent Algorithms.Mathematical Programming, 151(1):3–34, 2015

2015
[60]

Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning.arXiv preprint arXiv:2401.04151, 2024

Wenhan Xia, Chengwei Qin, and Elad Hazan. Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning.arXiv preprint arXiv:2401.04151, 2024

work page arXiv 2024
[61]

J. Xu, W. Zhang, and F. Wang. A(DP)2SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent With Differential Privacy.IEEE Trans. on Pattern Analysis and Machine Intelligence, 44(11):8036–8047, 2021

2021
[62]

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, et al. Qwen2 technical report.eprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Hao Yu, Rong Jin, and Sen Yang. On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization. InProc. of the 36th International Conference on Machine Learning (ICML), pages 7184–7193, 2019

2019
[65]

When Scaling Meets LLM Fine- tuning: The Effect of Data, Model and Finetuning Method

Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When Scaling Meets LLM Fine- tuning: The Effect of Data, Model and Finetuning Method. InProc. of The 12th International Conference on Learning Representations (ICLR), 2024

2024
[66]

LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. LoRA-FA: Memory- efficient Low-rank Adaptation for Large Language Models Fine-tuning.arXiv preprint arXiv:2308.03303, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[67]

Zhang, X

X. Zhang, X. Chen, M. Hong, S. Wu, and J. Yi. Understanding Clipping for Federated Learning: Convergence and Client-Level Differential Privacy. InProc. of the 39th ICML, volume 162, pages 26048–26067, 2022

2022
[68]

NET- FLEET: Achieving Linear Convergence Speedup for Fully Decentralized Federated Learning with Heterogeneous Data

Xin Zhang, Minghong Fang, Zhuqing Liu, Haibo Yang, Jia Liu, and Zhengyuan Zhu. NET- FLEET: Achieving Linear Convergence Speedup for Fully Decentralized Federated Learning with Heterogeneous Data. InProc. of the 23rd International Symposium on Theory, Algorithmic 13 Foundations, and Protocol Design for Mobile Networks and Mobile Computing (MobiHoc), page 7...

2022
[69]

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use Fewer Learning Rates To Gain More. In Proc. of the 13th International Conference on Learning Representations (ICLR), 2025.
[70]

Zixin Zhang, Fan Qi, and Changsheng Xu. Enhancing Storage and Computational Efficiency in Federated Multimodal Learning for Large-Scale Models. In Proc. of the 41st International Conference on Machine Learning (ICML), 2024.
[71]

Haodong Zhao, Wei Du, Fangqi Li, Peixuan Li, and Gongshen Liu. FedPrompt: Communication-Efficient and Privacy-Preserving Prompt Tuning in Federated Learning. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[72]

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. In Proc. of the 41st International Conference on Machine Learning (ICML), 2024.
[73]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proc. of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
[74]

and the last inequality comes from $X_k^{[t,0]} - \bar{X}_k^{[t,0]} = X_{k-1}^{[t,R]} - \bar{X}_{k-1}^{[t,R]}$ and $X_1^{[0,0]} = \bar{X}_1^{[0,0]}$.

Sequence Tracking Error. Next we show the error boundary of $e_k^{[t,r]}$. According to the definition of the error term (see Eq. (16)), we have

$$e_k^{[t,r+1]} = \hat{x}_k^{[t,r+1]} - \bar{x}_k^{[t,r+1]} = e_k^{[t,r]} + \gamma \frac{1}{1-\alpha_1^{r+1}} \Big( \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i} w_{ij} \, \alpha_1 m_j^{[t,r]} + (1-\alpha \ldots$$
