UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models

Hong-Hanh Nguyen-Le; Marco Ruffini; Merim Dzaferagic; Van-Tuan Tran

arxiv: 2605.16690 · v1 · pith:2GFYK5MNnew · submitted 2026-05-15 · 💻 cs.LG


Pith reviewed 2026-05-20 19:01 UTC · model grok-4.3

classification 💻 cs.LG

keywords Sparse Mixture-of-Experts · Federated Fine-tuning · Foundation Models · Resource Heterogeneity · Dynamic Routing · LoRA · Conditional Computation


The pith

A sparse mixture-of-experts layer with balanced routing and pseudo-gradients lets low-resource clients fine-tune foundation models using far less computation while converging faster than rank-adaptive baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that standard sparse mixture-of-experts applied to federated fine-tuning creates two problems that hurt weak clients: experts are used unevenly across devices, and the top-k router blocks gradient flow to unused experts. These issues slow convergence, especially on low-resource machines. The authors introduce dynamic modulated routing to even out expert selection and a universal pseudo-gradient that supplies learning signals to every expert even when it is not chosen. Together the two mechanisms keep all experts viable and produce a stable training loop across heterogeneous clients. On the reported benchmarks this yields up to 45 percent compute savings on low-resource clients and an 8.7-fold improvement in those clients' performance over prior heterogeneous LoRA-rank methods.
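To make the gradient-blocking point concrete, here is a minimal sketch of Top-K gating in plain Python. It is illustrative only and not the paper's implementation: the names `softmax` and `top_k_gate`, the example logits, and the choice to renormalize the softmax over the selected experts are all assumptions. Experts outside the top k get a gate weight of exactly zero, so they contribute nothing to the layer output and receive no learning signal through the gate.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Per-expert gate weights: softmax over the k largest logits, 0 elsewhere.

    Assumption: the softmax is renormalized over the selected experts only.
    Some implementations apply the softmax before selection instead.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in chosen])
    gates = [0.0] * len(logits)
    for i, w in zip(chosen, weights):
        gates[i] = w
    return gates


# Hypothetical router logits for four experts.
logits = [2.0, 0.5, 1.0, -1.0]
gates = top_k_gate(logits, k=2)

# Experts with a zero gate are skipped entirely in the forward pass.
inactive = [i for i, g in enumerate(gates) if g == 0.0]
print(gates, inactive)
```

Lowering k for a weaker client enlarges the `inactive` set, which is why the problem is described as most severe on low-resource devices.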

Core claim

Expert utilization imbalance and non-differentiability of Top-K routing are the dominant causes of degraded convergence when sparse mixture-of-experts is used in heterogeneous federated fine-tuning; Dynamic Modulated Routing rebalances expert activation while Universal Pseudo-Gradient reconstructs learning signals for inactive experts, forming a self-reinforcing cycle that maintains expert viability for every client regardless of resource level.

What carries the argument

Dynamic Modulated Routing (DMR) and Universal Pseudo-Gradient (PG), which together restore balanced expert utilization and differentiable signals for all experts in a heterogeneous federated setting.

If this is right

Low-resource clients obtain up to 45 percent computational reduction while reaching higher accuracy than existing heterogeneous LoRA methods.
The same routing layer works for every client without needing client-specific model architectures.
Expert utilization becomes roughly uniform across devices of different capabilities.
Non-activated experts still receive usable gradient information, preserving their contribution to the overall model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same balancing mechanisms could be applied to other conditional-computation layers that suffer from routing collapse in distributed training.
Federated systems might reduce reliance on per-client rank selection if balanced sparse layers prove robust at larger scales.
Testing the method on vision or multimodal foundation models would reveal whether the self-reinforcing cycle generalizes beyond language tasks.

Load-bearing premise

The two discordances of expert imbalance and non-differentiable routing are the main obstacles to convergence on constrained clients, and fixing them with DMR and PG creates a stable self-reinforcing training cycle across all devices.

What would settle it

A controlled run in which low-resource clients still show worse final accuracy or slower convergence than the LoRA-rank baseline even after DMR and PG are added would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.16690 by Hong-Hanh Nguyen-Le, Marco Ruffini, Merim Dzaferagic, Van-Tuan Tran.

**Figure 1.** Overview of the UB-SMoE architecture for resource-adaptive federated fine-tuning. The system operates in five steps: (1) The server computes global utilization statistics $\tilde{u}$ tracking expert usage across clients; (2) Each client’s router applies Dynamic Modulated Routing, modulating logits via $m_i^{(l)} = s_i^{(l)} + \phi_i^{(l)}$ to promote under-utilized but relevant experts; (3) Clients activate a budget-specific …
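The modulation step in the caption, $m_i = s_i + \phi_i$, can be sketched as follows. This is a toy illustration under stated assumptions: the function name `modulated_top_k`, the scores, and the $\phi$ values are hypothetical, and the sketch takes $\phi$ as given rather than showing how the method obtains it, which the excerpt here does not specify.

```python
def modulated_top_k(scores, phi, k):
    """Select k experts using modulated logits m_i = s_i + phi_i.

    Returns the sorted indices of the selected experts and the modulated logits.
    """
    modulated = [s + p for s, p in zip(scores, phi)]
    order = sorted(range(len(modulated)), key=lambda i: modulated[i], reverse=True)
    return sorted(order[:k]), modulated


# Hypothetical router scores for four experts on one token.
scores = [2.0, 1.8, 1.7, 0.1]
# Hypothetical offsets: expert 0 is pushed down, expert 2 is pushed up.
phi = [-0.5, 0.0, 0.4, 0.0]

plain, _ = modulated_top_k(scores, [0.0] * len(scores), k=2)
balanced, _ = modulated_top_k(scores, phi, k=2)
print(plain, balanced)
```

In this toy case the offset promotes expert 2, whose raw score was already close to the cut-off, while expert 3 stays unselected because its score is far lower. That matches the caption's phrase "under-utilized but relevant": an additive offset shifts the ranking without overriding a large relevance gap.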

**Figure 2.** Global expert utilization entropy comparison. Higher entropy indicates more balanced utilization. [Sec. 6.3.4, Pseudo-Gradient Alignment for Inactive Experts] To directly test whether PG provides a meaningful signal for experts that are inactive on low-resource clients, we measure cosine similarity between the server-side pseudo-gradient and the corresponding client-side true gradient on inactive experts [PIT…
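The two quantities named in this caption, utilization entropy and cosine similarity between gradients, are standard and can be sketched in a few lines. The function names and example vectors below are assumptions for illustration, and the paper may normalize its entropy differently (for example, dividing by $\log M$).

```python
import math


def utilization_entropy(util):
    """Shannon entropy (natural log) of expert utilization rates.

    Rates are normalized to sum to 1; zero-rate experts contribute 0.
    """
    total = sum(util)
    probs = [u / total for u in util if u > 0]
    return -sum(p * math.log(p) for p in probs)


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical utilization rates for four experts.
uniform = utilization_entropy([0.25, 0.25, 0.25, 0.25])
skewed = utilization_entropy([0.85, 0.05, 0.05, 0.05])

# Hypothetical flattened gradients for one inactive expert.
pseudo_grad = [0.2, -0.1, 0.4]
true_grad = [0.1, -0.05, 0.2]
alignment = cosine_similarity(pseudo_grad, true_grad)
print(uniform, skewed, alignment)
```

Entropy peaks at $\log M$ when every expert is used equally, so a higher value means better balance. A cosine similarity near 1 means the pseudo-gradient points in nearly the same direction as the true gradient, regardless of magnitude.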

**Figure 3.** Computational requirements (in FLOPS) for two strategies aimed at adapting FMs to clients with varying resource levels, from high-capability (β4) to low-capability (β1). The heterogeneous LoRA-rank method reduces the rank of LoRA matrices. The Heterogeneous Sparsity method reduces the number of activated model components (experts). These results demonstrate that sparsity methods are fundamentally better su…
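A back-of-the-envelope count shows why shrinking the LoRA rank saves little while activating fewer experts saves a lot. The sketch below is arithmetic on assumed sizes, not a reproduction of the figure: all dimensions, ranks, and expert counts are hypothetical, only forward-pass FLOPs for a single feed-forward block are counted, and attention, the router, and the backward pass are ignored.

```python
def ffn_flops(d_model: int, d_hidden: int) -> int:
    """Forward FLOPs per token for a two-projection feed-forward block.

    Counts 2 FLOPs per multiply-add for the up- and down-projection.
    """
    return 2 * 2 * d_model * d_hidden


def lora_flops(d_in: int, d_out: int, rank: int) -> int:
    """Forward FLOPs per token for one LoRA adapter (x @ A @ B)."""
    return 2 * rank * (d_in + d_out)


def dense_lora_flops(d_model: int, d_hidden: int, rank: int) -> int:
    """Dense FFN plus LoRA adapters on both projections (an assumption)."""
    adapters = lora_flops(d_model, d_hidden, rank) + lora_flops(d_hidden, d_model, rank)
    return ffn_flops(d_model, d_hidden) + adapters


def smoe_flops(d_model: int, d_expert: int, k: int) -> int:
    """Expert cost when k experts of hidden size d_expert are activated."""
    return k * ffn_flops(d_model, d_expert)


# Hypothetical sizes, chosen only to illustrate the scaling.
D_MODEL, D_HIDDEN, D_EXPERT = 2048, 8192, 1024

# Cutting the LoRA rank from 64 to 8 on a dense block.
rank_saving = 1 - dense_lora_flops(D_MODEL, D_HIDDEN, 8) / dense_lora_flops(D_MODEL, D_HIDDEN, 64)
# Cutting the number of active experts from 8 to 4.
sparse_saving = 1 - smoe_flops(D_MODEL, D_EXPERT, 4) / smoe_flops(D_MODEL, D_EXPERT, 8)
print(f"rank: {rank_saving:.1%}, sparsity: {sparse_saving:.1%}")
```

With these toy dimensions the rank reduction trims only a few percent of the block's FLOPs, because the dense projections dominate, while halving the active experts halves the expert cost. These figures say nothing about the paper's measured savings.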

**Figure 4.** Expert Utilization Gini Analysis. To complement our entropy-based analysis in Sec. 6.3.3, we employ the Gini coefficient as an alternative metric for evaluating expert utilization balance. Given an SMoE layer with $M$ experts and global utilization rates $\{\tilde{u}_i\}_{i=1}^{M}$, the Gini coefficient is computed as: $G = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} |\tilde{u}_i - \tilde{u}_j|}{2M \sum_{i=1}^{M} \tilde{u}_i}$ (62). $G$ is bounded in $[0, 1]$, where 0 indicates perfect equality…
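The Gini formula in this caption translates directly into code. The sketch below follows the pairwise-difference definition as written; the function name `gini` and the example utilization vectors are assumptions for illustration.

```python
def gini(util):
    """Gini coefficient of utilization rates.

    G = sum_i sum_j |u_i - u_j| / (2 * M * sum_i u_i), with M = len(util).
    Returns 0.0 when all rates are zero to avoid dividing by zero.
    """
    m = len(util)
    total = sum(util)
    if total == 0:
        return 0.0
    pairwise = sum(abs(a - b) for a in util for b in util)
    return pairwise / (2 * m * total)


# Hypothetical utilization rates for four experts.
balanced = gini([1.0, 1.0, 1.0, 1.0])
collapsed = gini([1.0, 0.0, 0.0, 0.0])
print(balanced, collapsed)
```

Perfectly equal utilization gives 0. When all traffic goes to a single expert the value is $(M-1)/M$, so for finite $M$ the coefficient approaches but does not reach 1.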

**Figure 5.** Layer-wise expert utilization entropy comparison. Higher entropy indicates more balanced utilization. UB-SMoE maintains superior entropy across all 16 SMoE layers. … and 5.1% relative improvement, respectively. The improvement is most pronounced in middle layers (L6–L10), where utilization imbalance tends to be most severe due to semantic abstraction inducing expert specialization. As discussed in Section G.…

**Figure 6.** Dynamics of Dynamic Modulated Routing (DMR) across training rounds. We track the correlation between learned modulation parameters ϕ and global expert utilization rates, alongside the Pearson r and Spearman ρ correlation coefficients. To validate the effectiveness of the ϕ values in modulating expert utilization, we analyze the correlation between the learned ϕ values and global expert utilization rates ac…
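The Pearson r and Spearman ρ coefficients named in this caption can be computed as in the sketch below, which uses the textbook definitions (Spearman as Pearson on average ranks). The function names and the toy utilization and ϕ values are hypothetical, and the sign of the correlation in this toy data makes no claim about the paper's results.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally sized sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-expert utilization rates and modulation parameters.
utilization = [0.40, 0.30, 0.20, 0.10]
phi = [-0.9, -0.2, 0.1, 0.2]

r = pearson(utilization, phi)
rho = spearman(utilization, phi)
print(r, rho)
```

Pearson measures linear association, while Spearman depends only on rank order, so reporting both distinguishes a strictly linear relationship from one that is merely monotone.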

**Figure 7.** Correlation analysis between expert utilization rates and DMR modulation parameters ϕ over training. Each scatter plot visualizes 8,192 expert-layer pairs with corresponding Pearson (r) correlation coefficients. Trend lines (dashed) indicate the linear regression slope. [Sec. H, Discussion and Limitations] Our work reveals that the fundamental bottleneck in heterogeneous federated fine-tuning lies not in adapter …

read the original abstract

Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to $45.0\%$ computational reduction on low-resource clients while improving their performance by $8.7 \times$ compared to existing heterogeneous LoRA-rank methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UB-SMoE adds dynamic modulated routing and universal pseudo-gradients to fix expert imbalance and gradient flow in heterogeneous federated SMoE, with controlled experiments showing clear compute and performance gains on low-resource clients.


The paper identifies two concrete problems when dropping standard SMoE into federated fine-tuning across uneven hardware: expert utilization skews heavily toward high-resource clients, and top-k routing cuts off gradients to the rest. Dynamic modulated routing rebalances the load on the fly, while the universal pseudo-gradient supplies learning signals to inactive experts. Together they create the self-reinforcing cycle the authors describe, and the convergence analysis links these fixes directly to better behavior on constrained clients rather than just asserting it.

Referee Report

2 major / 2 minor

Summary. The paper proposes Universally Balanced Sparse Mixture-of-Experts (UB-SMoE) for resource-adaptive federated fine-tuning of foundation models. It identifies two discordances in naive SMoE application to heterogeneous federated settings—expert utilization imbalance and non-differentiability of Top-K routing—and provides a convergence analysis showing these degrade performance especially for low-resource clients. Dynamic Modulated Routing (DMR) is introduced to rebalance expert utilization and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts, forming a self-reinforcing cycle. Experiments on benchmarks report up to 45% computational reduction on low-resource clients and 8.7× performance improvement relative to heterogeneous LoRA-rank methods, with controls for client resources and ablations isolating each component.

Significance. If the results hold, this work advances federated fine-tuning of large models by enabling effective sparse conditional computation across heterogeneous devices, substantially lowering the burden on resource-constrained clients while improving performance. The convergence analysis supplies theoretical grounding for the mechanisms, and the experiments include resource-level controls plus ablations that isolate DMR and PG contributions. Credit is given for the explicit derivation of mechanisms from identified discordances and for maintaining expert viability without introducing new instabilities.

major comments (2)

[§3] §3 (Convergence Analysis): the analysis establishes that the two discordances degrade convergence for low-resource clients, yet the quantitative bound or rate at which DMR + PG restores viability and yields the reported 45% compute reduction is not derived explicitly; a direct link from the corrected gradient flow to the observed savings would strengthen the central claim.
[§5] §5 (Experiments): performance tables report the 8.7× improvement and consistent gains across benchmarks, but lack error bars, number of independent runs, or data-exclusion criteria; without these, the statistical reliability of the cross-client and cross-baseline claims cannot be fully verified from the text.

minor comments (2)

[§3.3] The notation and update rules for DMR modulation strength and PG reconstruction could be presented with a compact algorithm box or diagram to improve readability of the self-reinforcing cycle.
[§2] A few sentences in the related-work section would benefit from explicit comparison to recent non-federated SMoE balancing techniques to clarify the federated-specific novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with clarifications and proposed changes to strengthen the manuscript.


Referee: §3 (Convergence Analysis): the analysis establishes that the two discordances degrade convergence for low-resource clients, yet the quantitative bound or rate at which DMR + PG restores viability and yields the reported 45% compute reduction is not derived explicitly; a direct link from the corrected gradient flow to the observed savings would strengthen the central claim.

Authors: We appreciate this observation. Section 3 derives that the discordances (imbalanced utilization and non-differentiable routing) produce suboptimal gradient flow and slower convergence for low-resource clients. The mechanisms DMR and PG are shown to restore balanced utilization and differentiable signals, forming the self-reinforcing cycle. While an explicit closed-form bound tying the corrected flow directly to the empirical 45% compute reduction is not provided (the savings are measured experimentally under controlled resource heterogeneity), the analysis supplies the necessary theoretical grounding. In revision we will insert a short discussion paragraph after Theorem 3.2 that qualitatively connects the restored per-expert gradient norms to the observed reduction in active parameters on low-resource clients, thereby tightening the theory-experiment link without altering the core proofs. revision: partial
Referee: §5 (Experiments): performance tables report the 8.7× improvement and consistent gains across benchmarks, but lack error bars, number of independent runs, or data-exclusion criteria; without these, the statistical reliability of the cross-client and cross-baseline claims cannot be fully verified from the text.

Authors: This is a valid point. All reported results were obtained from five independent runs with distinct random seeds for client sampling, data shuffling, and initialization; means and standard deviations were computed but omitted from the tables for space. No data points were excluded. In the revised manuscript we will augment the tables in §5 with error bars (mean ± std), explicitly state the number of runs, and add a sentence in the experimental setup clarifying the absence of exclusion criteria. These additions will make the statistical reliability transparent while preserving the existing figures and tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The paper identifies two specific discordances (expert utilization imbalance and Top-K non-differentiability) through convergence analysis, then derives DMR and PG mechanisms to restore balance and gradient flow. These steps are motivated by explicitly stated problems rather than being defined in terms of the reported performance gains or fitted to target metrics by construction. No load-bearing self-citations, self-definitional loops, or fitted inputs renamed as predictions appear in the derivation. Experiments include controls for client resources and ablations isolating components, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

Central claim depends on two newly introduced mechanisms whose internal parameters and convergence assumptions are not detailed in the abstract; effectiveness is asserted via experiments whose full controls are unavailable.

free parameters (1)

modulation strength and routing bias parameters in DMR
Hyperparameters that control how strongly routing is adjusted to achieve balance; their values are chosen or fitted to heterogeneous client data.

axioms (1)

domain assumption: the convergence analysis assumes that the self-reinforcing cycle of DMR and PG maintains expert viability
Invoked to link the proposed fixes to stable training across resource levels.

invented entities (2)

Dynamic Modulated Routing (DMR) · no independent evidence
purpose: Rebalance expert utilization across heterogeneous clients
New routing procedure introduced to counteract imbalance.

Universal Pseudo-Gradient (PG) · no independent evidence
purpose: Supply learning signals to non-activated experts
New surrogate gradient construction to address non-differentiability of Top-K routing.

pith-pipeline@v0.9.0 · 5759 in / 1512 out tokens · 65788 ms · 2026-05-20T19:01:52.402420+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

We identify two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Our convergence analysis demonstrates that these discordances lead to degraded convergence...

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 4.1 (Global Convergence Rate under Heterogeneous Sparsity)... bias error term B_SMoE = 2||B(Θ*)||² / μ'

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 11 internal anchors

[1]

International Conference on Learning Representations , year=

LoRA: Low-Rank Adaptation of Large Language Models , author=. International Conference on Learning Representations , year=

work page
[2]

International Conference on Machine Learning , pages=

LoRA+: Efficient Low Rank Adaptation of Large Models , author=. International Conference on Machine Learning , pages=. 2024 , organization=

work page 2024
[3]

Nature Machine Intelligence , volume=

Parameter-efficient fine-tuning of large-scale pre-trained language models , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

work page 2023
[4]

The Twelfth International Conference on Learning Representations , year=

Improving LoRA in Privacy-preserving Federated Learning , author=. The Twelfth International Conference on Learning Representations , year=

work page
[5]

arXiv preprint arXiv:2504.21099 , year=

A survey on parameter-efficient fine-tuning for foundation models in federated learning , author=. arXiv preprint arXiv:2504.21099 , year=

work page arXiv
[6]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[7]

Federated fine-tuning of large language models under heterogeneous tasks and client resources

Federated fine-tuning of large language models under heterogeneous tasks and client resources , author=. arXiv preprint arXiv:2402.11505 , year=

work page arXiv
[8]

arXiv preprint arXiv:2502.15436 , year=

Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning , author=. arXiv preprint arXiv:2502.15436 , year=

work page arXiv
[9]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[10]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Feddat: An approach for foundation model finetuning in multi-modal heterogeneous federated learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[11]

arXiv preprint arXiv:2506.09199 , year=

FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models , author=. arXiv preprint arXiv:2506.09199 , year=

work page arXiv
[12]

ICLR 2025 Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning , year=

Revisiting Sparse Mixture of Experts for Resource-adaptive Federated Fine-tuning Foundation Models , author=. ICLR 2025 Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning , year=

work page 2025
[13]

International Conference on Learning Representations , year=

On the Convergence of FedAvg on Non-IID Data , author=. International Conference on Learning Representations , year=

work page
[14]

International Conference on Artificial Intelligence and Statistics , pages=

Communication-Efficient Learning of Deep Networks from Decentralized Data , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2017 , organization=

work page 2017
[15]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

St-moe: Designing stable and transferable sparse expert models , author=. arXiv preprint arXiv:2202.08906 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models, 2025 a

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models , author=. arXiv preprint arXiv:2501.11873 , year=

work page arXiv
[17]

Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023) , year=

Sparse Backpropagation for MoE Training , author=. Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023) , year=

work page 2023
[18]

Workshop on Machine Learning and Compression, NeurIPS 2024 , year=

Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts , author=. Workshop on Machine Learning and Compression, NeurIPS 2024 , year=

work page 2024
[19]

1998 , publisher=

Mean value theorems and functional equations , author=. 1998 , publisher=

work page 1998
[20]

Illinois Journal of Mathematics , volume=

A converse to the dominated convergence theorem , author=. Illinois Journal of Mathematics , volume=. 1963 , publisher=

work page 1963
[21]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2023
[22]

arXiv preprint arXiv:2501.13985 , year=

Pilot: Building the Federated Multimodal Instruction Tuning Framework , author=. arXiv preprint arXiv:2501.13985 , year=

work page arXiv
[23]

Advances in Neural Information Processing Systems , volume=

Federated learning from vision-language foundation models: Theoretical analysis and method , author=. Advances in Neural Information Processing Systems , volume=

work page
[24]

Proceedings of the ACM Web Conference 2023 , pages=

Pfedprompt: Learning personalized prompt for vision-language models in federated learning , author=. Proceedings of the ACM Web Conference 2023 , pages=

work page 2023
[25]

Advances in Mathematics , volume=

Best constants in Young's inequality, its converse, and its generalization to more than three functions , author=. Advances in Mathematics , volume=. 1976 , publisher=

work page 1976
[26]

Advances in Neural Information Processing Systems , volume=

Bridging discrete and backpropagation: Straight-through and beyond , author=. Advances in Neural Information Processing Systems , volume=

work page
[27]

Towards Building the

Zhang, Jianyi and Vahidian, Saeed and Kuo, Martin and Li, Chunyuan and Zhang, Ruiyi and Yu, Tong and Wang, Guoyin and Chen, Yiran , booktitle=. Towards Building the. 2024 , organization=

work page 2024
[28]

arXiv preprint arXiv:2405.17267 , year=

FedHPL: Efficient Heterogeneous Federated Learning with Prompt Tuning and Logit Distillation , author=. arXiv preprint arXiv:2405.17267 , year=

work page arXiv
[29]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Efficient model personalization in federated learning via client-specific prompt generation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[30]

Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities , year=

FedSelect: Customized Selection of Parameters for Fine-Tuning during Personalized Federated Learning , author=. Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities , year=

work page
[31]

International conference on machine learning , pages=

Scaffold: Stochastic controlled averaging for federated learning , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[32]

International Conference on Learning Representations , year=

On the convergence of fedavg on non-iid data , author=. International Conference on Learning Representations , year=

work page
[33]

Advances in Neural Information Processing Systems , volume=

Convergence analysis of sequential federated learning on heterogeneous data , author=. Advances in Neural Information Processing Systems , volume=

work page
[34]

International conference on machine learning , pages=

A unified theory of decentralized SGD with changing topology and local updates , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[35]

, year =

On the convergence of SGD with biased gradients , author=. arXiv preprint arXiv:2008.00051 , year=

work page arXiv 2008
[36]

Advances in Neural Information Processing Systems , volume=

Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[37]

Conference on Learning Theory , pages=

Non-asymptotic analysis of biased stochastic approximation scheme , author=. Conference on Learning Theory , pages=. 2019 , organization=

work page 2019
[38]

SIAM review , volume=

Optimization methods for large-scale machine learning , author=. SIAM review , volume=. 2018 , publisher=

work page 2018
[39]

International conference on machine learning , pages=

SGD: General analysis and improved rates , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[40]

Advances in Neural Information Processing Systems , volume=

A guide through the zoo of biased SGD , author=. Advances in Neural Information Processing Systems , volume=

work page
[41]

2013 , publisher=

Introductory lectures on convex optimization: A basic course , author=. 2013 , publisher=

work page 2013
[42]

Linear convergence of gradient and proximal-gradient methods under the polyak-

Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

work page 2016
[43]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer , author=. arXiv preprint arXiv:1701.06538 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Journal of Machine Learning Research , volume=

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity , author=. Journal of Machine Learning Research , volume=

work page
[45]

International conference on machine learning , pages=

Glam: Efficient scaling of language models with mixture-of-experts , author=. International conference on machine learning , pages=. 2022 , organization=

work page 2022
[46]

Mixtral of Experts

Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model , author=. arXiv preprint arXiv:2405.04434 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[48]

2021 , url=

Dmitry Lepikhin and HyoukJoong Lee and Yuanzhong Xu and Dehao Chen and Orhan Firat and Yanping Huang and Maxim Krikun and Noam Shazeer and Zhifeng Chen , booktitle=. 2021 , url=

work page 2021
[49]

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

Auxiliary-loss-free load balancing strategy for mixture-of-experts , author=. arXiv preprint arXiv:2408.15664 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Han and Yuan Zhong , booktitle=

X.Y. Han and Yuan Zhong , booktitle=. A Theoretical Framework for Auxiliary-Loss-Free Load-Balancing of Sparse Mixture-of-Experts in Large-Scale. 2025 , url=

work page 2025
[51]

2409.12136 , archivePrefix=

Grin: Gradient-informed moe , author=. arXiv preprint arXiv:2409.12136 , year=

work page arXiv
[52]

ReMoE: Fully Differentiable Mixture-of-Experts with Re

Ziteng Wang and Jun Zhu and Jianfei Chen , booktitle=. ReMoE: Fully Differentiable Mixture-of-Experts with Re. 2025 , url=

work page 2025
[53]

arXiv preprint arXiv:2504.12463 , year=

Dense backpropagation improves training for sparse mixture-of-experts , author=. arXiv preprint arXiv:2504.12463 , year=

work page arXiv
[54]

A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning , volume =

Liu, Bo and Feng, Xidong and Ren, Jie and Mai, Luo and Zhu, Rui and Zhang, Haifeng and Wang, Jun and Yang, Yaodong , booktitle =. A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning , volume =

work page
[55]

Proceedings of the AAAI conference on artificial intelligence , volume=

Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[56]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2018
[57]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=

HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=

work page
[58]

SocialIQA: Commonsense Reasoning about Social Interactions. arXiv preprint arXiv:1904.09728.
[59]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
[60]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. arXiv preprint arXiv:1905.10044.
[61]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. 2021. doi:10.1145/3474381.
[62]

FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models. First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models.
[63]

Initialization using update approximation is a silver bullet for extremely efficient low-rank fine-tuning. arXiv preprint arXiv:2411.19557.
[64]

OLMoE: Open Mixture-of-Experts Language Models. arXiv preprint arXiv:2409.02060.
[65]

OLMo: Accelerating the science of language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[66]

Bidirectional encoder representations from transformers (BERT) for question answering in the telecom domain: Adapting a BERT-like language model to the telecom domain using the ELECTRA pre-training approach.
[67]

Exact and linear convergence for federated learning under arbitrary client participation is attainable. Advances in Neural Information Processing Systems.
[68]

Analysis of regularized federated learning. Neurocomputing, 2025.
[69]

EFSkip: A new error feedback with linear speedup for compressed federated learning with arbitrary data heterogeneity. Proceedings of the AAAI Conference on Artificial Intelligence.
[70]

Every parameter matters: Ensuring the convergence of federated learning with dynamic heterogeneous models reduction. Advances in Neural Information Processing Systems.
[71]

Convergence analysis of split federated learning on heterogeneous data. Advances in Neural Information Processing Systems.
[72]

Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis. arXiv preprint arXiv:2404.08003.
[73]

Convergence-Guaranteed Federated Learning through Gradient Trajectory Smoothing with Triple-Objective Decomposition. ACM Transactions on Knowledge Discovery from Data, 2025.
[74]

Transformer feed-forward layers are key-value memories. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[75]

Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification. arXiv preprint arXiv:1909.06335.
[76]

Bayesian nonparametric federated learning of neural networks. International Conference on Machine Learning, 2019.
[77]

Federated learning for computationally constrained heterogeneous devices: A survey. ACM Computing Surveys, 2023.
[78]

Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
