pith. sign in

arxiv: 2605.16690 · v1 · pith:2GFYK5MNnew · submitted 2026-05-15 · 💻 cs.LG

UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models

Pith reviewed 2026-05-20 19:01 UTC · model grok-4.3

classification 💻 cs.LG
keywords Sparse Mixture-of-ExpertsFederated Fine-tuningFoundation ModelsResource HeterogeneityDynamic RoutingLoRAConditional Computation
0
0 comments X

The pith

A sparse mixture-of-experts layer with balanced routing and pseudo-gradients lets low-resource clients fine-tune foundation models using far less computation while converging faster than rank-adaptive baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that standard sparse mixture-of-experts applied to federated fine-tuning creates two problems that hurt weak clients: experts are used unevenly across devices, and the top-k router blocks gradient flow to unused experts. These issues slow convergence especially on low-resource machines. The authors introduce dynamic modulated routing to even out expert selection and a universal pseudo-gradient that supplies learning signals to every expert even when it is not chosen. Together the two mechanisms keep all experts viable and produce a stable training loop across heterogeneous clients. On standard benchmarks this yields up to 45 percent compute savings on the weakest devices and an 8.7-fold performance gain over prior heterogeneous LoRA methods.

Core claim

Expert utilization imbalance and non-differentiability of Top-K routing are the dominant causes of degraded convergence when sparse mixture-of-experts is used in heterogeneous federated fine-tuning; Dynamic Modulated Routing rebalances expert activation while Universal Pseudo-Gradient reconstructs learning signals for inactive experts, forming a self-reinforcing cycle that maintains expert viability for every client regardless of resource level.

What carries the argument

Dynamic Modulated Routing (DMR) together with Universal Pseudo-Gradient (PG), which together restore balanced expert utilization and differentiable signals for all experts in a heterogeneous federated setting.

If this is right

  • Low-resource clients obtain up to 45 percent computational reduction while reaching higher accuracy than existing heterogeneous LoRA methods.
  • The same routing layer works for every client without needing client-specific model architectures.
  • Expert utilization becomes roughly uniform across devices of different capabilities.
  • Non-activated experts still receive usable gradient information, preserving their contribution to the overall model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same balancing mechanisms could be applied to other conditional-computation layers that suffer from routing collapse in distributed training.
  • Federated systems might reduce reliance on per-client rank selection if balanced sparse layers prove robust at larger scales.
  • Testing the method on vision or multimodal foundation models would reveal whether the self-reinforcing cycle generalizes beyond language tasks.

Load-bearing premise

The two discordances of expert imbalance and non-differentiable routing are the main obstacles to convergence on constrained clients, and fixing them with DMR and PG creates a stable self-reinforcing training cycle across all devices.

What would settle it

A controlled run in which low-resource clients still show worse final accuracy or slower convergence than the LoRA-rank baseline even after DMR and PG are added would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.16690 by Hong-Hanh Nguyen-Le, Marco Ruffini, Merim Dzaferagic, Van-Tuan Tran.

Figure 1
Figure 1. Figure 1: Overview of the UB-SMoE architecture for resource-adaptive federated fine-tuning. The system operates in five steps: (1) The server computes global utilization statistics u˜ tracking expert usage across clients; (2) Each client’s router applies Dynamic Modulated Routing, modulating logits via m (l) i = s (l) i + ϕ (l) i to promote under-utilized but relevant experts; (3) Clients activate a budget-specific … view at source ↗
Figure 2
Figure 2. Figure 2: Global expert utilization entropy comparison. Higher entropy indicates more balanced utilization. 6.3.4. PSEUDO-GRADIENT ALIGNMENT FOR INACTIVE EXPERTS To directly test whether PG provides a meaning￾ful signal for experts that are inactive on low￾resource clients, we measure cosine similarity be￾tween the server-side pseudo-gradient and the corre￾sponding client-side true gradient on inactive experts [PIT… view at source ↗
Figure 3
Figure 3. Figure 3: Computational requirements (in FLOPS) for two strategies aimed at adapting FMs to clients with varying resource levels, from high-capability (β4) to low-capability (β1). The heterogeneous LoRA-rank method reduces the rank of LoRA matrices. The Heterogeneous Sparsity method reduces the number of activated model components (experts). These results demonstrate that sparsity methods are fundamentally better su… view at source ↗
Figure 4
Figure 4. Figure 4: Expert Utilization Gini Analysis. To complement our entropy-based analysis in Sec. 6.3.3, we employ the Gini coefficient as an alternative metric for evaluating expert utilization balance. Given an SMoE layer with M experts and global utilization rates {u˜i}M i , the Gini coefficient is computed as: G = PM i=1 PM j=1 |u˜i − u˜j | 2M PM i=1 u˜i (62) G is bounded in [0, 1], where 0 indicates perfect equality… view at source ↗
Figure 5
Figure 5. Figure 5: Layer-wise expert utilization entropy comparison. Higher entropy indicates more balanced utilization. UB-SMoE maintains superior entropy across all 16 SMoE layers. and 5.1% relative improvement, respectively. The improvement is most pronounced in middle layers (L6–L10), where utilization imbalance tends to be most severe due to semantic abstraction inducing expert specialization. As discussed in Section G.… view at source ↗
Figure 6
Figure 6. Figure 6: Dynamics of Dynamic Modulated Routing (DMR) across training rounds. We track the correlation between learned modulation parameters ϕ and global expert utilization rates, alongside the Pearson r and Spearman ρ correlation coefficients. To validate the effectiveness of the ϕ values in modulating expert utilization, we analyze the correlation between the learned ϕ values and global expert utilization rates ac… view at source ↗
Figure 7
Figure 7. Figure 7: Correlation analysis between expert utilization rates and DMR modulation parameters ϕ over training. Each scatter plot visualizes 8,192 expert-layer pairs with corresponding Pearson (r) correlation coefficients. Trend lines (dashed) indicate the linear regression slope. H. Discussion and Limitations Our work reveals that the fundamental bottleneck in heterogeneous federated fine-tuning lies not in adapter … view at source ↗
read the original abstract

Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to $45.0\%$ computational reduction on low-resource clients while improving their performance by $8.7 \times$ compared to existing heterogeneous LoRA-rank methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Universally Balanced Sparse Mixture-of-Experts (UB-SMoE) for resource-adaptive federated fine-tuning of foundation models. It identifies two discordances in naive SMoE application to heterogeneous federated settings—expert utilization imbalance and non-differentiability of Top-K routing—and provides a convergence analysis showing these degrade performance especially for low-resource clients. Dynamic Modulated Routing (DMR) is introduced to rebalance expert utilization and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts, forming a self-reinforcing cycle. Experiments on benchmarks report up to 45% computational reduction on low-resource clients and 8.7× performance improvement relative to heterogeneous LoRA-rank methods, with controls for client resources and ablations isolating each component.

Significance. If the results hold, this work advances federated fine-tuning of large models by enabling effective sparse conditional computation across heterogeneous devices, substantially lowering the burden on resource-constrained clients while improving performance. The convergence analysis supplies theoretical grounding for the mechanisms, and the experiments include resource-level controls plus ablations that isolate DMR and PG contributions. Credit is given for the explicit derivation of mechanisms from identified discordances and for maintaining expert viability without introducing new instabilities.

major comments (2)
  1. [§3] §3 (Convergence Analysis): the analysis establishes that the two discordances degrade convergence for low-resource clients, yet the quantitative bound or rate at which DMR + PG restores viability and yields the reported 45% compute reduction is not derived explicitly; a direct link from the corrected gradient flow to the observed savings would strengthen the central claim.
  2. [§5] §5 (Experiments): performance tables report the 8.7× improvement and consistent gains across benchmarks, but lack error bars, number of independent runs, or data-exclusion criteria; without these, the statistical reliability of the cross-client and cross-baseline claims cannot be fully verified from the text.
minor comments (2)
  1. [§3.3] The notation and update rules for DMR modulation strength and PG reconstruction could be presented with a compact algorithm box or diagram to improve readability of the self-reinforcing cycle.
  2. [§2] A few sentences in the related-work section would benefit from explicit comparison to recent non-federated SMoE balancing techniques to clarify the federated-specific novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with clarifications and proposed changes to strengthen the manuscript.

read point-by-point responses
  1. Referee: §3 (Convergence Analysis): the analysis establishes that the two discordances degrade convergence for low-resource clients, yet the quantitative bound or rate at which DMR + PG restores viability and yields the reported 45% compute reduction is not derived explicitly; a direct link from the corrected gradient flow to the observed savings would strengthen the central claim.

    Authors: We appreciate this observation. Section 3 derives that the discordances (imbalanced utilization and non-differentiable routing) produce suboptimal gradient flow and slower convergence for low-resource clients. The mechanisms DMR and PG are shown to restore balanced utilization and differentiable signals, forming the self-reinforcing cycle. While an explicit closed-form bound tying the corrected flow directly to the empirical 45% compute reduction is not provided (the savings are measured experimentally under controlled resource heterogeneity), the analysis supplies the necessary theoretical grounding. In revision we will insert a short discussion paragraph after Theorem 3.2 that qualitatively connects the restored per-expert gradient norms to the observed reduction in active parameters on low-resource clients, thereby tightening the theory-experiment link without altering the core proofs. revision: partial

  2. Referee: §5 (Experiments): performance tables report the 8.7× improvement and consistent gains across benchmarks, but lack error bars, number of independent runs, or data-exclusion criteria; without these, the statistical reliability of the cross-client and cross-baseline claims cannot be fully verified from the text.

    Authors: This is a valid point. All reported results were obtained from five independent runs with distinct random seeds for client sampling, data shuffling, and initialization; means and standard deviations were computed but omitted from the tables for space. No data points were excluded. In the revised manuscript we will augment the tables in §5 with error bars (mean ± std), explicitly state the number of runs, and add a sentence in the experimental setup clarifying the absence of exclusion criteria. These additions will make the statistical reliability transparent while preserving the existing figures and tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper identifies two specific discordances (expert utilization imbalance and Top-K non-differentiability) through convergence analysis, then derives DMR and PG mechanisms to restore balance and gradient flow. These steps are motivated by explicitly stated problems rather than being defined in terms of the reported performance gains or fitted to target metrics by construction. No load-bearing self-citations, self-definitional loops, or fitted inputs renamed as predictions appear in the derivation. Experiments include controls for client resources and ablations isolating components, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

Central claim depends on two newly introduced mechanisms whose internal parameters and convergence assumptions are not detailed in the abstract; effectiveness is asserted via experiments whose full controls are unavailable.

free parameters (1)
  • modulation strength and routing bias parameters in DMR
    Hyperparameters that control how strongly routing is adjusted to achieve balance; their values are chosen or fitted to heterogeneous client data.
axioms (1)
  • domain assumption Convergence analysis assumptions that the self-reinforcing cycle of DMR and PG maintains expert viability
    Invoked to link the proposed fixes to stable training across resource levels.
invented entities (2)
  • Dynamic Modulated Routing (DMR) no independent evidence
    purpose: Rebalance expert utilization across heterogeneous clients
    New routing procedure introduced to counteract imbalance.
  • Universal Pseudo-Gradient (PG) no independent evidence
    purpose: Supply learning signals to non-activated experts
    New surrogate gradient construction to address non-differentiability of Top-K routing.

pith-pipeline@v0.9.0 · 5759 in / 1512 out tokens · 65788 ms · 2026-05-20T19:01:52.402420+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 11 internal anchors

  1. [1]

    International Conference on Learning Representations , year=

    LoRA: Low-Rank Adaptation of Large Language Models , author=. International Conference on Learning Representations , year=

  2. [2]

    International Conference on Machine Learning , pages=

    LoRA+: Efficient Low Rank Adaptation of Large Models , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  3. [3]

    Nature Machine Intelligence , volume=

    Parameter-efficient fine-tuning of large-scale pre-trained language models , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

  4. [4]

    The Twelfth International Conference on Learning Representations , year=

    Improving LoRA in Privacy-preserving Federated Learning , author=. The Twelfth International Conference on Learning Representations , year=

  5. [5]

    arXiv preprint arXiv:2504.21099 , year=

    A survey on parameter-efficient fine-tuning for foundation models in federated learning , author=. arXiv preprint arXiv:2504.21099 , year=

  6. [6]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  7. [7]

    Federated fine-tuning of large language models under heterogeneous tasks and client resources

    Federated fine-tuning of large language models under heterogeneous tasks and client resources , author=. arXiv preprint arXiv:2402.11505 , year=

  8. [8]

    arXiv preprint arXiv:2502.15436 , year=

    Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning , author=. arXiv preprint arXiv:2502.15436 , year=

  9. [9]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  10. [10]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Feddat: An approach for foundation model finetuning in multi-modal heterogeneous federated learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  11. [11]

    arXiv preprint arXiv:2506.09199 , year=

    FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models , author=. arXiv preprint arXiv:2506.09199 , year=

  12. [12]

    ICLR 2025 Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning , year=

    Revisiting Sparse Mixture of Experts for Resource-adaptive Federated Fine-tuning Foundation Models , author=. ICLR 2025 Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning , year=

  13. [13]

    International Conference on Learning Representations , year=

    On the Convergence of FedAvg on Non-IID Data , author=. International Conference on Learning Representations , year=

  14. [14]

    International Conference on Artificial Intelligence and Statistics , pages=

    Communication-Efficient Learning of Deep Networks from Decentralized Data , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2017 , organization=

  15. [15]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    St-moe: Designing stable and transferable sparse expert models , author=. arXiv preprint arXiv:2202.08906 , year=

  16. [16]

    Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models, 2025 a

    Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models , author=. arXiv preprint arXiv:2501.11873 , year=

  17. [17]

    Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023) , year=

    Sparse Backpropagation for MoE Training , author=. Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023) , year=

  18. [18]

    Workshop on Machine Learning and Compression, NeurIPS 2024 , year=

    Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts , author=. Workshop on Machine Learning and Compression, NeurIPS 2024 , year=

  19. [19]

    1998 , publisher=

    Mean value theorems and functional equations , author=. 1998 , publisher=

  20. [20]

    Illinois Journal of Mathematics , volume=

    A converse to the dominated convergence theorem , author=. Illinois Journal of Mathematics , volume=. 1963 , publisher=

  21. [21]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  22. [22]

    arXiv preprint arXiv:2501.13985 , year=

    Pilot: Building the Federated Multimodal Instruction Tuning Framework , author=. arXiv preprint arXiv:2501.13985 , year=

  23. [23]

    Advances in Neural Information Processing Systems , volume=

    Federated learning from vision-language foundation models: Theoretical analysis and method , author=. Advances in Neural Information Processing Systems , volume=

  24. [24]

    Proceedings of the ACM Web Conference 2023 , pages=

    Pfedprompt: Learning personalized prompt for vision-language models in federated learning , author=. Proceedings of the ACM Web Conference 2023 , pages=

  25. [25]

    Advances in Mathematics , volume=

    Best constants in Young's inequality, its converse, and its generalization to more than three functions , author=. Advances in Mathematics , volume=. 1976 , publisher=

  26. [26]

    Advances in Neural Information Processing Systems , volume=

    Bridging discrete and backpropagation: Straight-through and beyond , author=. Advances in Neural Information Processing Systems , volume=

  27. [27]

    Towards Building the

    Zhang, Jianyi and Vahidian, Saeed and Kuo, Martin and Li, Chunyuan and Zhang, Ruiyi and Yu, Tong and Wang, Guoyin and Chen, Yiran , booktitle=. Towards Building the. 2024 , organization=

  28. [28]

    arXiv preprint arXiv:2405.17267 , year=

    FedHPL: Efficient Heterogeneous Federated Learning with Prompt Tuning and Logit Distillation , author=. arXiv preprint arXiv:2405.17267 , year=

  29. [29]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Efficient model personalization in federated learning via client-specific prompt generation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  30. [30]

    Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities , year=

    FedSelect: Customized Selection of Parameters for Fine-Tuning during Personalized Federated Learning , author=. Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities , year=

  31. [31]

    International conference on machine learning , pages=

    Scaffold: Stochastic controlled averaging for federated learning , author=. International conference on machine learning , pages=. 2020 , organization=

  32. [32]

    International Conference on Learning Representations , year=

    On the convergence of fedavg on non-iid data , author=. International Conference on Learning Representations , year=

  33. [33]

    Advances in Neural Information Processing Systems , volume=

    Convergence analysis of sequential federated learning on heterogeneous data , author=. Advances in Neural Information Processing Systems , volume=

  34. [34]

    International conference on machine learning , pages=

    A unified theory of decentralized SGD with changing topology and local updates , author=. International conference on machine learning , pages=. 2020 , organization=

  35. [35]

    , year =

    On the convergence of SGD with biased gradients , author=. arXiv preprint arXiv:2008.00051 , year=

  36. [36]

    Advances in Neural Information Processing Systems , volume=

    Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning , author=. Advances in Neural Information Processing Systems , volume=

  37. [37]

    Conference on Learning Theory , pages=

    Non-asymptotic analysis of biased stochastic approximation scheme , author=. Conference on Learning Theory , pages=. 2019 , organization=

  38. [38]

    SIAM review , volume=

    Optimization methods for large-scale machine learning , author=. SIAM review , volume=. 2018 , publisher=

  39. [39]

    International conference on machine learning , pages=

    SGD: General analysis and improved rates , author=. International conference on machine learning , pages=. 2019 , organization=

  40. [40]

    Advances in Neural Information Processing Systems , volume=

    A guide through the zoo of biased SGD , author=. Advances in Neural Information Processing Systems , volume=

  41. [41]

    2013 , publisher=

    Introductory lectures on convex optimization: A basic course , author=. 2013 , publisher=

  42. [42]

    Linear convergence of gradient and proximal-gradient methods under the polyak-

    Karimi, Hamed and Nutini, Julie and Schmidt, Mark , booktitle=. Linear convergence of gradient and proximal-gradient methods under the polyak-. 2016 , organization=

  43. [43]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer , author=. arXiv preprint arXiv:1701.06538 , year=

  44. [44]

    Journal of Machine Learning Research , volume=

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity , author=. Journal of Machine Learning Research , volume=

  45. [45]

    International conference on machine learning , pages=

    Glam: Efficient scaling of language models with mixture-of-experts , author=. International conference on machine learning , pages=. 2022 , organization=

  46. [46]

    Mixtral of Experts

    Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

  47. [47]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model , author=. arXiv preprint arXiv:2405.04434 , year=

  48. [48]

    2021 , url=

    Dmitry Lepikhin and HyoukJoong Lee and Yuanzhong Xu and Dehao Chen and Orhan Firat and Yanping Huang and Maxim Krikun and Noam Shazeer and Zhifeng Chen , booktitle=. 2021 , url=

  49. [49]

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Auxiliary-loss-free load balancing strategy for mixture-of-experts , author=. arXiv preprint arXiv:2408.15664 , year=

  50. [50]

    Han and Yuan Zhong , booktitle=

    X.Y. Han and Yuan Zhong , booktitle=. A Theoretical Framework for Auxiliary-Loss-Free Load-Balancing of Sparse Mixture-of-Experts in Large-Scale. 2025 , url=

  51. [51]

    2409.12136 , archivePrefix=

    Grin: Gradient-informed moe , author=. arXiv preprint arXiv:2409.12136 , year=

  52. [52]

    ReMoE: Fully Differentiable Mixture-of-Experts with Re

    Ziteng Wang and Jun Zhu and Jianfei Chen , booktitle=. ReMoE: Fully Differentiable Mixture-of-Experts with Re. 2025 , url=

  53. [53]

    arXiv preprint arXiv:2504.12463 , year=

    Dense backpropagation improves training for sparse mixture-of-experts , author=. arXiv preprint arXiv:2504.12463 , year=

  54. [54]

    A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning , volume =

    Liu, Bo and Feng, Xidong and Ren, Jie and Mai, Luo and Zhu, Rui and Zhang, Haifeng and Wang, Jun and Yang, Yaodong , booktitle =. A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning , volume =

  55. [55]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  56. [56]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

  57. [57]

    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=

    HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=

  58. [58]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Socialiqa: Commonsense reasoning about social interactions , author=. arXiv preprint arXiv:1904.09728 , year=

  59. [59]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

  60. [60]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Boolq: Exploring the surprising difficulty of natural yes/no questions , author=. arXiv preprint arXiv:1905.10044 , year=

  61. [61]

    Wino G rande: An adversarial winograd schema challenge at scale

    Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin , title =. 2021 , issue_date =. doi:10.1145/3474381 , journal =

  62. [62]

    First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models , year=

    FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models , author=. First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models , year=

  63. [63]

    arXiv preprint arXiv:2411.19557 , year=

    Initialization using update approximation is a silver bullet for extremely efficient low-rank fine-tuning , author=. arXiv preprint arXiv:2411.19557 , year=

  64. [64]

    OLMoE: Open Mixture-of-Experts Language Models

    Olmoe: Open mixture-of-experts language models , author=. arXiv preprint arXiv:2409.02060 , year=

  65. [65]

    Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

    OLMo: Accelerating the science of language models , author=. Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

  66. [66]

    Bidirectional encoder representations from transformers (bert) for question answering in the telecom domain.: Adapting a bert-like language model to the telecom domain using the electra pre-training approach , author=

  67. [67]

    Advances in Neural Information Processing Systems , volume=

    Exact and linear convergence for federated learning under arbitrary client participation is attainable , author=. Advances in Neural Information Processing Systems , volume=

  68. [68]

    Neurocomputing , volume=

    Analysis of regularized federated learning , author=. Neurocomputing , volume=. 2025 , publisher=

  69. [69]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    EFSkip: A new error feedback with linear speedup for compressed federated learning with arbitrary data heterogeneity , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  70. [70]

    Advances in Neural Information Processing Systems , volume=

    Every parameter matters: Ensuring the convergence of federated learning with dynamic heterogeneous models reduction , author=. Advances in Neural Information Processing Systems , volume=

  71. [71]

    Advances in Neural Information Processing Systems , volume=

    Convergence analysis of split federated learning on heterogeneous data , author=. Advances in Neural Information Processing Systems , volume=

  72. [72]

    arXiv preprint arXiv:2404.08003 , year=

    Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis , author=. arXiv preprint arXiv:2404.08003 , year=

  73. [73]

    ACM Transactions on Knowledge Discovery from Data , volume=

    Convergence-Guaranteed Federated Learning through Gradient Trajectory Smoothing with Triple-Objective Decomposition , author=. ACM Transactions on Knowledge Discovery from Data , volume=. 2025 , publisher=

  74. [74]

    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

    Transformer feed-forward layers are key-value memories , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

  75. [75]

    Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification

    Measuring the effects of non-identical data distribution for federated visual classification , author=. arXiv preprint arXiv:1909.06335 , year=

  76. [76]

    International conference on machine learning , pages=

    Bayesian nonparametric federated learning of neural networks , author=. International conference on machine learning , pages=. 2019 , organization=

  77. [77]

    ACM Computing Surveys , volume=

    Federated learning for computationally constrained heterogeneous devices: A survey , author=. ACM Computing Surveys , volume=. 2023 , publisher=

  78. [78]

    Scaling Laws for Neural Language Models

    Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=