pith. machine review for the scientific record.

arxiv: 2605.08992 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity


Pith reviewed 2026-05-12 02:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords federated learning · foundation models · heterogeneity · fairness · worst-client accuracy · label skew · LoRA · text classification

The pith

Foundation model priors amplify worst-client accuracy gaps under extreme federated label skew.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that in federated text classification, large foundation models like DistilBERT with LoRA create substantially larger performance gaps between best- and worst-performing clients when data is extremely non-IID. At the highest heterogeneity level tested, the foundation model produces a 50.1 percent worst-client gap compared with 32.2 percent for a much smaller TextCNN model, even though it has 25 times more parameters and was pretrained on massive data. The effect reverses under milder heterogeneity, where the foundation model nearly eliminates the gap. The authors term this reversal the FM Fairness Paradox and show that reweighting client updates during aggregation does not fix it.

Core claim

Under extreme label skew (Dirichlet alpha = 0.1), DistilBERT+LoRA yields a worst-client accuracy gap of 50.1 percent, 56 percent larger than the 32.2 percent gap produced by TextCNN, despite 25 times more parameters and extensive pretraining. At moderate heterogeneity (alpha >= 0.5) the pattern inverts and the foundation model reduces disparity. An inverse-weighted LoRA aggregation variant fails to close the gap, indicating that the priors embedded in the foundation model can systematically disadvantage clients with atypical label distributions.
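The headline numbers can be reproduced arithmetically. A minimal sketch of the gap metric and the relative-increase calculation; the per-client accuracy values below are illustrative placeholders, not data from the paper — only the resulting gaps match the reported figures:

```python
def worst_client_gap(client_accuracies):
    """Worst-client accuracy gap: best minus worst per-client test
    accuracy, in percentage points."""
    return max(client_accuracies) - min(client_accuracies)

# Illustrative per-client accuracies (percent); chosen only so the gaps
# match the paper's reported 50.1 and 32.2 points.
fm_gap = worst_client_gap([95.0, 78.3, 62.1, 44.9])   # 50.1 points
cnn_gap = worst_client_gap([81.4, 70.0, 55.6, 49.2])  # 32.2 points

# "56% larger" is the relative increase of the FM gap over the CNN gap:
# (50.1 - 32.2) / 32.2 ≈ 0.556.
relative_increase = (fm_gap - cnn_gap) / cnn_gap
```

This makes explicit that "56% larger" is a relative comparison of the two gaps, not an absolute accuracy difference.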

What carries the argument

The FM Fairness Paradox, measured by comparing worst-client accuracy between a small CNN and a LoRA-adapted foundation model across four levels of Dirichlet-controlled label skew.
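Dirichlet-controlled label skew is the standard recipe for generating Non-IID federated partitions. The paper's exact sampling procedure is not given in this summary, so the following is a sketch of the usual construction, assuming per-class client proportions drawn from Dirichlet(alpha):

```python
import random
from collections import defaultdict

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients with Dirichlet(alpha) label skew.

    Low alpha (e.g. 0.1) concentrates each class on a few clients
    (extreme heterogeneity); large alpha approaches an IID split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    clients = [[] for _ in range(n_clients)]
    for cls_indices in by_class.values():
        rng.shuffle(cls_indices)
        # Draw per-client proportions from Dirichlet(alpha) by normalizing
        # independent Gamma(alpha, 1) samples.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(gammas)
        cum, start = 0.0, 0
        for c in range(n_clients):
            cum += gammas[c] / total
            end = len(cls_indices) if c == n_clients - 1 else int(cum * len(cls_indices))
            clients[c].extend(cls_indices[start:end])
            start = end
    return clients
```

At alpha = 0.1 most clients end up seeing only one or two classes, which is the regime where the paper reports the 50.1% gap.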

If this is right

  • Under extreme heterogeneity, foundation models increase rather than reduce accuracy disparity for the most disadvantaged clients.
  • Moderate heterogeneity allows the same foundation models to improve fairness across clients.
  • Standard aggregation reweighting is insufficient to counteract the effect.
  • High-stakes federated deployments such as healthcare or education require explicit mechanisms to protect minority clients.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Client-specific regularization or prompt-based adaptation may be needed to offset the influence of shared priors.
  • The same pattern could appear when foundation models are fine-tuned on image or speech data under extreme non-IID partitions.
  • Testing the effect across multiple foundation model families would clarify whether the disparity scales with model size or pretraining corpus.

Load-bearing premise

The larger worst-client disparity is caused by the foundation model's pretraining priors rather than by LoRA adaptation, differences in model architecture, or the specific way heterogeneity was generated.

What would settle it

Re-running the same controlled experiments on a different foundation model or with full fine-tuning instead of LoRA and observing no increase (or a decrease) in the worst-client gap at alpha = 0.1 would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.08992 by Kiran Naseer, Umar Shoaib.

Figure 1. The FM Fairness Paradox: DistilBERT+LoRA worsens … [figure image omitted]
Figure 3. Training trajectories for both models under extreme heterogeneity. TextCNN's worst-client accuracy grows steadily from 0% at Round 1, peaking at 32.3% and converging to a stable floor. DistilBERT+LoRA's worst-client accuracy recovers more slowly from 0% (reaching only 5% by Round 6) and never stabilises, oscillating throughout training. [figure image omitted; axes: Communication Round vs. Accuracy]
Figure 5. FedAvg vs. FedAvgW (b = 0.5) at α = 0.1 over 20 rounds. Results for b = 0.1 are reported in … [figure image omitted]
original abstract

Federated learning (FL) is increasingly used to fine-tune foundation models (FMs) on distributed private data. The community largely assumes that large-scale pretraining serves as a 'rising tide that lifts all boats' in federated settings. However, our experiments reveal that these powerful priors can hinder rather than help the most disadvantaged clients under extreme heterogeneity. Through controlled experiments on federated text classification, we compare worst-client accuracy between TextCNN (2.7M parameters) and DistilBERT with Low-Rank Adaptation (LoRA, 66M parameters) across four Non-IID heterogeneity levels. Under extreme label skew (alpha = 0.1), DistilBERT+LoRA produces a worst-client accuracy gap of 50.1% -- 56% larger than TextCNN's 32.2% gap, despite having 25x more parameters and extensive pretraining. Under moderate heterogeneity (alpha >= 0.5), the pattern reverses: the FM nearly eliminates the gap. We call this the FM Fairness Paradox. We further show that an inverse-weighted LoRA aggregation method (FedAvgW) does not resolve the disparity, suggesting aggregation reweighting alone may be insufficient. Our results highlight the need for mechanisms that explicitly protect minority clients before deploying foundation models in high-stakes federated contexts such as healthcare and education.
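The abstract names FedAvgW as an inverse-weighted LoRA aggregation, but its exact weighting rule is not spelled out in this summary. A minimal sketch contrasting plain FedAvg with one plausible inverse weighting — weights proportional to accuracy^(-b), with the temperature b as in Figure 5; the specific formula here is an assumption, not the paper's definition:

```python
def fedavg(updates, sizes):
    """Standard FedAvg: data-size-weighted average of client updates
    (each update is a flat list of parameters)."""
    total = sum(sizes)
    return [sum(u[j] * n for u, n in zip(updates, sizes)) / total
            for j in range(len(updates[0]))]

def fedavg_w(updates, accuracies, b=0.5, eps=1e-8):
    """Hypothetical inverse weighting: clients with lower local accuracy
    receive proportionally larger aggregation weight, tempered by b."""
    raw = [(acc + eps) ** (-b) for acc in accuracies]
    total = sum(raw)
    return [sum(u[j] * (r / total) for u, r in zip(updates, raw))
            for j in range(len(updates[0]))]
```

Even under such reweighting, the paper reports that the alpha = 0.1 gap persists — the basis for its claim that aggregation reweighting alone is insufficient.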

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that foundation model priors amplify worst-client disparity in federated learning under extreme heterogeneity. Through text classification experiments, it shows that under Dirichlet label skew with alpha = 0.1, DistilBERT+LoRA (66M params) yields a 50.1% worst-client accuracy gap, 56% larger than the 32.2% gap of TextCNN (2.7M params), despite pretraining and greater capacity. The pattern reverses for alpha >= 0.5. An inverse-weighted aggregation (FedAvgW) fails to close the gap, leading to the 'FM Fairness Paradox' and a call for mechanisms protecting minority clients.

Significance. If the attribution to foundation model priors holds after isolating confounders, the result would be significant for federated fine-tuning of large models in heterogeneous, high-stakes domains. It provides empirical counter-evidence to the 'rising tide' assumption and motivates new fairness techniques beyond standard aggregation. The controlled comparison across heterogeneity levels is a strength, though the current design limits causal claims.

major comments (2)
  1. [Abstract and experimental design] The central claim attributes the 50.1% vs 32.2% worst-client gap under alpha=0.1 to foundation model priors. However, the design compares TextCNN (trained from scratch) against DistilBERT+LoRA (pretrained weights with low-rank adaptation). This confounds priors with architecture (CNN vs transformer), capacity (2.7M vs 66M), optimization, and adaptation method. No controls such as randomly initialized DistilBERT or matched-capacity non-pretrained models are described to isolate the priors' effect (Abstract; experimental sections).
  2. [Results] The reported accuracy gaps and the 56% larger disparity lack error bars, standard deviations across random seeds, or statistical tests. This makes it difficult to assess whether the FM Fairness Paradox observation is robust or reproducible (Results and tables).
minor comments (2)
  1. [Experimental setup] Expand the experimental protocol section with exact client count, local epochs, communication rounds, optimizer settings, and precise Dirichlet sampling procedure to enable full reproducibility.
  2. Consider adding a table or figure showing per-client accuracy distributions or additional heterogeneity metrics (e.g., label distribution entropy) to better illustrate the worst-client disparity.
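The heterogeneity metric suggested in the second minor comment can be computed directly; a sketch of per-client label-distribution entropy (lower entropy means a more skewed client):

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (bits) of one client's label distribution.

    A client holding a single class has entropy 0; a uniform spread
    over k classes has entropy log2(k)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Reporting this per client alongside the accuracy tables would make the link between skew and worst-client performance explicit.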

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental controls and statistical rigor that we will address in the revision to strengthen the attribution of results to foundation model priors.

point-by-point responses
  1. Referee: [Abstract and experimental design] The central claim attributes the 50.1% vs 32.2% worst-client gap under alpha=0.1 to foundation model priors. However, the design compares TextCNN (trained from scratch) against DistilBERT+LoRA (pretrained weights with low-rank adaptation). This confounds priors with architecture (CNN vs transformer), capacity (2.7M vs 66M), optimization, and adaptation method. No controls such as randomly initialized DistilBERT or matched-capacity non-pretrained models are described to isolate the priors' effect (Abstract; experimental sections).

    Authors: We agree that the experimental comparison confounds the effect of pretrained priors with differences in architecture, capacity, optimization, and adaptation method. The design was chosen to reflect realistic FL scenarios (small from-scratch models vs. practical FM fine-tuning with LoRA), and the larger gap despite greater capacity and pretraining already provides counter-evidence to the 'rising tide' assumption. To more cleanly isolate the priors, we will add control experiments with randomly initialized DistilBERT (no pretraining) and matched-capacity non-pretrained transformer baselines in the revised manuscript, allowing direct attribution while preserving the original comparisons. revision: yes

  2. Referee: [Results] The reported accuracy gaps and the 56% larger disparity lack error bars, standard deviations across random seeds, or statistical tests. This makes it difficult to assess whether the FM Fairness Paradox observation is robust or reproducible (Results and tables).

    Authors: We acknowledge that variability measures are essential for assessing robustness. In the revision, we will rerun all experiments across at least five random seeds, report standard deviations as error bars in tables and figures, and include statistical tests (e.g., paired t-tests) to evaluate the significance of the worst-client accuracy gaps and the 56% relative increase under alpha=0.1. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of measured accuracy gaps

full rationale

The paper reports direct experimental measurements of worst-client accuracy under controlled federated heterogeneity settings (alpha=0.1 and higher). It compares two distinct models (TextCNN trained from scratch vs. DistilBERT+LoRA) on the same task and data partitions, presenting the 50.1% vs 32.2% gap as an observed outcome rather than a derived prediction. No equations, parameter fits, self-referential definitions, or load-bearing self-citations are used to generate the central claim; the result does not reduce to its inputs by construction. The FM Fairness Paradox is simply a label for the empirical pattern. This is the expected non-finding for an empirical study whose claims rest on falsifiable measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on empirical measurements under controlled heterogeneity; no mathematical derivation or fitted parameters are used to produce the reported gaps.

axioms (1)
  • domain assumption: Non-IID client data distributions can be simulated using a Dirichlet distribution parameterized by alpha.
    The paper uses alpha values (0.1 and >=0.5) to define heterogeneity levels in the experiments.

pith-pipeline@v0.9.0 · 5550 in / 1247 out tokens · 39169 ms · 2026-05-12T02:36:05.494524+00:00 · methodology

