Recognition: 2 theorem links
· Lean Theorem: Federated Learning with Non-IID Data
Pith reviewed 2026-05-16 10:18 UTC · model grok-4.3
The pith
A small globally shared data subset recovers much of the accuracy lost to non-IID data in federated learning, up to 30% on CIFAR-10.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When each client trains only on samples from one class, federated averaging produces models whose accuracy falls by more than half compared with the IID case. The divergence of local weight vectors grows in proportion to the earth mover's distance between the device's class distribution and the population distribution. Introducing a globally shared data subset whose size is only a few percent of the total training set allows the averaged model to recover most of the lost accuracy.
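To make that quantity concrete, the short sketch below (our illustration, not code from the paper) measures how far a single-class client's label distribution sits from a uniform population distribution. The function names are ours, and the distance is computed as a summed absolute difference over class proportions, since the label space has no ground metric.

```python
import numpy as np

def class_distribution(labels, num_classes):
    """Empirical distribution over class labels held by one client."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

def divergence_from_population(client_dist, population_dist):
    """Distance between a client's class distribution and the population
    distribution, taken here as the summed absolute difference over classes
    (an assumption of this sketch: with no ground metric between labels,
    the per-class differences are simply accumulated)."""
    return np.abs(client_dist - population_dist).sum()

# Toy example: a client holding only class 3 out of 10 classes,
# versus a uniform population distribution (CIFAR-10-like).
num_classes = 10
population = np.full(num_classes, 1.0 / num_classes)
one_class_client = class_distribution(np.full(5000, 3), num_classes)
print(divergence_from_population(one_class_client, population))  # ~1.8, the maximum skew
```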
What carries the argument
A small globally shared data subset that every client mixes with its local non-IID examples during training to reduce weight divergence.
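A minimal sketch of that mixing step, assuming each client's data lives in NumPy arrays; the helper name and the mix_fraction knob are illustrative choices, not the paper's implementation.

```python
import numpy as np

def augment_with_shared(local_x, local_y, shared_x, shared_y,
                        mix_fraction=1.0, rng=None):
    """Fold a portion of the globally shared subset into one client's local
    (non-IID) training data before local SGD. mix_fraction controls how much
    of the shared pool this client receives; 1.0 hands it the full subset.
    Illustrative helper only, not the paper's implementation."""
    rng = rng or np.random.default_rng(0)
    n_shared = int(mix_fraction * len(shared_x))
    idx = rng.choice(len(shared_x), size=n_shared, replace=False)
    mixed_x = np.concatenate([local_x, shared_x[idx]])
    mixed_y = np.concatenate([local_y, shared_y[idx]])
    perm = rng.permutation(len(mixed_x))  # shuffle so batches mix both sources
    return mixed_x[perm], mixed_y[perm]
```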
Load-bearing premise
That a small globally shared data subset can be created and distributed without violating the privacy or regulatory constraints that motivated federated learning in the first place.
What would settle it
Re-training the CIFAR-10 models with exactly 0% versus 5% shared data and checking whether the accuracy difference reaches approximately 30%.
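One way to set up that check, sketched under assumptions of our own (stratified sampling for the shared pool, ten clients, one class per client); only shared_fraction changes between the 0% and 5% runs.

```python
import numpy as np

def make_partitions(labels, shared_fraction=0.05, num_clients=10, rng=None):
    """Carve a stratified shared subset out of the full training set, then
    assign the remaining samples so each client holds a single class.
    Returns (shared_indices, {client_id: indices}). Illustrative only: the
    stratification and one-class-per-client assignment are assumptions here."""
    rng = rng or np.random.default_rng(0)
    shared, clients = [], {c: [] for c in range(num_clients)}
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        n_shared = int(shared_fraction * len(idx))
        shared.extend(idx[:n_shared])
        clients[int(c) % num_clients].extend(idx[n_shared:])  # extreme one-class skew
    return np.array(shared), {k: np.array(v) for k, v in clients.items()}

# CIFAR-10-style labels (10 classes, 50,000 samples); only shared_fraction
# would differ between the 0% and 5% runs.
labels = np.repeat(np.arange(10), 5000)
shared_idx, client_idx = make_partitions(labels, shared_fraction=0.05)
```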
Original abstract
Federated learning enables resource-constrained edge compute devices, such as mobile phones and IoT devices, to learn a shared model for prediction, while keeping the training data local. This decentralized approach to train models provides privacy, security, regulatory and economic benefits. In this work, we focus on the statistical challenge of federated learning when local data is non-IID. We first show that the accuracy of federated learning reduces significantly, by up to 55% for neural networks trained for highly skewed non-IID data, where each client device trains only on a single class of data. We further show that this accuracy reduction can be explained by the weight divergence, which can be quantified by the earth mover's distance (EMD) between the distribution over classes on each device and the population distribution. As a solution, we propose a strategy to improve training on non-IID data by creating a small subset of data which is globally shared between all the edge devices. Experiments show that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that non-IID data distributions in federated learning cause large accuracy drops (up to 55% for neural networks when each client holds data from only one class), that this degradation can be explained by weight divergence quantified via Earth Mover's Distance (EMD) between local and population class distributions, and that sharing a small (5%) globally representative data subset across clients recovers up to 30% accuracy on CIFAR-10.
Significance. If the mitigation holds under realistic constraints, the work supplies useful early empirical baselines on the severity of non-IID effects in federated learning and a simple heuristic for partial recovery. The EMD diagnostic is interpretable and the reported gains on standard vision benchmarks are practically relevant, though the privacy feasibility of the shared-subset construction remains the central open question.
major comments (1)
- Abstract and proposed strategy: the reported 30% accuracy gain on CIFAR-10 is obtained only after a 5% globally shared subset has been sampled from the full centralized training distribution and broadcast to every client. No procedure is given for constructing an equivalent representative subset when data never leaves the clients, directly contradicting the privacy premise stated in the introduction.
minor comments (1)
- Experiments section: the abstract states clear empirical drops and recoveries but provides no error bars, number of random seeds, or ablation details on the EMD-weight-divergence causal link.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting an important practical limitation in our proposed strategy. We address the concern directly below and will revise the manuscript to clarify the assumptions and limitations.
Point-by-point responses
- Referee: Abstract and proposed strategy: the reported 30% accuracy gain on CIFAR-10 is obtained only after a 5% globally shared subset has been sampled from the full centralized training distribution and broadcast to every client. No procedure is given for constructing an equivalent representative subset when data never leaves the clients, directly contradicting the privacy premise stated in the introduction.
Authors: We agree that the experimental construction of the 5% shared subset relies on sampling from the full centralized training distribution, which assumes a form of global access not available under strict per-client data isolation. This is a genuine limitation of the current presentation. In the revised manuscript we will (1) explicitly qualify the abstract and introduction to state that the shared subset is created from a centralized view in our experiments, (2) add a dedicated limitations paragraph discussing how the subset could be obtained in practice (e.g., via a small public dataset drawn from a similar distribution, synthetic data, or a trusted curator), and (3) note that the approach therefore represents a practical heuristic that relaxes the strictest privacy model rather than a fully decentralized solution. These changes will remove any implication of contradiction with the privacy premise.
Revision: yes
Circularity Check
No significant circularity; empirical results independent of inputs
full rationale
The paper's core claims rest on experimental measurements of accuracy under non-IID partitions and the effect of adding a shared data subset. No derivation chain, equations, or 'predictions' are presented that reduce by construction to fitted parameters or self-referential definitions. The weight divergence explanation invokes EMD as a standard metric between class distributions and the global one, without deriving EMD from the target result. No self-citations are load-bearing for uniqueness or ansatz. The reported accuracy gains are direct measurements conditional on the experimental protocol, not tautological outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- shared_data_fraction
axioms (1)
- domain assumption: Local data distributions are fixed and known for the purpose of computing EMD to the global distribution.
Forward citations
Cited by 19 Pith papers
- Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
  HeroCrystal uses single-image diffusion synthesis, probabilistic federated Faster R-CNN with contrastive debiasing, and inconsistent-category integration to reach 33.4% mAP in privacy-preserving multi-camera object detection.
- Byzantine-Robust Distributed SGD: A Unified Analysis and Tight Error Bounds
  Unified convergence rates and tight lower bounds for Byzantine-robust distributed SGD under stochasticity and general data heterogeneity, showing local momentum reduces stochastic error floors.
- FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning
  FedVSSAM mitigates flatness incompatibility in SAM-based federated learning by consistently using a variance-suppressed adjusted direction for local perturbation, descent, and global updates, with non-convex convergen...
- ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
  ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
- Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
  AW-PSP dynamically weights node sampling by real-time availability predictions and failure correlations to improve robustness, label coverage, and fairness in federated learning under correlated device failures.
- HierFedCEA: Hierarchical Federated Edge Learning for Privacy-Preserving Climate Control Optimization Across Heterogeneous Controlled Environment Agriculture Facilities
  HierFedCEA delivers a hierarchical federated learning framework for privacy-preserving climate control optimization across heterogeneous CEA facilities, reaching 94% of centralized performance with under 1 MB communication.
- Client-Conditional Federated Learning via Local Training Data Statistics
  Conditioning a global FL model on local PCA statistics of client data matches oracle cluster performance across heterogeneous settings and is robust to sparse data with zero added communication.
- Practical Quantum Federated Learning for Privacy-Sensitive Healthcare: Communication Efficiency and Noise Resilience
  Hybrid QFL cuts quantum transmissions from 3TNMP to {3t + 2(T-t)}NMP over T rounds while preserving near-centralized convergence and improving depolarizing-noise resilience via decentralized aggregation and Steane-code QEC.
- Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks
  Fed-Listing infers client label proportions in FedGNNs from final-layer gradients, outperforming baselines on four datasets and three architectures even in non-i.i.d. settings.
- FedSurrogate: Backdoor Defense in Federated Learning via Layer Criticality and Surrogate Replacement
  FedSurrogate defends federated learning against backdoors by clustering on security-critical layers and substituting malicious updates with benign surrogates, reporting false-positive rates below 10% and attack succes...
- Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
  Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
- CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification
  CLAD is a clustered federated learning framework with a dual-mode architecture for joint anomaly detection and attack classification in IoT using labeled and unlabeled data.
- Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
  HeroCrystal achieves 33.4% mAP on cross-domain multi-camera object detection by combining one-shot diffusion-based synthetic data generation, probabilistic federated Faster R-CNN, and inconsistent-category distillatio...
- FMCL: Class-Aware Client Clustering with Foundation Model Representations for Heterogeneous Federated Learning
  FMCL performs one-shot class-aware client clustering in heterogeneous federated learning by deriving semantic signatures from foundation model embeddings and using cosine distance, yielding improved performance and st...
- Evaluating Federated Learning approaches for mammography under breast density heterogeneity
  FedAvg matches centralized training accuracy on mammography data split by breast density heterogeneity, showing standard FL can handle this clinical variation without special fixes.
- FedKPer: Tackling Generalization and Personalization in Medical Federated Learning via Knowledge Personalization
  FedKPer improves the generalization-personalization trade-off in medical federated learning via local knowledge personalization and selective aggregation that emphasizes reliable updates.
- Automating aggregation strategy selection in federated learning
  A framework automates federated learning aggregation strategy selection via LLM inference in single-trial mode and genetic search in multi-trial mode, improving robustness under non-IID data.
- Privacy-Preserving Federated Learning: Integrating Zero-Knowledge Proofs in Scalable Distributed Architectures
  A hybrid federated learning architecture using zero-knowledge proofs for computation verification retains 94.2% accuracy under adversarial conditions across 1,000 nodes.
- Split and Aggregation Learning for Foundation Models Over Mobile Embodied AI Network (MEAN): A Comprehensive Survey
  The paper surveys split and aggregation learning for foundation models in 6G networks to improve efficiency, resource use, and data privacy in distributed AI.
Reference graph
Works this paper leans on
- [1] Y. Zhang, N. Suda, L. Lai, and V. Chandra, "Hello Edge: Keyword spotting on microcontrollers," arXiv preprint arXiv:1711.07128, 2017.
- [2] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," arXiv preprint arXiv:1801.06601, 2018.
- [3] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Int. Conf. on Artificial Intelligence and Statistics, 2017.
- [4] J. Konečný, B. McMahan, and D. Ramage, "Federated optimization: Distributed optimization beyond the datacenter," arXiv preprint arXiv:1511.03575, 2015.
- [5] H. B. McMahan and D. Ramage, "Federated learning: Collaborative machine learning without centralized training data," Google, 2017.
- [6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, pp. 2278–2324, 1998.
- [7] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical Report, University of Toronto, 2009.
- [8] W. Shakespeare, "The complete works of William Shakespeare," https://www.gutenberg.org/ebooks/100.
- [9] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, "Practical secure aggregation for privacy-preserving machine learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191, ACM, 2017.
- [10] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
- [11] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," arXiv preprint arXiv:1712.01887, 2017.
- [12] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
- [13] T. Tieleman and G. Hinton, "Divide the gradient by a running average of its recent magnitude," Coursera: Neural Networks for Machine Learning, technical report. Available online: https://zh.coursera.org/learn/neuralnetworks/lecture/YQHki/rmsprop-divide-the-gradient-by-a-running-average-of-its-recent-magnitude (accessed 21 April 2017).
- [14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
- [15] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010, pp. 177–186, Springer, 2010.
- [16] A. Rakhlin, O. Shamir, K. Sridharan, et al., "Making gradient descent optimal for strongly convex stochastic optimization," in ICML, Citeseer, 2012.
- [17] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al., "Large scale distributed deep networks," in Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
- [18] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
- [19] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, "Federated multi-task learning," in Advances in Neural Information Processing Systems, pp. 4427–4437, 2017.
- [20] P. Warden, "Speech Commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.
- [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
- [22] B. Graham, "Fractional max-pooling," arXiv preprint arXiv:1412.6071, 2014.
- [23] I. J. Goodfellow, O. Vinyals, and A. M. Saxe, "Qualitatively characterizing neural network optimization problems," arXiv preprint arXiv:1412.6544, 2014.
discussion (0)