Discovering Collaboration from Novelty: Random Network Distillation for Clustered Federated Learning
Pith reviewed 2026-06-30 07:16 UTC · model grok-4.3
The pith
Clients use prediction errors from locally trained Random Network Distillation predictors to form data-similar clusters for federated learning before training begins.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Random Network Distillation predictor trained locally on each client's data produces a prediction error that reliably proxies similarity between client distributions. Using these errors, clients can discover meaningful groups prior to federated training without sharing raw data or repeatedly evaluating the main model. The resulting federations form dynamically from the local novelty estimates, supporting autonomous operation in settings where neither cluster count nor collaboration structure is specified beforehand.
What carries the argument
Random Network Distillation predictor: a compact neural network trained locally by each client to match the output of a fixed random network on its own data, with the prediction error serving as the novelty signal for inter-client similarity estimation.
If this is right
- Clustering is performed once before the main federated training loop, avoiding repeated integration costs.
- Neither the number of clusters nor the collaboration structure must be specified ahead of time.
- The method supplies a task-agnostic way to enable collaboration under non-IID data.
- Similarity estimation occurs without exchanging raw client data or additional main-model evaluations.
Where Pith is reading between the lines
- If the novelty signal remains stable across training rounds, clusters could be refreshed periodically without restarting the entire process.
- The same local-predictor technique might apply to other heterogeneous distributed learning settings beyond federated learning.
- Empirical checks on real deployments with unknown cluster counts would test whether the autonomy property holds in practice.
Load-bearing premise
The prediction error produced by a locally trained Random Network Distillation predictor on a client's data serves as a reliable proxy for similarity to other clients' distributions, without any shared data or evaluation of the main federated model.
What would settle it
Run the method on a controlled set of clients whose data distributions are known in advance; if the resulting clusters fail to group clients with matching distributions or incorrectly merge clients with dissimilar ones, the claim is falsified.
Figures
read the original abstract
Federated Learning often suffers under non-independently and identically distributed data, where a single global model may fail to represent the diversity of client distributions. Clustered Federated Learning mitigates this issue by training specialized models for groups of similar clients, but existing approaches often couple cluster assignment with the main training loop, increasing computational and communication costs. We propose a lightweight clustering approach based on Random Network Distillation. Each client trains a compact Random Network Distillation predictor on its local data and uses its prediction error as a novelty signal to estimate similarity with other clients. This enables the discovery of meaningful client groups before federated training, without sharing raw data or repeatedly evaluating the main model. Crucially, the resulting federations emerge from local novelty estimates at runtime, making the method suitable for autonomous large-scale distributed systems where neither the number of clusters nor the collaboration structure can be specified a priori. Overall, by decoupling clustering from learning, the method provides a task-agnostic and efficient mechanism for autonomous collaboration under non-independently and identically distributed data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a lightweight pre-training clustering method for clustered federated learning using Random Network Distillation. Each client trains a compact local RND predictor on its private data and employs the resulting prediction error as a novelty signal to estimate similarity with other clients. This enables discovery of client groups before any federated training begins, without sharing raw data or repeatedly evaluating the main model, and supports autonomous formation of federations where neither the number of clusters nor the collaboration structure is specified a priori.
Significance. If the RND novelty signal is shown to produce clusters that improve downstream FL performance, the decoupling of clustering from the training loop would offer a computationally and communication-efficient, task-agnostic approach suitable for large-scale autonomous systems under non-IID data.
major comments (1)
- [Abstract] Abstract: the claim that the locally computed RND prediction error serves as a reliable proxy for client similarity (and thus for forming useful federations) lacks any derivation, reference to RND properties, or argument establishing correlation with FL-relevant quantities such as gradient alignment or label-conditional shifts rather than input-space novelty alone. This assumption is load-bearing for the central claim that meaningful autonomous federations emerge at runtime.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract. We address the concern point-by-point below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the locally computed RND prediction error serves as a reliable proxy for client similarity (and thus for forming useful federations) lacks any derivation, reference to RND properties, or argument establishing correlation with FL-relevant quantities such as gradient alignment or label-conditional shifts rather than input-space novelty alone. This assumption is load-bearing for the central claim that meaningful autonomous federations emerge at runtime.
Authors: We agree that the abstract would benefit from an explicit reference to RND properties. The full paper motivates the approach by noting that the prediction error of a predictor trained on one client's data, when evaluated on another's, quantifies distributional mismatch; this is a direct consequence of the RND formulation (Burda et al., 2018), where the fixed random target network induces a feature space in which reconstruction error reflects novelty relative to the training distribution. In non-IID FL, such input-space distributional differences frequently coincide with the heterogeneity that clustered FL aims to address. We do not provide a new theoretical derivation linking the signal to gradient alignment, as the method is presented as an empirical, task-agnostic heuristic whose utility is validated through downstream FL performance gains. In the revision we will (i) insert a one-sentence justification and citation into the abstract and (ii) add a short paragraph in Section 3 clarifying the scope of the claim (empirical utility rather than proven equivalence to gradient-based similarity). revision: yes
Circularity Check
RND novelty signal applied directly as clustering heuristic without self-referential reduction
full rationale
The paper defines client similarity explicitly via the prediction error of a locally trained RND predictor and uses that signal for pre-training clustering. No equation, parameter fit, or self-citation is shown that would make the claimed emergence of useful federations equivalent to the input definition by construction. The core mechanism is a straightforward, task-agnostic application of an existing novelty-detection technique, with the decoupling benefit stated as a direct consequence of performing clustering before main-model training rather than derived from any tautological step.
Axiom & Free-Parameter Ledger
free parameters (1)
- RND predictor size and training details
axioms (1)
- domain assumption Local RND prediction error reliably indicates similarity between client data distributions
Reference graph
Works this paper leans on
-
[1]
Communication-efficient learning of deep networks from decentralized data,
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” inProceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, A. Singh and X. J. Zhu, Eds., vol. 54. PMLR, 2017, pp. 1273–1282
2017
-
[2]
Efficient participant contribution evaluation for horizontal and vertical federated learning,
Q. Li, Y . Diao, Q. Chen, and B. He, “Federated learning on non-iid data silos: An experimental study,” in38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022. IEEE, 2022, pp. 965–978. [Online]. Available: https://doi.org/10.1109/ICDE53745.2022.00077
-
[3]
Advances and open problems in federated learning,
P. Kairouz and et al., “Advances and open problems in federated learning,”Found. Trends Mach. Learn., vol. 14, no. 1-2, pp. 1–210,
-
[4]
[Online]. Available: https://doi.org/10.1561/2200000083
-
[5]
Deep learning in multiagent systems,
L. Esterle, “Deep learning in multiagent systems,” inDeep Learning for Robot Perception and Cognition. Elsevier, 2022, pp. 435–460
2022
-
[6]
A survey of clustering federated learning in heterogeneous data scenarios,
E. Liu, W. Yang, Y . Gu, W. Long, S. Istv ´an, and L. Jiang, “A survey of clustering federated learning in heterogeneous data scenarios,”Journal of Computing and Electronic Information Management, vol. 16, no. 3, pp. 17–22, 2025
2025
-
[7]
Filip Hanzely and Peter Richtárik
A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,”IEEE Trans. Inf. Theory, vol. 68, no. 12, pp. 8076–8091, 2022. [Online]. Available: https://doi.org/10.1109/TIT.2022.3192506
-
[8]
Decentralized proximity-aware clustering for collective self-federated learning,
D. Domini, N. Farabegoli, G. Aguzzi, M. Viroli, and L. Esterle, “Decentralized proximity-aware clustering for collective self-federated learning,”Internet of Things, vol. 35, p. 101841, 2026. [Online]. Available: https://doi.org/10.1016/j.iot.2025.101841
-
[9]
Exploration by random network distillation,
Y . Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” in7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [Online]. Available: https: //openreview.net/forum?id=H1lJJnR5Ym
2019
-
[10]
Expert-free online transfer learning in multi-agent reinforcement learning,
A. Castagna and I. Dusparic, “Expert-free online transfer learning in multi-agent reinforcement learning,” inECAI 2023 - 26th European Conference on Artificial Intelligence, September 30 - October 4, 2023, Krak´ow, Poland., K. Gal, A. Now ´e, G. J. Nalepa, R. Fairstein, and R. Radulescu, Eds. IOS Press, 2023, pp. 357–364. [Online]. Available: https://doi....
-
[11]
Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas
H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, “Federated learning of deep networks using model averaging,”CoRR, vol. abs/1602.05629, 2016. [Online]. Available: http://arxiv.org/abs/ 1602.05629
-
[12]
A performance evaluation of federated learning algorithms,
A. Nilsson, S. Smith, G. Ulm, E. Gustavsson, and M. Jirstrand, “A performance evaluation of federated learning algorithms,” in Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning, DIDL@Middleware 2018, Rennes, France, December 10, 2018. ACM, 2018, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3286490.3286559
-
[13]
C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift
D. Domini, G. Aguzzi, L. Pellegrini, M. Viroli, and L. Esterle, “C2fl: Clustered continual federated learning under spatial and temporal drift,”CoRR, vol. abs/2606.18003, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2606.18003
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2606.18003 2026
-
[14]
FBFL: A field-based coordination approach for data heterogeneity in federated learning,
D. Domini, G. Aguzzi, L. Esterle, and M. Viroli, “FBFL: A field-based coordination approach for data heterogeneity in federated learning,” Log. Methods Comput. Sci., vol. 22, no. 1, 2026. [Online]. Available: https://doi.org/10.46298/lmcs-22(1:19)2026
-
[15]
Sparseful: Self-organizing sparse federated learning over spatially non-iid data
D. Domini, G. Aguzzi, A. D. Zenoozi, L. Erhan, L. Cavallaro, A. Liotta, and M. Viroli, “Sparseful: Self-organizing sparse federated learning over spatially non-iid data.”SSRN, 2026. [Online]. Available: https://ssrn.com/abstract=6584516
2026
-
[16]
Learning multiple layers of features from tiny images,
A. Krizhevsky, G. Hintonet al., “Learning multiple layers of features from tiny images,” 2009
2009
-
[17]
Profed: a benchmark for proximity-based non-iid federated learning,
D. Domini, C. O. Ingemann, G. Aguzzi, L. Esterle, and M. Viroli, “Profed: a benchmark for proximity-based non-iid federated learning,” Joural of Open Research Software, vol. 14, 2026. [Online]. Available: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.624
-
[18]
Deep Residual Learning for Image Recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV , USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 770–778. [Online]. Available: https://doi.org/10.1109/CVPR.2016.90
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.