Discovering Collaboration from Novelty: Random Network Distillation for Clustered Federated Learning

Danilo Pianini; Davide Domini; Gianluca Aguzzi; Ivana Dusparic; Mirko Viroli

arxiv: 2606.30499 · v1 · pith:5MGIPAS3new · submitted 2026-06-29 · 💻 cs.LG

Pith reviewed 2026-06-30 07:16 UTC · model grok-4.3

classification 💻 cs.LG

keywords clustered federated learning · random network distillation · non-IID data · client clustering · novelty estimation · autonomous collaboration · distributed systems

The pith

Clients use prediction errors from locally trained Random Network Distillation predictors to form data-similar clusters for federated learning before training begins.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that each client can train a compact Random Network Distillation predictor on its own data and treat the resulting prediction error as a novelty signal. This signal estimates how similar one client's data distribution is to others, enabling groups to be identified without sharing raw data or running the main model. Clustering therefore happens as a separate lightweight step rather than inside the federated training loop. The approach matters for large-scale systems because the number of clusters and the collaboration structure do not need to be known in advance; the federations simply emerge from the local signals at runtime.

Core claim

The central claim is that a Random Network Distillation predictor trained locally on each client's data produces a prediction error that reliably proxies similarity between client distributions. Using these errors, clients can discover meaningful groups prior to federated training without sharing raw data or repeatedly evaluating the main model. The resulting federations form dynamically from the local novelty estimates, supporting autonomous operation in settings where neither cluster count nor collaboration structure is specified beforehand.

What carries the argument

Random Network Distillation predictor: a compact neural network trained locally by each client to match the output of a fixed random network on its own data, with the prediction error serving as the novelty signal for inter-client similarity estimation.
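The mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`make_target`, `fit_predictor`, `novelty`) are hypothetical, the data are synthetic Gaussians, and the compact neural predictor is replaced by an affine least-squares fit so the example stays short and deterministic.

```python
import numpy as np

def make_target(dim_in, dim_hidden, dim_out, seed=0):
    """Fixed random target network; its weights are never trained."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(size=(dim_in, dim_hidden)) / np.sqrt(dim_in)
    w2 = rng.normal(size=(dim_hidden, dim_out)) / np.sqrt(dim_hidden)
    return lambda x: np.tanh(x @ w1) @ w2

def fit_predictor(x, target):
    """Assumption: an affine least-squares fit stands in for the compact
    neural predictor. It only sees the local data x."""
    features = np.hstack([x, np.ones((len(x), 1))])
    weights, *_ = np.linalg.lstsq(features, target(x), rcond=None)
    return lambda z: np.hstack([z, np.ones((len(z), 1))]) @ weights

def novelty(predictor, target, x):
    """Mean squared prediction error, used as the novelty signal."""
    return float(np.mean((predictor(x) - target(x)) ** 2))

rng = np.random.default_rng(1)
target = make_target(dim_in=8, dim_hidden=32, dim_out=4)

# Synthetic stand-ins for client datasets (illustrative only).
local_data = rng.normal(loc=0.0, size=(500, 8))
similar_data = rng.normal(loc=0.0, size=(500, 8))
shifted_data = rng.normal(loc=3.0, size=(500, 8))

predictor = fit_predictor(local_data, target)
self_novelty = novelty(predictor, target, local_data)
similar_novelty = novelty(predictor, target, similar_data)
shifted_novelty = novelty(predictor, target, shifted_data)

print(self_novelty, similar_novelty, shifted_novelty)
```

In this toy setting the predictor's error stays low on data drawn from the distribution it was fitted on and grows on the shifted data, which is the behaviour the novelty signal relies on.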

If this is right

Clustering is performed once before the main federated training loop, avoiding repeated integration costs.
Neither the number of clusters nor the collaboration structure must be specified ahead of time.
The method supplies a task-agnostic way to enable collaboration under non-IID data.
Similarity estimation occurs without exchanging raw client data or additional main-model evaluations.
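One way to see how the number of clusters can emerge rather than be fixed in advance is to link every pair of clients whose dissimilarity falls below a threshold and read off the connected components. The page does not say how the paper turns novelty values into groups, so the thresholding rule, the function `group_by_threshold`, and the dissimilarity values below are all assumptions made for illustration.

```python
def group_by_threshold(dissimilarity, threshold):
    """Label clients by connected components of the graph that links
    every pair whose dissimilarity is below the threshold."""
    n = len(dissimilarity)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dissimilarity[i][j] < threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    # Relabel components 0, 1, 2, ... in order of first appearance.
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]

# Made-up symmetric dissimilarities for five clients (not results).
d = [
    [0.0, 0.1, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.8, 0.9, 0.8],
    [0.9, 0.8, 0.0, 0.1, 0.9],
    [0.8, 0.9, 0.1, 0.0, 0.8],
    [0.9, 0.8, 0.9, 0.8, 0.0],
]
labels = group_by_threshold(d, threshold=0.3)
print(labels)  # [0, 0, 1, 1, 2]
```

Three groups appear here without the count being passed in; the threshold itself is still a choice that would need to be set or learned.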

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the novelty signal remains stable across training rounds, clusters could be refreshed periodically without restarting the entire process.
The same local-predictor technique might apply to other heterogeneous distributed learning settings beyond federated learning.
Empirical checks on real deployments with unknown cluster counts would test whether the autonomy property holds in practice.

Load-bearing premise

The prediction error produced by a locally trained Random Network Distillation predictor on a client's data serves as a reliable proxy for similarity to other clients' distributions, without any shared data or evaluation of the main federated model.

What would settle it

Run the method on a controlled set of clients whose data distributions are known in advance; if the resulting clusters fail to group clients with matching distributions or incorrectly merge clients with dissimilar ones, the claim is falsified.
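A check of this kind needs a score that compares the discovered grouping with the known one regardless of how the groups are labelled. The sketch below uses the plain pairwise agreement rate (the Rand index) as one possible choice; the function name `pair_agreement` and the example partitions are hypothetical, and the page does not state which metric, if any, the paper uses.

```python
from itertools import combinations

def pair_agreement(true_groups, found_groups):
    """Fraction of client pairs on which both partitions agree about
    'same group' versus 'different group' (the Rand index)."""
    pairs = list(combinations(range(len(true_groups)), 2))
    agree = sum(
        (true_groups[i] == true_groups[j]) == (found_groups[i] == found_groups[j])
        for i, j in pairs
    )
    return agree / len(pairs)

true_groups = [0, 0, 1, 1, 2, 2]   # known generating distributions
perfect = [5, 5, 7, 7, 9, 9]       # same partition, different label names
merged = [0, 0, 0, 0, 1, 1]        # wrongly merges the first two groups

score_perfect = pair_agreement(true_groups, perfect)
score_merged = pair_agreement(true_groups, merged)
print(score_perfect, score_merged)
```

A perfect recovery scores 1.0 whatever the label names, while the merged partition disagrees on 4 of the 15 client pairs and scores 11/15.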

Figures

Figures reproduced from arXiv: 2606.30499 by Danilo Pianini, Davide Domini, Gianluca Aguzzi, Ivana Dusparic, Mirko Viroli.

**Figure 2.** Difference between self-novelty s_ii and cross-novelty s_ij across devices. The colors of the device labels indicate the true clusters used to generate the experimental data. Near-zero values within diagonal blocks indicate that devices in the same group induce similar RND residuals.

**Figure 3.** Clustering overhead over federated rounds. Compared with IFCA, … (Plot: cumulative time [s] against global round, with series RND clustering, RND every 10 rounds, RND every round, and IFCA.)
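The Figure 2 caption names a self-novelty s_ii and a cross-novelty s_ij, but this page does not reproduce their definitions. One plausible reading, given here as an assumption rather than as the paper's formula, takes f as the fixed random target network, g_i as the predictor trained on device i, and D_j as the local dataset of device j. The squared L2 error, the index order (predictor first, data second), and the sign of the difference are all assumed.

```latex
s_{ij} = \frac{1}{|\mathcal{D}_j|} \sum_{x \in \mathcal{D}_j}
         \bigl\lVert g_i(x) - f(x) \bigr\rVert_2^2,
\qquad
\Delta_{ij} = s_{ij} - s_{ii}
```

Under this reading the diagonal difference is zero by construction, and a near-zero off-diagonal value means that the predictor of device i finds the data of device j about as familiar as its own.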

Original abstract

Federated Learning often suffers under non-independently and identically distributed data, where a single global model may fail to represent the diversity of client distributions. Clustered Federated Learning mitigates this issue by training specialized models for groups of similar clients, but existing approaches often couple cluster assignment with the main training loop, increasing computational and communication costs. We propose a lightweight clustering approach based on Random Network Distillation. Each client trains a compact Random Network Distillation predictor on its local data and uses its prediction error as a novelty signal to estimate similarity with other clients. This enables the discovery of meaningful client groups before federated training, without sharing raw data or repeatedly evaluating the main model. Crucially, the resulting federations emerge from local novelty estimates at runtime, making the method suitable for autonomous large-scale distributed systems where neither the number of clusters nor the collaboration structure can be specified a priori. Overall, by decoupling clustering from learning, the method provides a task-agnostic and efficient mechanism for autonomous collaboration under non-independently and identically distributed data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper decouples clustering from the main training loop in clustered federated learning by having clients compute local RND prediction errors as a novelty signal, but provides no support for why those errors would group clients in a way that improves the federated objective.

The main thing to know is that this work separates client grouping from the actual federated training by running a small Random Network Distillation predictor on each client's local data and treating the resulting prediction error as a similarity measure. That decoupling step is the concrete proposal.

What is new is the specific choice to apply RND in a pre-training phase for this purpose. Earlier clustered federated learning methods often interleave assignment decisions with model updates, so pulling the grouping out front does reduce some repeated communication and evaluation costs.

The description is clear on the workflow: each client trains its own compact predictor, computes errors without sharing raw data, and the groups form from those local signals at runtime. This targets settings where the number of clusters cannot be fixed in advance.

The central weakness is that nothing in the abstract explains or tests why RND error on one client's data would indicate useful similarity to another client for the learning task. RND was built for measuring novelty within a single stream of experience, not for comparing distributions across clients in terms of label shifts or gradient alignment. If the signal mainly captures input-space differences that do not affect the main model, the resulting clusters will not deliver the claimed benefit.

This is aimed at people working on practical federated systems that must handle unknown non-IID structure without central coordination. A reader already familiar with clustered FL would see the engineering angle quickly.

I would send it for peer review. The decoupling direction is worth referee time even if the similarity proxy needs experiments and justification to hold up.

Referee Report

1 major / 0 minor

Summary. The paper claims to introduce a lightweight pre-training clustering method for clustered federated learning using Random Network Distillation. Each client trains a compact local RND predictor on its private data and employs the resulting prediction error as a novelty signal to estimate similarity with other clients. This enables discovery of client groups before any federated training begins, without sharing raw data or repeatedly evaluating the main model, and supports autonomous formation of federations where neither the number of clusters nor the collaboration structure is specified a priori.

Significance. If the RND novelty signal is shown to produce clusters that improve downstream FL performance, the decoupling of clustering from the training loop would offer a computationally and communication-efficient, task-agnostic approach suitable for large-scale autonomous systems under non-IID data.

major comments (1)

[Abstract] The claim that the locally computed RND prediction error serves as a reliable proxy for client similarity (and thus for forming useful federations) lacks any derivation, reference to RND properties, or argument establishing correlation with FL-relevant quantities such as gradient alignment or label-conditional shifts rather than input-space novelty alone. This assumption is load-bearing for the central claim that meaningful autonomous federations emerge at runtime.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We address the concern point-by-point below and will revise the manuscript accordingly.

Referee: [Abstract] The claim that the locally computed RND prediction error serves as a reliable proxy for client similarity (and thus for forming useful federations) lacks any derivation, reference to RND properties, or argument establishing correlation with FL-relevant quantities such as gradient alignment or label-conditional shifts rather than input-space novelty alone. This assumption is load-bearing for the central claim that meaningful autonomous federations emerge at runtime.

Authors: We agree that the abstract would benefit from an explicit reference to RND properties. The full paper motivates the approach by noting that the prediction error of a predictor trained on one client's data, when evaluated on another's, quantifies distributional mismatch; this is a direct consequence of the RND formulation (Burda et al., 2018), where the fixed random target network induces a feature space in which reconstruction error reflects novelty relative to the training distribution. In non-IID FL, such input-space distributional differences frequently coincide with the heterogeneity that clustered FL aims to address. We do not provide a new theoretical derivation linking the signal to gradient alignment, as the method is presented as an empirical, task-agnostic heuristic whose utility is validated through downstream FL performance gains. In the revision we will (i) insert a one-sentence justification and citation into the abstract and (ii) add a short paragraph in Section 3 clarifying the scope of the claim (empirical utility rather than proven equivalence to gradient-based similarity).

revision: yes

Circularity Check

0 steps flagged

RND novelty signal applied directly as clustering heuristic without self-referential reduction

The paper defines client similarity explicitly via the prediction error of a locally trained RND predictor and uses that signal for pre-training clustering. No equation, parameter fit, or self-citation is shown that would make the claimed emergence of useful federations equivalent to the input definition by construction. The core mechanism is a straightforward, task-agnostic application of an existing novelty-detection technique, with the decoupling benefit stated as a direct consequence of performing clustering before main-model training rather than derived from any tautological step.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that RND prediction error correlates with client distribution similarity. The abstract names no explicit free parameters or invented entities, but the predictor architecture and its training procedure are an implicit modeling choice, counted here as the single free parameter.

free parameters (1)

RND predictor size and training details
Compact predictor architecture and training procedure must be chosen but are not specified in the abstract.

axioms (1)

domain assumption: Local RND prediction error reliably indicates similarity between client data distributions
Invoked when the novelty signal is used to estimate client similarity without further validation.

pith-pipeline@v0.9.1-grok · 5726 in / 1250 out tokens · 30503 ms · 2026-06-30T07:16:52.523229+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 11 canonical work pages · 1 internal anchor

[1]

Communication-efficient learning of deep networks from decentralized data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, A. Singh and X. J. Zhu, Eds., vol. 54. PMLR, 2017, pp. 1273–1282

2017
[2]

Federated learning on non-iid data silos: An experimental study,

Q. Li, Y. Diao, Q. Chen, and B. He, “Federated learning on non-iid data silos: An experimental study,” in 38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9-12, 2022. IEEE, 2022, pp. 965–978. [Online]. Available: https://doi.org/10.1109/ICDE53745.2022.00077

work page doi:10.1109/icde53745.2022.00077 2022
[3]

Advances and open problems in federated learning,

P. Kairouz et al., “Advances and open problems in federated learning,” Found. Trends Mach. Learn., vol. 14, no. 1-2, pp. 1–210, 2021. [Online]. Available: https://doi.org/10.1561/2200000083

work page doi:10.1561/2200000083 2021
[5]

Deep learning in multiagent systems,

L. Esterle, “Deep learning in multiagent systems,” in Deep Learning for Robot Perception and Cognition. Elsevier, 2022, pp. 435–460

2022
[6]

A survey of clustering federated learning in heterogeneous data scenarios,

E. Liu, W. Yang, Y. Gu, W. Long, S. István, and L. Jiang, “A survey of clustering federated learning in heterogeneous data scenarios,” Journal of Computing and Electronic Information Management, vol. 16, no. 3, pp. 17–22, 2025

2025
[7]

An efficient framework for clustered federated learning,

A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” IEEE Trans. Inf. Theory, vol. 68, no. 12, pp. 8076–8091, 2022. [Online]. Available: https://doi.org/10.1109/TIT.2022.3192506

work page doi:10.1109/tit.2022.3192506 2022
[8]

Decentralized proximity-aware clustering for collective self-federated learning,

D. Domini, N. Farabegoli, G. Aguzzi, M. Viroli, and L. Esterle, “Decentralized proximity-aware clustering for collective self-federated learning,” Internet of Things, vol. 35, p. 101841, 2026. [Online]. Available: https://doi.org/10.1016/j.iot.2025.101841

work page doi:10.1016/j.iot.2025.101841 2026
[9]

Exploration by random network distillation,

Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [Online]. Available: https://openreview.net/forum?id=H1lJJnR5Ym

2019
[10]

Expert-free online transfer learning in multi-agent reinforcement learning,

A. Castagna and I. Dusparic, “Expert-free online transfer learning in multi-agent reinforcement learning,” in ECAI 2023 - 26th European Conference on Artificial Intelligence, September 30 - October 4, 2023, Kraków, Poland, K. Gal, A. Nowé, G. J. Nalepa, R. Fairstein, and R. Radulescu, Eds. IOS Press, 2023, pp. 357–364. [Online]. Available: https://doi....

work page doi:10.3233/faia230291 2023
[11]

Federated learning of deep networks using model averaging,

H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, “Federated learning of deep networks using model averaging,” CoRR, vol. abs/1602.05629, 2016. [Online]. Available: http://arxiv.org/abs/1602.05629

work page arXiv 2016
[12]

A performance evaluation of federated learning algorithms,

A. Nilsson, S. Smith, G. Ulm, E. Gustavsson, and M. Jirstrand, “A performance evaluation of federated learning algorithms,” in Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning, DIDL@Middleware 2018, Rennes, France, December 10, 2018. ACM, 2018, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3286490.3286559

work page doi:10.1145/3286490.3286559 2018
[13]

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

D. Domini, G. Aguzzi, L. Pellegrini, M. Viroli, and L. Esterle, “C2FL: Clustered continual federated learning under spatial and temporal drift,” CoRR, vol. abs/2606.18003, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2606.18003

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2606.18003 2026
[14]

FBFL: A field-based coordination approach for data heterogeneity in federated learning,

D. Domini, G. Aguzzi, L. Esterle, and M. Viroli, “FBFL: A field-based coordination approach for data heterogeneity in federated learning,” Log. Methods Comput. Sci., vol. 22, no. 1, 2026. [Online]. Available: https://doi.org/10.46298/lmcs-22(1:19)2026

work page doi:10.46298/lmcs-22(1:19)2026 2026
[15]

Sparseful: Self-organizing sparse federated learning over spatially non-iid data

D. Domini, G. Aguzzi, A. D. Zenoozi, L. Erhan, L. Cavallaro, A. Liotta, and M. Viroli, “Sparseful: Self-organizing sparse federated learning over spatially non-iid data,” SSRN, 2026. [Online]. Available: https://ssrn.com/abstract=6584516

2026
[16]

Learning multiple layers of features from tiny images,

A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009

2009
[17]

Profed: a benchmark for proximity-based non-iid federated learning,

D. Domini, C. O. Ingemann, G. Aguzzi, L. Esterle, and M. Viroli, “Profed: a benchmark for proximity-based non-iid federated learning,” Journal of Open Research Software, vol. 14, 2026. [Online]. Available: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.624

work page doi:10.5334/jors.624 2026
[18]

Deep Residual Learning for Image Recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 770–778. [Online]. Available: https://doi.org/10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
