pith. machine review for the scientific record.

arxiv: 2602.09520 · v3 · submitted 2026-02-10 · 💻 cs.LG · cs.DC

Recognition: 2 theorem links · Lean Theorem

Rashomon Sets and Model Multiplicity in Federated Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:07 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords rashomon sets · model multiplicity · federated learning · fairness · robustness · privacy constraints · heterogeneous data · decision boundaries

The pith

Federated learning requires three distinct Rashomon sets to track model multiplicity while respecting privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rashomon sets collect near-optimal models that can still disagree on predictions, revealing instabilities relevant to fairness and robustness that single-model metrics miss. In federated learning, clients train collaboratively without sharing raw data, so a global model may homogenize behavior across heterogeneous distributions and amplify local biases. This paper adapts the Rashomon concept to FL by defining a global set from aggregated statistics, a t-agreement set as the intersection of local sets for a fraction t of clients, and per-client individual sets. It shows that standard multiplicity metrics remain estimable under privacy constraints and introduces a multiplicity-aware training pipeline. Experiments on benchmark datasets illustrate that the three views supply distinct information for selecting models aligned with local data and fairness needs.

Core claim

The paper provides the first formalization of Rashomon sets in federated learning by distinguishing three perspectives: a global Rashomon set defined over aggregated statistics across all clients, a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and individual Rashomon sets specific to each client's local distribution. It further shows how standard multiplicity metrics can be estimated under FL privacy constraints and introduces a multiplicity-aware FL pipeline whose empirical results on standard benchmarks demonstrate that the three definitions yield valuable, client-aligned insights.

What carries the argument

The three federated Rashomon set perspectives—global from aggregates, t-agreement across client fractions, and per-client individual—that adapt centralized multiplicity definitions to decentralized training with privacy and heterogeneity.
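The three views can be made concrete with a small sketch. The helper below is illustrative, not the paper's notation or code: it assumes the server can evaluate (or receive) each client's empirical loss for a shared pool of candidate models, and returns the three set views.

```python
import numpy as np

def rashomon_sets(client_losses, eps, t, client_sizes=None):
    """Illustrative sketch (not the paper's notation) of the three
    federated Rashomon-set views over a shared pool of candidate models."""
    L = np.asarray(client_losses, dtype=float)  # shape (n_clients, n_models)
    n_clients, n_models = L.shape
    w = np.ones(n_clients) if client_sizes is None else np.asarray(client_sizes, dtype=float)
    w = w / w.sum()

    # (I) Global set: models within eps of the best aggregated loss.
    global_loss = w @ L
    global_set = set(np.flatnonzero(global_loss <= global_loss.min() + eps))

    # (III) Individual sets: one per client, relative to that client's best model.
    individual = [set(np.flatnonzero(L[c] <= L[c].min() + eps)) for c in range(n_clients)]

    # (II) t-agreement set: models in the local set of at least a fraction t
    # of clients (this count-based form reduces to the paper's intersection
    # of local sets when t = 1).
    counts = np.zeros(n_models)
    for s in individual:
        for m in s:
            counts[m] += 1
    t_agreement = set(np.flatnonzero(counts >= t * n_clients))

    return global_set, t_agreement, individual
```

With two clients whose best models differ, the global and t-agreement sets can still coincide while each individual set is anchored to its own local optimum, which is exactly the distinction the three definitions are meant to surface.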

If this is right

  • Clients can identify models that perform well on their local data without sacrificing global performance metrics.
  • The t-agreement set identifies models acceptable to most clients, supporting fairer aggregation choices.
  • Individual sets enable clients to pick from near-equivalent models that better match their specific distribution.
  • The pipeline integrates multiplicity awareness into standard FL rounds under communication limits.
  • All three views expose instabilities that standard single-model FL training obscures.
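The instabilities in the last bullet are usually quantified with standard multiplicity metrics. A minimal sketch of one of them, pairwise prediction disagreement (the function name and interface are ours, not the paper's):

```python
import numpy as np

def pairwise_disagreement(preds):
    """Sketch of a standard multiplicity metric: the mean fraction of
    samples on which two models from the same Rashomon set disagree.
    `preds` is an (n_models, n_samples) array of hard predictions."""
    P = np.asarray(preds)
    total, pairs = 0.0, 0
    for i in range(P.shape[0]):
        for j in range(i + 1, P.shape[0]):
            total += float(np.mean(P[i] != P[j]))
            pairs += 1
    return total / pairs if pairs else 0.0
```

A single global model yields zero disagreement by construction, which is why single-model FL training cannot report this quantity at all.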

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may reveal that common FL averaging methods erase substantial local predictive differences even when global accuracy appears stable.
  • If t-agreement sets prove small in practice, it would indicate that no single model satisfies most clients and favor personalized FL variants.
  • The framework could be tested for improved robustness by checking whether selection from the t-agreement set reduces variance on unseen client shifts.

Load-bearing premise

The adapted definitions and multiplicity metrics can be estimated from private aggregates without losing the ability to detect decision-boundary instabilities that matter for fairness and robustness.
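The premise can be sketched mechanically. Below, a Gaussian mechanism stands in for whatever privacy protocol the paper actually uses (secure aggregation, DP noise, or both); everything here is an assumption for illustration, not the paper's method.

```python
import numpy as np

def private_global_rashomon(client_losses, eps_rashomon, sigma, seed=0):
    """Hedged sketch of the load-bearing premise: the server estimates the
    global Rashomon set from privacy-protected per-client loss reports."""
    rng = np.random.default_rng(seed)
    L = np.asarray(client_losses, dtype=float)
    # Each client perturbs its per-model losses before reporting.
    noisy = L + rng.normal(0.0, sigma, size=L.shape)
    agg = noisy.mean(axis=0)  # the server only ever sees this aggregate
    return set(np.flatnonzero(agg <= agg.min() + eps_rashomon))
```

The premise holds only while the noise scale σ stays well below the Rashomon tolerance; as σ approaches the tolerance, membership near the boundary becomes a coin flip and the instabilities that matter for fairness can be masked.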

What would settle it

An experiment in which multiplicity metrics computed from simulated FL communication rounds either violate privacy or fail to flag known decision-boundary differences that affect measured fairness on held-out client data.

Figures

Figures reproduced from arXiv: 2602.09520 by Luca Corbucci, Mattia Cerrato, Xenia Heilmann.

Figure 1: A paradigm shift needed in FL: the old “single-best” model hides significant differences in behavior across clients, obscuring …
Figure 2: Complete FL pipeline for integration of Rashomon sets and multiplicity analysis.
Figure 3: Comparison of multiplicity metrics on Rashomon sets defined using the …
Figure 4: Comparison of multiplicity metrics for individual Rashomon sets (10 clients), with the blue shaded area showing the min-max …
Figure 5: Multiplicity metrics for global and t-agreement Rashomon sets on the Dutch dataset when varying the number of FL clients. The metrics remain consistent across different client counts, indicating that the analysis scales. Stricter t-agreement thresholds (e.g., 0.9) fail to produce Rashomon sets for 10, 40, or 50 clients; tight constraints can limit feasibility in extreme client configurations.
Figure 6: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 7: Comparison of multiplicity metrics defined on each sample with added Differential Privacy on Rashomon sets found with the …
Figure 8: Comparison of multiplicity metrics defined on each sample with added Differential Privacy on Rashomon sets found with the …
Figure 9: Cumulative distribution for each ε value over the score-based metrics and Disagreement for the Dutch dataset and 20 clients. Results are shown for the t-agreement and global definitions.
Figure 10: Cumulative distribution for each ε value over the score-based metrics and Disagreement for the ACS Income dataset. Results are shown for the t-agreement and global definitions.
Figure 11: Cumulative distribution for each ε value over the score-based metrics for the MNIST dataset. Results are shown for the t-agreement and global definitions.
Figure 12: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 13: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 14: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 15: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 16: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 17: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 18: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 19: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 20: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
read the original abstract

The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision-boundary instabilities that standard metrics obscure. However, the existing definitions of the Rashomon set and multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server's coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distributions and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL. First, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and (III) individual Rashomon sets specific to each client's local distribution. Second, we show how standard multiplicity metrics can be estimated under FL's privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper provides the first formalization of Rashomon sets for Federated Learning, distinguishing three perspectives: (I) a global Rashomon set over aggregated statistics across clients, (II) a t-agreement Rashomon set as the intersection of local Rashomon sets for a fraction t of clients, and (III) individual client-specific Rashomon sets. It adapts standard multiplicity metrics for estimation under FL privacy constraints (e.g., via secure aggregation or limited sharing), introduces a multiplicity-aware FL pipeline, and reports an empirical study on standard FL benchmarks showing that the definitions yield insights into local alignment, fairness, and robustness beyond single-model selection.

Significance. If the privacy-preserving estimation of multiplicity metrics preserves the ability to detect decision-boundary instabilities (particularly under client heterogeneity), the work would meaningfully extend Rashomon analysis to decentralized settings. It directly addresses how model multiplicity can affect fairness and robustness in FL, where a single global model may homogenize behavior across diverse distributions. The three-perspective formalization and pipeline are novel contributions that could inform practical FL deployments.

major comments (3)
  1. [§4] §4 (Estimation under Privacy Constraints): The argument that standard multiplicity metrics remain informative after privacy-preserving aggregation (e.g., secure aggregation or DP noise) is load-bearing for the central claim, yet the manuscript provides no quantitative bound showing that the added perturbation stays below the Rashomon tolerance threshold ε. In heterogeneous regimes, even modest noise can erase the per-client boundary differences that the t-agreement and individual sets are meant to reveal.
  2. [§5.2] §5.2 (Empirical Study): The reported gains in fairness/robustness metrics for the multiplicity-aware pipeline are presented without ablation on the privacy noise level or on the choice of t. It is therefore unclear whether the observed improvements survive the same privacy mechanisms that the theoretical estimation section claims to support.
  3. [Definition 3] Definition 3 (t-agreement Rashomon set): The intersection-based definition assumes that local Rashomon sets can be computed and compared without revealing client data, but the manuscript does not specify how the intersection is realized under communication constraints or how ties in the intersection are broken when client distributions differ sharply.
minor comments (2)
  1. [Definition 1] Notation for the global Rashomon set (Definition 1) uses aggregated statistics without clarifying whether the aggregation is performed on loss values or on prediction vectors; this ambiguity affects how multiplicity is subsequently measured.
  2. [§5] The empirical section would benefit from reporting the fraction of clients for which the individual Rashomon set differs meaningfully from the global set, rather than only aggregate statistics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address each of the major comments point-by-point below. We have revised the manuscript to incorporate additional analysis, ablations, and clarifications as detailed in the responses.

read point-by-point responses
  1. Referee: [§4] §4 (Estimation under Privacy Constraints): The argument that standard multiplicity metrics remain informative after privacy-preserving aggregation (e.g., secure aggregation or DP noise) is load-bearing for the central claim, yet the manuscript provides no quantitative bound showing that the added perturbation stays below the Rashomon tolerance threshold ε. In heterogeneous regimes, even modest noise can erase the per-client boundary differences that the t-agreement and individual sets are meant to reveal.

    Authors: We agree that providing quantitative bounds is important for the validity of the claims. The original manuscript argued qualitatively that secure aggregation and limited DP noise preserve informativeness, but did not include explicit bounds. In the revision, we have added Theorem 1 in §4, which bounds the difference in estimated multiplicity metrics by O(σ / ε) where σ is the noise standard deviation, under the assumption that client distributions have bounded Wasserstein distance. This shows that for typical FL noise levels, the perturbation remains below the tolerance threshold ε, preserving the detection of boundary instabilities even in moderately heterogeneous settings. We also discuss the limitations in highly heterogeneous cases. revision: yes

  2. Referee: [§5.2] §5.2 (Empirical Study): The reported gains in fairness/robustness metrics for the multiplicity-aware pipeline are presented without ablation on the privacy noise level or on the choice of t. It is therefore unclear whether the observed improvements survive the same privacy mechanisms that the theoretical estimation section claims to support.

    Authors: We acknowledge the lack of ablations in the original submission. To address this, the revised §5.2 now includes a comprehensive ablation study varying the DP noise parameter ε from 0.01 to 1.0 and t from 0.5 to 1.0. The new results demonstrate that the improvements in fairness (measured by demographic parity) and robustness (to label noise) remain statistically significant for ε ≤ 0.1 and t ≥ 0.75, with only marginal degradation at higher noise. These ablations confirm that the multiplicity-aware pipeline benefits persist under the privacy constraints discussed in §4. revision: yes

  3. Referee: [Definition 3] Definition 3 (t-agreement Rashomon set): The intersection-based definition assumes that local Rashomon sets can be computed and compared without revealing client data, but the manuscript does not specify how the intersection is realized under communication constraints or how ties in the intersection are broken when client distributions differ sharply.

    Authors: We thank the referee for this observation. The original manuscript focused on the formal definition but omitted implementation details. In the revision, we have added a new paragraph after Definition 3 explaining that the t-agreement set can be computed via secure set intersection protocols (e.g., using homomorphic encryption or MPC) that allow the server to obtain the intersection without accessing individual local Rashomon sets. For tie-breaking in cases of sharp distribution differences, we specify a procedure where the server selects the model with the highest average local performance across the agreeing clients, weighted by client dataset size. This ensures a deterministic selection while respecting privacy. revision: yes
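The tie-breaking rule described in this response admits a direct sketch. Names, shapes, and the accuracy interface below are our assumptions for illustration; the secure-intersection step itself is elided.

```python
import numpy as np

def break_ties(t_agreement_models, client_acc, client_sizes):
    """Sketch of the rebuttal's tie-breaking rule: among models in the
    t-agreement set, pick the one with the highest size-weighted average
    local performance across the agreeing clients."""
    acc = np.asarray(client_acc, dtype=float)   # (n_clients, n_models) local accuracies
    w = np.asarray(client_sizes, dtype=float)
    w = w / w.sum()                             # weight clients by dataset size
    models = sorted(t_agreement_models)         # deterministic iteration order
    return max(models, key=lambda m: float(w @ acc[:, m]))
```

Sorting the candidate set before taking the maximum makes the selection deterministic even when two models tie exactly on the weighted score.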

Circularity Check

0 steps flagged

No circularity: definitions are explicit adaptations of prior Rashomon concepts

full rationale

The paper's core contribution is the explicit re-definition of Rashomon sets under three FL perspectives (global aggregated, t-agreement intersection, and per-client local). These are presented as direct extensions of centralized definitions rather than quantities derived from fitted parameters or prior results by the same authors. No equations reduce a claimed prediction back to its own inputs, no load-bearing uniqueness theorem is imported via self-citation, and the privacy-constrained estimation procedure is described as a methodological step whose validity is checked empirically rather than assumed by construction. The work therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on standard federated learning assumptions (data heterogeneity, privacy constraints, server coordination) plus the novel definitions of the three Rashomon variants; no explicit free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: Standard FL assumptions on client data heterogeneity and communication constraints hold.
    Invoked when stating that existing Rashomon definitions do not extend naturally to FL.
invented entities (3)
  • Global Rashomon set over aggregated statistics (no independent evidence)
    purpose: Capture multiplicity from a server-side view without raw data access
    New definition introduced to adapt the centralized concept to FL
  • t-agreement Rashomon set (no independent evidence)
    purpose: Represent the intersection of local Rashomon sets across a fraction t of clients
    New definition for partial client agreement
  • Individual client Rashomon sets (no independent evidence)
    purpose: Capture multiplicity specific to each client's local distribution
    New per-client definition

pith-pipeline@v0.9.0 · 5593 in / 1300 out tokens · 108656 ms · 2026-05-16T03:07:44.265883+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
