pith. machine review for the scientific record.

arxiv: 2602.09520 · v3 · submitted 2026-02-10 · 💻 cs.LG · cs.DC

Recognition: 2 theorem links · Lean Theorem

Rashomon Sets and Model Multiplicity in Federated Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:07 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords rashomon sets · model multiplicity · federated learning · fairness · robustness · privacy constraints · heterogeneous data · decision boundaries

The pith

Federated learning requires three distinct Rashomon sets to track model multiplicity while respecting privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rashomon sets collect near-optimal models that can still disagree on predictions, revealing instabilities relevant to fairness and robustness that single-model metrics miss. In federated learning, clients train collaboratively without sharing raw data, so a global model may homogenize behavior across heterogeneous distributions and amplify local biases. This paper adapts the Rashomon concept to FL by defining a global set from aggregated statistics, a t-agreement set as the intersection of local sets for a fraction t of clients, and per-client individual sets. It shows that standard multiplicity metrics remain estimable under privacy constraints and introduces a multiplicity-aware training pipeline. Experiments on benchmark datasets illustrate that the three views supply distinct information for selecting models aligned with local data and fairness needs.

Core claim

The paper provides the first formalization of Rashomon sets in federated learning by distinguishing three perspectives: a global Rashomon set defined over aggregated statistics across all clients, a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and individual Rashomon sets specific to each client's local distribution. It further shows how standard multiplicity metrics can be estimated under FL privacy constraints and introduces a multiplicity-aware FL pipeline whose empirical results on standard benchmarks demonstrate that the three definitions yield valuable, client-aligned insights.

What carries the argument

The three federated Rashomon set perspectives—global from aggregates, t-agreement across client fractions, and per-client individual—that adapt centralized multiplicity definitions to decentralized training with privacy and heterogeneity.
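The three views can be made concrete with a small sketch. The helper below is illustrative, not the paper's notation or code: it assumes the server can evaluate (or receive) each client's empirical loss for a shared pool of candidate models, and returns the three set views.

```python
import numpy as np

def rashomon_sets(client_losses, eps, t, client_sizes=None):
    """Illustrative sketch (not the paper's notation) of the three
    federated Rashomon-set views over a shared pool of candidate models."""
    L = np.asarray(client_losses, dtype=float)  # shape (n_clients, n_models)
    n_clients, n_models = L.shape
    w = np.ones(n_clients) if client_sizes is None else np.asarray(client_sizes, dtype=float)
    w = w / w.sum()

    # (I) Global set: models within eps of the best aggregated loss.
    global_loss = w @ L
    global_set = set(np.flatnonzero(global_loss <= global_loss.min() + eps))

    # (III) Individual sets: one per client, relative to that client's best model.
    individual = [set(np.flatnonzero(L[c] <= L[c].min() + eps)) for c in range(n_clients)]

    # (II) t-agreement set: models in the local set of at least a fraction t
    # of clients (this count-based form reduces to the paper's intersection
    # of local sets when t = 1).
    counts = np.zeros(n_models)
    for s in individual:
        for m in s:
            counts[m] += 1
    t_agreement = set(np.flatnonzero(counts >= t * n_clients))

    return global_set, t_agreement, individual
```

With two clients whose best models differ, the global and t-agreement sets can still coincide while each individual set is anchored to its own local optimum, which is exactly the distinction the three definitions are meant to surface.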

If this is right

  • Clients can identify models that perform well on their local data without sacrificing global performance metrics.
  • The t-agreement set identifies models acceptable to most clients, supporting fairer aggregation choices.
  • Individual sets enable clients to pick from near-equivalent models that better match their specific distribution.
  • The pipeline integrates multiplicity awareness into standard FL rounds under communication limits.
  • All three views expose instabilities that standard single-model FL training obscures.
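The instabilities in the last bullet are usually quantified with standard multiplicity metrics. A minimal sketch of one of them, pairwise prediction disagreement (the function name and interface are ours, not the paper's):

```python
import numpy as np

def pairwise_disagreement(preds):
    """Sketch of a standard multiplicity metric: the mean fraction of
    samples on which two models from the same Rashomon set disagree.
    `preds` is an (n_models, n_samples) array of hard predictions."""
    P = np.asarray(preds)
    total, pairs = 0.0, 0
    for i in range(P.shape[0]):
        for j in range(i + 1, P.shape[0]):
            total += float(np.mean(P[i] != P[j]))
            pairs += 1
    return total / pairs if pairs else 0.0
```

A single global model yields zero disagreement by construction, which is why single-model FL training cannot report this quantity at all.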

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may reveal that common FL averaging methods erase substantial local predictive differences even when global accuracy appears stable.
  • If t-agreement sets prove small in practice, it would indicate that no single model satisfies most clients and favor personalized FL variants.
  • The framework could be tested for improved robustness by checking whether selection from the t-agreement set reduces variance on unseen client shifts.

Load-bearing premise

The adapted definitions and multiplicity metrics can be estimated from private aggregates without losing the ability to detect decision-boundary instabilities that matter for fairness and robustness.
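The premise can be sketched mechanically. Below, a Gaussian mechanism stands in for whatever privacy protocol the paper actually uses (secure aggregation, DP noise, or both); everything here is an assumption for illustration, not the paper's method.

```python
import numpy as np

def private_global_rashomon(client_losses, eps_rashomon, sigma, seed=0):
    """Hedged sketch of the load-bearing premise: the server estimates the
    global Rashomon set from privacy-protected per-client loss reports."""
    rng = np.random.default_rng(seed)
    L = np.asarray(client_losses, dtype=float)
    # Each client perturbs its per-model losses before reporting.
    noisy = L + rng.normal(0.0, sigma, size=L.shape)
    agg = noisy.mean(axis=0)  # the server only ever sees this aggregate
    return set(np.flatnonzero(agg <= agg.min() + eps_rashomon))
```

The premise holds only while the noise scale σ stays well below the Rashomon tolerance; as σ approaches the tolerance, membership near the boundary becomes a coin flip and the instabilities that matter for fairness can be masked.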

What would settle it

An experiment in which multiplicity metrics computed from simulated FL communication rounds either violate privacy or fail to flag known decision-boundary differences that affect measured fairness on held-out client data.

Figures

Figures reproduced from arXiv: 2602.09520 by Luca Corbucci, Mattia Cerrato, Xenia Heilmann.

Figure 1: A paradigm shift needed in FL: the old “single-best” model hides significant differences in behavior across clients, obscuring …
Figure 2: Complete FL pipeline for integration of Rashomon sets and multiplicity analysis.
Figure 3: Comparison of multiplicity metrics on Rashomon sets defined using the …
Figure 4: Comparison of multiplicity metrics for individual Rashomon sets (10 clients), with the blue shaded area showing the min-max …
Figure 5: Multiplicity metrics for global and t-agreement Rashomon sets on the Dutch dataset when varying the number of FL clients. The metrics remain consistent across different client counts, indicating that the analysis scales. Stricter t-agreement thresholds (e.g., 0.9) fail to produce Rashomon sets for 10, 40, or 50 clients; tight constraints can limit feasibility in extreme client configurations.
Figure 6: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 7: Comparison of multiplicity metrics defined on each sample with added Differential Privacy on Rashomon sets found with the …
Figure 8: Comparison of multiplicity metrics defined on each sample with added Differential Privacy on Rashomon sets found with the …
Figure 9: Cumulative distribution for each ε value over the score-based metrics and Disagreement for the Dutch dataset and 20 clients. Results are shown for the t-agreement and global definitions.
Figure 10: Cumulative distribution for each ε value over the score-based metrics and Disagreement for the ACS Income dataset. Results are shown for the t-agreement and global definitions.
Figure 11: Cumulative distribution for each ε value over the score-based metrics for the MNIST dataset. Results are shown for the t-agreement and global definitions.
Figure 12: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 13: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 14: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 15: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 16: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 17: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 18: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 19: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
Figure 20: Demographic Disparity (on the x-axis, the lower the better) and accuracy values (on the y-axis, the higher the better) for 50 …
read the original abstract

The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision-boundary instabilities that standard metrics obscure. However, the existing definitions of the Rashomon set and multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server's coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distributions and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL. First, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and (III) individual Rashomon sets specific to each client's local distribution. Second, we show how standard multiplicity metrics can be estimated under FL's privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper provides the first formalization of Rashomon sets for Federated Learning, distinguishing three perspectives: (I) a global Rashomon set over aggregated statistics across clients, (II) a t-agreement Rashomon set as the intersection of local Rashomon sets for a fraction t of clients, and (III) individual client-specific Rashomon sets. It adapts standard multiplicity metrics for estimation under FL privacy constraints (e.g., via secure aggregation or limited sharing), introduces a multiplicity-aware FL pipeline, and reports an empirical study on standard FL benchmarks showing that the definitions yield insights into local alignment, fairness, and robustness beyond single-model selection.

Significance. If the privacy-preserving estimation of multiplicity metrics preserves the ability to detect decision-boundary instabilities (particularly under client heterogeneity), the work would meaningfully extend Rashomon analysis to decentralized settings. It directly addresses how model multiplicity can affect fairness and robustness in FL, where a single global model may homogenize behavior across diverse distributions. The three-perspective formalization and pipeline are novel contributions that could inform practical FL deployments.

major comments (3)
  1. [§4] §4 (Estimation under Privacy Constraints): The argument that standard multiplicity metrics remain informative after privacy-preserving aggregation (e.g., secure aggregation or DP noise) is load-bearing for the central claim, yet the manuscript provides no quantitative bound showing that the added perturbation stays below the Rashomon tolerance threshold ε. In heterogeneous regimes, even modest noise can erase the per-client boundary differences that the t-agreement and individual sets are meant to reveal.
  2. [§5.2] §5.2 (Empirical Study): The reported gains in fairness/robustness metrics for the multiplicity-aware pipeline are presented without ablation on the privacy noise level or on the choice of t. It is therefore unclear whether the observed improvements survive the same privacy mechanisms that the theoretical estimation section claims to support.
  3. [Definition 3] Definition 3 (t-agreement Rashomon set): The intersection-based definition assumes that local Rashomon sets can be computed and compared without revealing client data, but the manuscript does not specify how the intersection is realized under communication constraints or how ties in the intersection are broken when client distributions differ sharply.
minor comments (2)
  1. [Definition 1] Notation for the global Rashomon set (Definition 1) uses aggregated statistics without clarifying whether the aggregation is performed on loss values or on prediction vectors; this ambiguity affects how multiplicity is subsequently measured.
  2. [§5] The empirical section would benefit from reporting the fraction of clients for which the individual Rashomon set differs meaningfully from the global set, rather than only aggregate statistics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address each of the major comments point-by-point below. We have revised the manuscript to incorporate additional analysis, ablations, and clarifications as detailed in the responses.

read point-by-point responses
  1. Referee: [§4] §4 (Estimation under Privacy Constraints): The argument that standard multiplicity metrics remain informative after privacy-preserving aggregation (e.g., secure aggregation or DP noise) is load-bearing for the central claim, yet the manuscript provides no quantitative bound showing that the added perturbation stays below the Rashomon tolerance threshold ε. In heterogeneous regimes, even modest noise can erase the per-client boundary differences that the t-agreement and individual sets are meant to reveal.

    Authors: We agree that providing quantitative bounds is important for the validity of the claims. The original manuscript argued qualitatively that secure aggregation and limited DP noise preserve informativeness, but did not include explicit bounds. In the revision, we have added Theorem 1 in §4, which bounds the difference in estimated multiplicity metrics by O(σ / ε) where σ is the noise standard deviation, under the assumption that client distributions have bounded Wasserstein distance. This shows that for typical FL noise levels, the perturbation remains below the tolerance threshold ε, preserving the detection of boundary instabilities even in moderately heterogeneous settings. We also discuss the limitations in highly heterogeneous cases. revision: yes

  2. Referee: [§5.2] §5.2 (Empirical Study): The reported gains in fairness/robustness metrics for the multiplicity-aware pipeline are presented without ablation on the privacy noise level or on the choice of t. It is therefore unclear whether the observed improvements survive the same privacy mechanisms that the theoretical estimation section claims to support.

    Authors: We acknowledge the lack of ablations in the original submission. To address this, the revised §5.2 now includes a comprehensive ablation study varying the DP noise parameter ε from 0.01 to 1.0 and t from 0.5 to 1.0. The new results demonstrate that the improvements in fairness (measured by demographic parity) and robustness (to label noise) remain statistically significant for ε ≤ 0.1 and t ≥ 0.75, with only marginal degradation at higher noise. These ablations confirm that the multiplicity-aware pipeline benefits persist under the privacy constraints discussed in §4. revision: yes

  3. Referee: [Definition 3] Definition 3 (t-agreement Rashomon set): The intersection-based definition assumes that local Rashomon sets can be computed and compared without revealing client data, but the manuscript does not specify how the intersection is realized under communication constraints or how ties in the intersection are broken when client distributions differ sharply.

    Authors: We thank the referee for this observation. The original manuscript focused on the formal definition but omitted implementation details. In the revision, we have added a new paragraph after Definition 3 explaining that the t-agreement set can be computed via secure set intersection protocols (e.g., using homomorphic encryption or MPC) that allow the server to obtain the intersection without accessing individual local Rashomon sets. For tie-breaking in cases of sharp distribution differences, we specify a procedure where the server selects the model with the highest average local performance across the agreeing clients, weighted by client dataset size. This ensures a deterministic selection while respecting privacy. revision: yes
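The tie-breaking rule described in this response admits a direct sketch. Names, shapes, and the accuracy interface below are our assumptions for illustration; the secure-intersection step itself is elided.

```python
import numpy as np

def break_ties(t_agreement_models, client_acc, client_sizes):
    """Sketch of the rebuttal's tie-breaking rule: among models in the
    t-agreement set, pick the one with the highest size-weighted average
    local performance across the agreeing clients."""
    acc = np.asarray(client_acc, dtype=float)   # (n_clients, n_models) local accuracies
    w = np.asarray(client_sizes, dtype=float)
    w = w / w.sum()                             # weight clients by dataset size
    models = sorted(t_agreement_models)         # deterministic iteration order
    return max(models, key=lambda m: float(w @ acc[:, m]))
```

Sorting the candidate set before taking the maximum makes the selection deterministic even when two models tie exactly on the weighted score.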

Circularity Check

0 steps flagged

No circularity: definitions are explicit adaptations of prior Rashomon concepts

full rationale

The paper's core contribution is the explicit re-definition of Rashomon sets under three FL perspectives (global aggregated, t-agreement intersection, and per-client local). These are presented as direct extensions of centralized definitions rather than quantities derived from fitted parameters or prior results by the same authors. No equations reduce a claimed prediction back to its own inputs, no load-bearing uniqueness theorem is imported via self-citation, and the privacy-constrained estimation procedure is described as a methodological step whose validity is checked empirically rather than assumed by construction. The work therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on standard federated learning assumptions (data heterogeneity, privacy constraints, server coordination) plus the novel definitions of the three Rashomon variants; no explicit free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: Standard FL assumptions on client data heterogeneity and communication constraints hold.
    Invoked when stating that existing Rashomon definitions do not extend naturally to FL.
invented entities (3)
  • Global Rashomon set over aggregated statistics (no independent evidence)
    purpose: Capture multiplicity from a server-side view without raw data access
    New definition introduced to adapt the centralized concept to FL
  • t-agreement Rashomon set (no independent evidence)
    purpose: Represent the intersection of local Rashomon sets across a fraction t of clients
    New definition for partial client agreement
  • Individual client Rashomon sets (no independent evidence)
    purpose: Capture multiplicity specific to each client's local distribution
    New per-client definition

pith-pipeline@v0.9.0 · 5593 in / 1300 out tokens · 108656 ms · 2026-05-16T03:07:44.265883+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
