pith. machine review for the scientific record.

arxiv: 2605.09137 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Evaluating Federated Learning approaches for mammography under breast density heterogeneity

Franco Martin Di Maria, Gonzalo Iñaki Quintana, Laurence Vancamberg


Pith reviewed 2026-05-12 01:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords federated learning, mammography, breast density, data heterogeneity, image classification, medical imaging, privacy-preserving training

The pith

Federated averaging matches or exceeds centralized training accuracy for mammography despite breast density variations across sites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how breast density differences across medical sites create data imbalances that could hinder collaborative machine learning. It runs controlled tests with sites holding only low-density or only high-density cases, plus more realistic population-based mixes, and compares standard federated learning algorithms to training on pooled data. FedAvg in particular reaches accuracy on par with or higher than centralized training, while single-site models and naive aggregation (ensembling and weight averaging) fall behind under strong heterogeneity. This result suggests institutions can build stronger shared models without moving private images. The finding matters for clinical workflows because it indicates that standard federated methods can already handle this common source of variation in medical data.

Core claim

In both extreme and population-based heterogeneity settings defined by BI-RADS density scores, federated learning with FedAvg produces classification accuracy on par with or above that of centralized training on all data combined, while local-only training and naive aggregation methods perform worse.

What carries the argument

Direct comparison of FedAvg, FedProx, and SCAFFOLD federated algorithms against centralized training, local training, ensembling, and weight averaging, using client partitions based on BI-RADS breast density categories.
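
As a rough illustration of the server-side step that FedAvg is generally understood to perform, the sketch below averages client parameter vectors weighted by local sample counts. The function name `fedavg_aggregate`, the flat-list parameter representation, and the toy sample counts are assumptions made for illustration; the paper's own implementation is not shown on this page.

```python
from typing import List, Sequence


def fedavg_aggregate(client_weights: Sequence[Sequence[float]],
                     client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted average of client parameter vectors (one server round)."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client and at least one client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, n_k in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        share = n_k / total  # client k contributes in proportion to its data
        for i, w in enumerate(weights):
            aggregated[i] += share * w
    return aggregated


# Hypothetical example: a low-density site and a high-density site (made-up values).
low_density_site = [1.0, 2.0]
high_density_site = [3.0, 6.0]
global_weights = fedavg_aggregate([low_density_site, high_density_site], [100, 300])
print(global_weights)
```

In a full training loop this step would alternate with local training on each client; only the aggregation is sketched here.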

If this is right

  • Multi-institution mammography models can be trained collaboratively while keeping all images at their original sites.
  • Standard FedAvg already copes with breast-density imbalance, so no extra heterogeneity-correction steps are required for this task.
  • Performance remains stable under both extreme single-density client splits and realistic population-based density distributions.
  • Local training at individual sites produces markedly lower accuracy when density distributions differ strongly between sites.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same density-robustness pattern may appear in other radiology tasks where patient factors create site-specific data shifts.
  • Real deployments would benefit from repeating the tests on live clinical data that includes scanner differences and demographic mixes beyond density scores.
  • Success with unmodified FedAvg could lower the technical barrier for radiology departments to start federated projects.

Load-bearing premise

Grouping mammography cases solely by BI-RADS density scores into client sites accurately represents the main variations found in actual multi-center clinical datasets.
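
To make the premise concrete, here is a minimal sketch of how cases might be assigned to simulated clients by BI-RADS density, covering the 2-client (A-B vs. C-D) and 4-client (one category per site) splits described in the abstract. The function name `partition_by_density`, the client labels, and the toy case IDs are hypothetical; the paper's actual partitioning code is not available on this page.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Case = Tuple[str, str]  # (case_id, BI-RADS density category "A".."D")


def partition_by_density(cases: List[Case], n_clients: int) -> Dict[str, List[str]]:
    """Split case IDs into 2 clients (A-B vs C-D) or 4 clients (one per category)."""
    if n_clients == 2:
        client_of = {"A": "client_AB", "B": "client_AB",
                     "C": "client_CD", "D": "client_CD"}
    elif n_clients == 4:
        client_of = {d: f"client_{d}" for d in "ABCD"}
    else:
        raise ValueError("only the 2-client and 4-client splits are sketched here")
    partitions: Dict[str, List[str]] = defaultdict(list)
    for case_id, density in cases:
        if density not in client_of:
            raise ValueError(f"unknown BI-RADS density: {density!r}")
        partitions[client_of[density]].append(case_id)
    return dict(partitions)


# Hypothetical toy cases; IDs and density labels are made up for illustration.
toy_cases = [("c1", "A"), ("c2", "B"), ("c3", "C"), ("c4", "D"), ("c5", "C")]
two_clients = partition_by_density(toy_cases, 2)
four_clients = partition_by_density(toy_cases, 4)
print(two_clients)
print(four_clients)
```

A split like this varies only density between clients, which is exactly why the premise matters: scanner or demographic differences between real sites are not represented.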

What would settle it

Applying the same FedAvg protocol to a genuine multi-institution mammography collection with naturally occurring density distributions and checking whether its accuracy still equals or exceeds centralized training.

Figures

Figures reproduced from arXiv: 2605.09137 by Franco Martin Di Maria, Gonzalo Iñaki Quintana, Laurence Vancamberg.

Figure 1: BI-RADS Breast composition categories. Each subfigure shows two views of the same …
Figure 2: Centralized and federated learning paradigms. Icons from Flaticon.com
Figure 3: Architecture of the patch-classifier (top) and whole image classifier (bottom), based on …
Figure 4: Folds bootstrap results of the patch-classifiers trained on the population-based heterogeneous setting, evaluated on the original test set (mean ± covariance). Best models for each fold are marked with a star symbol. Models that are not statistically different from the best model according to the Wilcoxon signed-rank test (p-values > 0.1) are indicated by horizontal dashed lines.
Figure 5: Folds bootstrap results of the whole image classifiers trained on the population-based heterogeneous setting and evaluated on the original test set, in terms of the AUC-ROC (mean ± covariance). Best models for each fold are marked with a star symbol. Models that are not statistically different from the best model according to the Wilcoxon signed-rank test (p-values > 0.1) are indicated by horizontal dashed…

Original abstract

Breast density is a key factor that influences mammography interpretation and is a major source of heterogeneity in multicenter datasets. Such heterogeneity poses challenges for collaborative machine learning across institutions, particularly in Federated Learning. This study aims to evaluate the impact of breast density-induced heterogeneity on FL for mammography image classification and to assess the robustness of common FL algorithms in realistic clinical settings. We conducted experiments under two scenarios: (1) a strongly heterogeneous setting where each participating site contributed exclusively low- or high-density cases, based on the BI-RADS density score, and (2) a population-based setting simulating breast density distributions in White and Asian populations. For the strongly heterogeneous setting, we evaluated two configurations: one with 2 clients, where the cases were grouped as BI-RADS A-B and C-D, and one with 4 clients, where each site contained cases of a single BI-RADS density. We compared three FL methods (FedAvg, FedProx, SCAFFOLD) against centralized training, local-only training, and naive aggregation approaches, including ensembling and weight averaging. Across both scenarios, FL achieved performance comparable to centralized training, while local models and naive aggregation approaches underperformed in the presence of strong heterogeneity. Notably, FedAvg achieved accuracy on par with or exceeding centralized training, demonstrating resilience to breast density-induced data imbalance without requiring specialized heterogeneity mitigation algorithms. These findings show that FL can address breast density-related heterogeneity, supporting its feasibility for real-world mammography workflows. The demonstrated robustness of FedAvg underscores the potential for broad clinical deployment of FL, enabling collaborative model development while maintaining data privacy.
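
The two naive aggregation baselines named in the abstract can be contrasted with a small sketch: ensembling averages the predictions of locally trained models, whereas weight averaging averages their parameters and predicts once. The logistic-regression stand-in, the function names, and the toy parameter values are assumptions for illustration only; the paper's classifiers are deep networks and its exact baseline procedure is not described on this page.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias) of a toy logistic model


def predict_proba(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Logistic-regression probability for one feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_predict(models: Sequence[Model], x: Sequence[float]) -> float:
    """Ensembling: average the outputs of the locally trained models."""
    return sum(predict_proba(w, b, x) for w, b in models) / len(models)


def weight_average(models: Sequence[Model]) -> Model:
    """Weight averaging: average the parameters once, with no further training."""
    n = len(models)
    dim = len(models[0][0])
    avg_w = [sum(w[i] for w, _ in models) / n for i in range(dim)]
    avg_b = sum(b for _, b in models) / n
    return avg_w, avg_b


# Hypothetical local models from two sites; all values are made up.
local_models = [([2.0, 0.0], 1.0), ([0.0, 2.0], -1.0)]
x = [1.0, 0.0]
p_ensemble = ensemble_predict(local_models, x)
avg_w, avg_b = weight_average(local_models)
p_weight_avg = predict_proba(avg_w, avg_b, x)
print(p_ensemble, p_weight_avg)
```

The two probabilities differ because the model is nonlinear in its parameters. Unlike FedAvg, neither baseline as sketched here alternates between local training and aggregation.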

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates federated learning (FL) methods for mammography classification under breast density heterogeneity. It simulates two scenarios: a strongly heterogeneous setting, with either two clients holding only low-density (BI-RADS A-B) or only high-density (BI-RADS C-D) cases or four clients each holding a single BI-RADS density category, and a population-based setting that mimics breast density distributions in White and Asian populations. Comparing FedAvg, FedProx, and SCAFFOLD to centralized training, local training, and naive aggregation, it claims that FL approaches achieve performance comparable to centralized training, with FedAvg notably matching or exceeding it, thus demonstrating resilience to density-induced data imbalance.

Significance. If the results hold under more realistic conditions, this work would indicate that standard FL algorithms like FedAvg can effectively handle breast density heterogeneity in mammography without additional mitigation strategies. This has potential significance for enabling privacy-preserving collaborative AI development across medical institutions, facilitating larger and more diverse training datasets for breast cancer detection models while complying with data protection regulations.

major comments (2)
  1. [Abstract and Experimental Setup] The central claim that FL matched centralized performance relies on simulated partitions using only BI-RADS density scores. However, this may understate real multicenter heterogeneity arising from scanner-specific factors (e.g., kVp/mAs, detector types, compression paddles, post-processing), as these are not modeled. If these factors dominate domain shift, the observed robustness of FedAvg could be an artifact of the simulation, weakening support for the broader conclusion about real-world workflows.
  2. [Results] The abstract reports that FedAvg achieved accuracy on par with or exceeding centralized training but provides no dataset sizes, exact metrics (e.g., accuracy, AUC values with standard deviations), statistical tests, or ablation results. This lack of quantitative detail leaves the claim plausible but insufficiently evidenced, making it difficult to verify the strength of the performance equivalence.
minor comments (2)
  1. [Abstract] The abstract could be strengthened by briefly mentioning the dataset used (e.g., source and size) and key performance numbers to allow readers to immediately gauge the results.
  2. Ensure that all FL algorithms are described with their specific hyperparameters and implementation details in the methods section for reproducibility.
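
For context on the kind of hyperparameter this comment has in mind, the standard formulations of the FedAvg aggregation step and the FedProx local objective are reproduced below. The notation is the conventional one from the FL literature, not taken from the manuscript under review: $F_k$ is client $k$'s local loss, $n_k$ its sample count, $n = \sum_k n_k$, $w^{t}$ the global model at round $t$, and $\mu \ge 0$ the FedProx proximal coefficient.

```latex
% FedAvg: server aggregation after local training on K clients
w^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \, w_k^{t+1}

% FedProx: local objective minimized by client k at round t
\min_{w} \; h_k(w; w^{t}) = F_k(w) + \frac{\mu}{2} \, \lVert w - w^{t} \rVert^{2}
```

Setting $\mu = 0$ removes the proximal term and recovers the plain local objective used by FedAvg, which is why reporting $\mu$ (along with local epochs, learning rate, and number of rounds) is needed to interpret any FedAvg versus FedProx comparison.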

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the opportunity to clarify and strengthen our manuscript. Below we respond point-by-point to the major comments.

  1. Referee: [Abstract and Experimental Setup] The central claim that FL matched centralized performance relies on simulated partitions using only BI-RADS density scores. However, this may understate real multicenter heterogeneity arising from scanner-specific factors (e.g., kVp/mAs, detector types, compression paddles, post-processing), as these are not modeled. If these factors dominate domain shift, the observed robustness of FedAvg could be an artifact of the simulation, weakening support for the broader conclusion about real-world workflows.

    Authors: We thank the referee for highlighting this important limitation. Our study deliberately focuses on breast density heterogeneity as a clinically significant and measurable source of variation in mammography datasets, using BI-RADS scores to create controlled partitions. While we acknowledge that scanner-specific factors contribute to domain shift in real multicenter settings, our results demonstrate that standard FL methods like FedAvg can handle density-induced heterogeneity without specialized adaptations. To address this, we will revise the discussion section to explicitly note that other sources of heterogeneity were not modeled and may require further investigation or additional techniques. We will also temper the language in the abstract and conclusions to avoid overgeneralizing to all forms of heterogeneity. revision: partial

  2. Referee: [Results] The abstract reports that FedAvg achieved accuracy on par with or exceeding centralized training but provides no dataset sizes, exact metrics (e.g., accuracy, AUC values with standard deviations), statistical tests, or ablation results. This lack of quantitative detail leaves the claim plausible but insufficiently evidenced, making it difficult to verify the strength of the performance equivalence.

    Authors: We agree that the abstract would benefit from more quantitative details to support the claims. The full manuscript contains comprehensive results in Section 4, including tables with accuracy and AUC metrics, standard deviations from multiple runs, dataset sizes per client, and comparisons to baselines. Statistical tests (Wilcoxon signed-rank tests, per the captions of Figures 4 and 5) were used to identify models that are not statistically different from the best model. We will update the abstract to include key quantitative findings and reference the statistical analyses. Ablation results are presented in the main text and supplementary material. revision: yes
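
As an illustration of the kind of quantitative summary under discussion, the sketch below computes AUC-ROC on a test set and a bootstrap mean and standard deviation. It is a simplified stand-in under stated assumptions: the function names, the number of resamples, the seed, and the toy labels and scores are all hypothetical, and the paper's exact bootstrap protocol and significance testing are not reproduced here.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels: Sequence[int], scores: Sequence[float],
                  n_boot: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of AUC over bootstrap resamples of the test set."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present in the test set")
    rng = random.Random(seed)
    n = len(labels)
    aucs: List[float] = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that contain a single class
            continue
        aucs.append(auc_roc(ys, [scores[i] for i in idx]))
    return statistics.mean(aucs), statistics.stdev(aucs)


# Hypothetical test-set labels and model scores, made up for illustration.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.3, 0.2, 0.9]
point_auc = auc_roc(y_true, y_score)
mean_auc, std_auc = bootstrap_auc(y_true, y_score)
print(point_auc, mean_auc, std_auc)
```

Reporting a spread alongside each point estimate, for FedAvg and for centralized training alike, is what would let a reader judge whether "on par with or exceeding" is a meaningful difference.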

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

full rationale

The paper reports experimental comparisons of FL algorithms (FedAvg, FedProx, SCAFFOLD) against centralized and local baselines under two simulated heterogeneity partitions based on BI-RADS density scores. No equations, ansatzes, uniqueness theorems, or parameter-fitting steps appear in the provided text. All performance claims rest on direct accuracy and AUC-ROC measurements from held-out test sets, not on any reduction of outputs to inputs by construction. Self-citations, if present, are not invoked to justify load-bearing premises. The central claim (FedAvg matches or exceeds centralized training) is therefore an empirical observation, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the experimental design choices for simulating heterogeneity and the assumption that BI-RADS-based splits capture the relevant clinical variation; no free parameters, new entities, or non-standard axioms are introduced beyond standard FL setup.

axioms (1)
  • domain assumption: BI-RADS density scores can be used to create realistic partitions that represent the main source of multicenter heterogeneity in mammography data.
    Invoked to define the strongly heterogeneous (2-client and 4-client) and population-based scenarios described in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
