pith. sign in

arxiv: 2605.29587 · v1 · pith:3TXASJLVnew · submitted 2026-05-28 · 🧬 q-bio.QM · cs.LG

FPLIER: Federated Pathway-Level Information Extractor

Pith reviewed 2026-06-28 23:59 UTC · model grok-4.3

classification 🧬 q-bio.QM cs.LG
keywords federated learningPLIERtranscriptomicspathway analysisprivacymembership inferencegene expression
0
0 comments X

The pith

FPLIER produces PLIER training updates algebraically identical to centralized pooling while keeping expression data local to each site.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FPLIER as a federated extension of the Pathway Level Information Extractor that lets multiple data holders train a shared model without pooling their private gene expression matrices. Secure aggregation combines local updates so the result matches exactly what a single centralized run on all data would produce. Public datasets can be added to the training process. Systematic membership inference tests show that attack success depends on the rank of the combined expression matrix, with higher ranks driving performance toward random guessing.

Core claim

Through secure aggregation, FPLIER produces training updates algebraically equivalent to those of a centralized pooled-data approach while keeping expression data local. Evaluation across simulated consortia from the K-CLIER and MultiPLIER studies shows stable convergence. Privacy risk is governed by the rank of the training expression matrix; incorporating public data or reducing dimensionality increases this rank and moves the system toward a full-rank regime in which training and non-training samples become indistinguishable to the attacker.

What carries the argument

Secure aggregation protocol that enforces algebraic equivalence between federated and centralized PLIER updates.

If this is right

  • Data holders can jointly train pathway-aware models without moving raw expression matrices.
  • Adding public datasets increases matrix rank and thereby reduces membership inference risk.
  • Dimensionality reduction steps can be applied to push the system toward the full-rank privacy regime.
  • The same secure-aggregation approach yields stable results across the two tested consortia structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same secure-aggregation wrapper could be applied to other gene-set factorization methods that rely on large compendia.
  • Consortia might deliberately design data collection to maximize effective rank as a privacy control.
  • The rank dependence points to a possible general lever for privacy in matrix-factorization models beyond PLIER.

Load-bearing premise

Secure aggregation can be realized in practice without leakage beyond the analyzed intermediate statistics and the federated updates remain numerically stable.

What would settle it

A membership inference attack that identifies training samples with accuracy significantly above chance on a full-rank expression matrix, or observed numerical divergence between federated and centralized runs on real multi-site data.

Figures

Figures reproduced from arXiv: 2605.29587 by Christian Berchtold, Daniele Malpetti, Francesca Mangili, Francesco Gualdi, Laura Azzimonti, Marco Scutari.

Figure 1
Figure 1. Figure 1: Overview of the PLIER and FPLIER decompositions. PLIER takes as input a gene expression matrix Y and a gene-set membership matrix C encoding prior biological knowledge. The method learns matrices Z, U, and B such that Y ≈ ZB and Z ≈ CU. Here, B provides a low-dimensional representation of samples in terms of latent variables (LVs), Z contains the corresponding gene loadings, and U links LVs to gene sets. I… view at source ↗
Figure 2
Figure 2. Figure 2: Membership inference attack performance during training (left) and after model release (right) as a function of the sample-to-feature dimensionality (n/p) ratio for the K–CLIER, K–CLIER (2k genes), and MultiPLIER settings. The dashed horizontal line indicates random guessing (AUROC = 0.5). As n/p increases, attack performance decreases toward random guessing, indicating reduced privacy risk in higher-rank … view at source ↗
read the original abstract

In transcriptomics, gene-set-aware factorization methods such as the Pathway Level Information Extractor (PLIER) are most effective when trained on large, heterogeneous expression compendia. Yet, many clinically relevant cohorts cannot be pooled into a single dataset due to privacy and governance constraints. We present FPLIER, a federated extension of PLIER that enables distributed training across multiple data holders while incorporating publicly available datasets. Through secure aggregation, FPLIER produces training updates algebraically equivalent to those of a centralized pooled-data approach while keeping expression data local. We evaluate FPLIER across multiple scenarios in two simulated consortia (from the K-CLIER and MultiPLIER studies) and demonstrate stable convergence. We further conduct a systematic analysis of membership inference attacks targeting both intermediate training statistics and the released model. Our results show that privacy risk is governed by the rank of the training expression matrix. Incorporating public data or reducing data dimensionality increases this rank, moving the system toward a full-rank regime in which training and non-training samples become indistinguishable to the attacker, and membership-inference performance approaches random guessing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper presents FPLIER, a federated extension of the PLIER gene-set-aware factorization method for transcriptomics. It claims that secure aggregation enables training updates that are algebraically equivalent to those from centralized pooled-data training while keeping expression data local to each site; evaluates the method on simulated consortia derived from the K-CLIER and MultiPLIER studies, reporting stable convergence; and conducts a membership-inference analysis showing that privacy risk is governed by the rank of the training expression matrix, with incorporation of public data or dimensionality reduction increasing rank and driving attack performance toward random guessing.

Significance. If the algebraic equivalence and rank-governed privacy result are substantiated, the work would enable privacy-preserving collaborative training of pathway-level models on distributed clinical cohorts that cannot be pooled, a practically important capability in transcriptomics. The use of simulated multi-site consortia and systematic membership-inference evaluation provides a concrete starting point for privacy analysis in federated bioinformatics methods.

major comments (3)
  1. [Abstract] Abstract: the central claim that FPLIER produces training updates 'algebraically equivalent' to centralized PLIER via secure aggregation is asserted without any derivation, equations, or proof sketch showing how the PLIER objective or updates are preserved under aggregation.
  2. [Abstract] Abstract: the membership-inference analysis concludes that 'privacy risk is governed by the rank of the training expression matrix' and that performance 'approaches random guessing' in the full-rank regime, yet reports no quantitative attack metrics (AUC, success rate, etc.), no description of the attack implementation, and no comparison across rank regimes.
  3. [Abstract] Abstract: evaluation is limited to the statement of 'stable convergence' on simulated data from two consortia, with no quantitative convergence metrics, no direct comparison of final model parameters or reconstruction error against a centralized baseline, and no implementation details on the secure-aggregation protocol or numerical stability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review. The comments correctly note that the abstract is highly condensed and would benefit from additional detail to support its claims. We address each point below, with revisions to the abstract and cross-references to the full derivations and results already present in the manuscript body.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that FPLIER produces training updates 'algebraically equivalent' to centralized PLIER via secure aggregation is asserted without any derivation, equations, or proof sketch showing how the PLIER objective or updates are preserved under aggregation.

    Authors: The abstract summarizes the result; the full algebraic derivation appears in Methods Section 3.2, where we demonstrate that secure aggregation of the PLIER gradient updates (which are linear in the data matrix) yields exactly the same parameter updates as the centralized objective. We will revise the abstract to include a one-sentence reference to this equivalence and the relevant methods section. revision: yes

  2. Referee: [Abstract] Abstract: the membership-inference analysis concludes that 'privacy risk is governed by the rank of the training expression matrix' and that performance 'approaches random guessing' in the full-rank regime, yet reports no quantitative attack metrics (AUC, success rate, etc.), no description of the attack implementation, and no comparison across rank regimes.

    Authors: Quantitative attack metrics (AUC values across rank regimes), attack implementation details, and comparisons are reported in Results Section 4.3 and Figure 5. The abstract condenses the governing relationship; we will add example AUC figures (e.g., approaching 0.5 in the full-rank case) to the abstract to make the claim self-contained. revision: yes

  3. Referee: [Abstract] Abstract: evaluation is limited to the statement of 'stable convergence' on simulated data from two consortia, with no quantitative convergence metrics, no direct comparison of final model parameters or reconstruction error against a centralized baseline, and no implementation details on the secure-aggregation protocol or numerical stability.

    Authors: Quantitative convergence metrics, parameter and reconstruction-error comparisons to the centralized baseline, and secure-aggregation implementation details (including numerical stability via fixed-point arithmetic) are provided in Section 4.2, Table 1, and Figure 3. We will revise the abstract to reference these key quantitative results and the methods section. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's derivation chain consists of applying standard secure aggregation to PLIER updates to achieve algebraic equivalence with centralized training (a direct property of the external primitive) and analyzing privacy via the rank of the expression matrix (a standard linear-algebra fact). No equations, fitted parameters, or self-citations are presented that reduce these claims to definitions or prior fits within the paper; the equivalence and rank-based indistinguishability follow from the cited external mechanisms without internal reduction. The evaluation on simulated consortia is empirical and does not rely on self-referential predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the implicit domain assumption of perfect secure aggregation is noted below.

axioms (1)
  • domain assumption Secure aggregation can be implemented to deliver updates algebraically identical to centralized training without extra leakage.
    Required for the central equivalence claim.

pith-pipeline@v0.9.1-grok · 5738 in / 1068 out tokens · 26471 ms · 2026-06-28T23:59:19.279816+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    RNA-Seq: a revolutionary tool for transcriptomics

    Z. Wang, M. Gerstein, and M. Snyder. “RNA-Seq: a revolutionary tool for transcriptomics”. In:Nature reviews genetics10.1 (2009), pp. 57–63

  2. [2]

    Single-cell transcriptomics to explore the immune system in health and dis- ease

    M. J. Stubbington, O. Rozenblatt-Rosen, A. Regev, and S. A. Teichmann. “Single-cell transcriptomics to explore the immune system in health and dis- ease”. In:Science358.6359 (2017), pp. 58–63

  3. [3]

    The properties of high-dimensional data spaces: implications for exploring gene and protein expression data

    R. Clarke, H. W. Ressom, A. Wang, J. Xuan, M. C. Liu, E. A. Gehan, and Y. Wang. “The properties of high-dimensional data spaces: implications for exploring gene and protein expression data”. In: Nature reviews cancer8.1 (2008), pp. 37–49

  4. [4]

    Systematic RNA in- terference reveals that oncogenic KRAS-driven can- cers require TBK1

    D. A. Barbie, P. Tamayo, J. S. Boehm, S. Y. Kim, S. E. Moody, I. F. Dunn, A. C. Schinzel, P. Sandy, E. Meylan, C. Scholl, et al. “Systematic RNA in- terference reveals that oncogenic KRAS-driven can- cers require TBK1”. In:Nature462.7269 (2009), pp. 108–112

  5. [5]

    GSVA: gene set variation analysis for microarray and RNA- seq data

    S.Hänzelmann,R.Castelo,andJ.Guinney.“GSVA: gene set variation analysis for microarray and RNA- seq data”. In:BMC bioinformatics14 (2013), pp. 1– 15

  6. [6]

    Pathway-level information extractor (PLIER) for gene expression data

    W. Mao, E. Zaslavsky, B. M. Hartmann, S. C. Seal- fon, and M. Chikina. “Pathway-level information extractor (PLIER) for gene expression data”. In: Nature methods16.7 (2019), pp. 607–610

  7. [7]

    Multi- PLIER: a transfer learning framework for transcrip- tomics reveals systemic features of rare disease

    J. N. Taroni, P. C. Grayson, Q. Hu, S. Eddy, M. Kretzler, P. A. Merkel, and C. S. Greene. “Multi- PLIER: a transfer learning framework for transcrip- tomics reveals systemic features of rare disease”. In: Cell systems8.5 (2019), pp. 380–394

  8. [8]

    Reproducible RNA-seq analysis using recount2

    L. Collado-Torres, A. Nellore, K. Kammers, S. E. Ellis, M. A. Taub, K. D. Hansen, A. E. Jaffe, B. Langmead, and J. T. Leek. “Reproducible RNA-seq analysis using recount2”. In:Nature biotechnology 35.4 (2017), pp. 319–321

  9. [9]

    The cancer genome atlas pan-cancer analysis project

    J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, and J. M. Stuart. “The cancer genome atlas pan-cancer analysis project”. In:Nature ge- netics45.10 (2013), pp. 1113–1120

  10. [10]

    Communication-efficient learning of deep networks from decentralized data,

    H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Arcas. “Communication-efficient learning of deep networks from decentralized data. arXiv”. In:arXiv preprint arXiv:1602.05629 (2016)

  11. [11]

    Technical and legal aspects of federated learning in bioinformatics: applications, challenges and opportunities

    D. Malpetti, M. Scutari, F. Gualdi, J. Van Setten, S. W. van der Laan, S. Haitjema, A. Lee, I. Her- ing, and F. Mangili. “Technical and legal aspects of federated learning in bioinformatics: applications, challenges and opportunities”. In:Frontiers in Dig- ital Health7 (2025), p. 1644291

  12. [12]

    Protocol for interpretable and context- specific single-cell-informed deconvolution of bulk RNA-seq data

    D. Malpetti, F. Mangili, M. Bolis, A. Rinaldi, D. Legouis, L. Ruinelli, P. Cippà, and L. Azz- imonti. “Protocol for interpretable and context- specific single-cell-informed deconvolution of bulk RNA-seq data”. In:STAR protocols6.1 (2025), p. 103670

  13. [13]

    A transfer learning frame- work to elucidate the clinical relevance of altered proximal tubule cell states in kidney disease

    D. Legouis, A. Rinaldi, D. Malpetti, G. Arnoux, T. Verissimo, A. Faivre, F. Mangili, A. Rinaldi, L. Ruinelli, J. Pugin, et al. “A transfer learning frame- work to elucidate the clinical relevance of altered proximal tubule cell states in kidney disease”. In: Iscience27.3 (2024). 10 Malpetti et al

  14. [14]

    MousiPLIER: A Mouse Pathway-Level Information Extractor Model

    S. Zhang, B. J. Heil, W. Mao, M. Chikina, C. S. Greene, and E. A. Heller. “MousiPLIER: A Mouse Pathway-Level Information Extractor Model”. In: eneuro11.6 (2024)

  15. [15]

    Patient privacy in AI-driven omics methods

    J. Zhou, C. Huang, and X. Gao. “Patient privacy in AI-driven omics methods”. In:Trends in Genetics 40.5 (2024), pp. 383–386

  16. [16]

    Secure single-server aggregation with (poly) logarithmic overhead

    J. H. Bell, K. A. Bonawitz, A. Gascón, T. Lepoint, and M. Raykova. “Secure single-server aggregation with (poly) logarithmic overhead”. In:Proceedings of the 2020 ACM SIGSAC conference on computer and communications security. 2020, pp. 1253–1269

  17. [17]

    Secure aggregation for federated learning in flower

    K. H. Li, P. P. B. de Gusmão, D. J. Beutel, and N. D. Lane. “Secure aggregation for federated learning in flower”. In:Proceedings of the 2nd ACM Interna- tional Workshop on Distributed Machine Learning. 2021, pp. 8–14

  18. [18]

    The algorithmic founda- tions of differential privacy

    C. Dwork and A. Roth. “The algorithmic founda- tions of differential privacy”. In:Foundations and trends in theoretical computer science9.3-4 (2014), pp. 211–487

  19. [19]

    Genome-wide association studies

    E. Uffelmann, Q. Q. Huang, N. S. Munung, J. De Vries, Y. Okada, A. R. Martin, H. C. Martin, T. Lappalainen, and D. Posthuma. “Genome-wide association studies”. In:Nature Reviews Methods Primers1.1 (2021), p. 59

  20. [20]

    Federated learning algo- rithms for generalized mixed-effects model (GLMM) on horizontally partitioned data from distributed sources

    W. Li, J. Tong, M. M. Anjum, N. Mohammed, Y. Chen, and X. Jiang. “Federated learning algo- rithms for generalized mixed-effects model (GLMM) on horizontally partitioned data from distributed sources”. In:BMC Medical Informatics and Deci- sion Making22.1 (2022), p. 269

  21. [21]

    Privacy-preserving federated neural network learning for disease- associated cell classification

    S. Sav, J.-P. Bossuat, J. R. Troncoso-Pastoriza, M. Claassen, and J.-P. Hubaux. “Privacy-preserving federated neural network learning for disease- associated cell classification”. In:Patterns3.5 (2022)

  22. [22]

    scFed: federated learning for cell type classification with scRNA-seq

    S. Wang, B. Shen, L. Guo, M. Shang, J. Liu, Q. Sun, and B. Shen. “scFed: federated learning for cell type classification with scRNA-seq”. In:Briefings in Bioinformatics25.1 (2024), bbad507

  23. [23]

    Flimma: a federated and privacy-aware tool for differential gene expression analysis

    O. Zolotareva, R. Nasirigerdeh, J. Matschinske, R. Torkzadehmahani,M.Bakhtiari,T.Frisch,J.Späth, D. B. Blumenthal, A. Abbasinejad, P. Tieri, et al. “Flimma: a federated and privacy-aware tool for differential gene expression analysis”. In:Genome biology22.1 (2021), p. 338

  24. [24]

    Federated deep learning enables cancer subtyping by proteomics

    Z. Cai, E. L. Boys, Z. Noor, A. T. Aref, D. Xavier, N. Lucas, S. G. Williams, J. M. Koh, R. C. Poulos, Y. Wu, et al. “Federated deep learning enables cancer subtyping by proteomics”. In:Cancer Discovery 15.9 (2025), pp. 1803–1818

  25. [25]

    Flower: A Friendly Federated Learning Research Framework

    D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, K. H. Li, T. Parcollet, P. P. B. de GusmÃG, o, et al. “Flower: A friendly federated learning research framework”. In:arXiv preprint arXiv:2007.14390(2020)

  26. [26]

    Comparative analy- sis of open-source federated learning frameworks-a literature-based survey and review

    P. Riedel, L. Schick, R. von Schwerin, M. Reichert, D. Schaudt, and A. Hafner. “Comparative analy- sis of open-source federated learning frameworks-a literature-based survey and review”. In:Interna- tional Journal of Machine Learning and Cybernetics 15.11 (2024), pp. 5257–5278

  27. [27]

    Membership inference attack against princi- pal component analysis

    O. Zari, J. Parra-Arnau, A. Ünsal, T. Strufe, and M. Önen. “Membership inference attack against princi- pal component analysis”. In:International Confer- ence on Privacy in Statistical Databases. Springer. 2022, pp. 269–282

  28. [28]

    The reactome pathway knowledgebase

    A. Fabregat, S. Jupe, L. Matthews, K. Sidiropoulos, M. Gillespie, P. Garapati, R. Haw, B. Jassal, F. Korninger, B. May, et al. “The reactome pathway knowledgebase”. In:Nucleic acids research46.D1 (2018), pp. D649–D655

  29. [29]

    Bayesian method to predict individual SNP genotypes from gene expression data

    E. E. Schadt, S. Woo, and K. Hao. “Bayesian method to predict individual SNP genotypes from gene expression data”. In:Nature genetics44.5 (2012), pp. 603–608

  30. [30]

    A decade of research on genetic privacy: the findings of the GetPreCiSe Center at Vanderbilt University

    C. Slobogin, K. Tellis, E. W. Clayton, J. Clayton, A. Eilmus, and B. A. Malin. “A decade of research on genetic privacy: the findings of the GetPreCiSe Center at Vanderbilt University”. In:Frontiers in Genetics16 (2025), p. 1629386

  31. [31]

    Federated horizon- tally partitioned principal component analysis for biomedical applications

    A. Hartebrodt and R. Röttger. “Federated horizon- tally partitioned principal component analysis for biomedical applications”. In:Bioinformatics Ad- vances2.1 (2022), vbac026

  32. [32]

    G. H. Golub and C. F. Van Loan.Matrix compu- tations (3rd ed.)USA: Johns Hopkins University Press, 1996.isbn: 0801854148

  33. [33]

    Find- ing structure with randomness: Probabilistic algo- rithms for constructing approximate matrix decom- positions

    N. Halko, P.-G. Martinsson, and J. A. Tropp. “Find- ing structure with randomness: Probabilistic algo- rithms for constructing approximate matrix decom- positions”. In:SIAM review53.2 (2011), pp. 217– 288

  34. [34]

    Practical secure aggregation for privacy- preserving machine learning

    K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. “Practical secure aggregation for privacy- preserving machine learning”. In:proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017, pp. 1175–1191. 11 Malpetti et al. A. Standard PLIER training For completeness, ...