On Reliability of Efficient Membership Inference Vulnerability Evaluation

Antti Honkela; Gauri Pradhan; Joonas Jälkö; Ossi Räisä

arxiv: 2605.25819 · v1 · pith:RMDSY25Pnew · submitted 2026-05-25 · 💻 cs.LG · cs.CR

Pith reviewed 2026-06-29 23:10 UTC · model grok-4.3

classification 💻 cs.LG cs.CR

keywords membership inference attacks · differential privacy auditing · false positive rate calibration · vulnerability evaluation · LiRA attack · privacy leakage measurement

The pith

Concatenating membership inference scores across samples fails to produce calibrated per-sample false positive rates at low thresholds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how researchers evaluate membership inference attack vulnerabilities in trained models while keeping computational costs manageable. It demonstrates that the common practice of pooling or concatenating attack scores from many individuals and models distorts the false positive rate distribution, so that a single threshold does not correspond to the intended per-sample error rate. This distortion is especially problematic when auditing models that claim differential privacy guarantees, because those audits focus on extremely low false positive regimes. The authors also document a separate positive bias in the standard efficient implementation of the likelihood-ratio attack that arises from finite-population effects. They introduce a simple post-processing step that restores calibration across samples.

Core claim

Evaluating the TPR based on MIA scores concatenated across multiple individuals is not calibrated across the per-sample FPRs. This makes the approach unreliable as a tool for auditing differential privacy. In addition, the commonly used efficient LiRA implementation carries a finite-population bias that produces a positive bias in per-sample vulnerability estimates. A post-processing calibration procedure restores consistent per-sample FPRs when scores from different individuals are combined.
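The calibration failure follows from a short identity. As a sketch, assume each of $n$ samples contributes the same number of out-scores (scores from models trained without that sample) to the pool. The symbols $s_x$, $\tau$, $D$ and $\mathrm{FPR}_x$ are notation introduced here for illustration and are not taken from the paper:

```latex
\mathrm{FPR}_{\mathrm{pool}}(\tau)
  = \frac{1}{n} \sum_{x=1}^{n} \mathrm{FPR}_x(\tau),
\qquad
\mathrm{FPR}_x(\tau) = \Pr\left[\, s_x > \tau \mid x \notin D \,\right].
```

Choosing $\tau$ so that $\mathrm{FPR}_{\mathrm{pool}}(\tau) = \alpha$ fixes only the average. Each $\mathrm{FPR}_x(\tau)$ equals $\alpha$ only if all samples share the same out-score tail mass above $\tau$, which heterogeneous score distributions do not guarantee.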

What carries the argument

A post-processing calibration procedure that adjusts pooled MIA scores so that a single decision threshold yields the intended false-positive rate for each individual sample.
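This page does not spell out the paper's calibration procedure, so the following is only a minimal sketch of one plausible post-processing step, not the authors' method. It assumes that each sample has its own set of reference out-scores and replaces every raw score by its empirical upper-tail rank among them. The function names, the Gaussian score model, and all sizes are hypothetical.

```python
import bisect
import random

def calibrated_score(score, sorted_out_scores):
    """Fraction of this sample's reference out-scores that are >= score."""
    n = len(sorted_out_scores)
    return (n - bisect.bisect_left(sorted_out_scores, score)) / n

def fresh_fpr(sigma, alpha, seed, n_ref=20000, n_eval=20000):
    """Empirical FPR on fresh out-scores when flagging calibrated_score <= alpha."""
    rng = random.Random(seed)
    # Hypothetical out-score model: zero-mean Gaussian with a per-sample scale.
    reference = sorted(rng.gauss(0.0, sigma) for _ in range(n_ref))
    fresh = (rng.gauss(0.0, sigma) for _ in range(n_eval))
    flagged = sum(calibrated_score(s, reference) <= alpha for s in fresh)
    return flagged / n_eval

alpha = 0.01
# Two samples whose raw out-scores differ in scale by a factor of three.
fprs = {sigma: fresh_fpr(sigma, alpha, seed=i) for i, sigma in enumerate((1.0, 3.0))}
print(fprs)
```

Because the transformed score depends only on rank within the sample's own out-distribution, a single threshold `alpha` on the transformed scores should give roughly the same FPR for both samples regardless of their raw scale. The price is that each sample needs enough out-scores to resolve the tail at `alpha`.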

If this is right

Differential privacy audits that rely on uncalibrated concatenated scores can misrepresent the actual leakage at low false-positive rates.
The finite-population bias in efficient LiRA means that previously reported per-sample vulnerabilities are systematically inflated.
Applying the proposed post-processing step produces TPR estimates that correctly reflect per-sample behavior even when scores are pooled for efficiency.
Any vulnerability comparison across models or training methods must either use per-sample calibration or account for the bias introduced by pooling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prior published privacy evaluations that used the uncalibrated concatenation method may need re-examination with the corrected procedure.
The calibration step could be added to existing membership-inference toolkits without changing the underlying attack models.
Similar finite-population corrections might apply to other efficient attack implementations that reuse the same reference models across many targets.

Load-bearing premise

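The efficient evaluation pipeline that the paper critiques assumes that per-sample false positive rates remain correctly calibrated when attack scores from different individuals and models are pooled or concatenated. The paper argues that this assumption fails.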

What would settle it

Compute the empirical FPR for each sample separately on a large collection of target models, then compare that value to the FPR obtained from the single concatenated score distribution at the same nominal threshold; a systematic mismatch would confirm the calibration failure.
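The check described above can be sketched on simulated data. In a real audit, `out_scores[i]` would hold the scores of sample `i` from target models trained without it. Here the scores are drawn from hypothetical Gaussians with different scales, and the helper names are illustrative.

```python
import random

def pooled_threshold(out_scores, nominal_fpr):
    """Threshold taken from the single concatenated out-score distribution."""
    pooled = sorted((s for scores in out_scores for s in scores), reverse=True)
    k = int(nominal_fpr * len(pooled))  # pooled false positives allowed
    return pooled[k]  # scores strictly above this value are flagged

def per_sample_fprs(out_scores, threshold):
    """Empirical FPR of each sample at the shared threshold."""
    return [sum(s > threshold for s in scores) / len(scores) for scores in out_scores]

rng = random.Random(0)
# Hypothetical data: 20000 out-scores per sample, scales 1.0 and 3.0.
out_scores = [[rng.gauss(0.0, sigma) for _ in range(20000)] for sigma in (1.0, 3.0)]
nominal = 0.01
tau = pooled_threshold(out_scores, nominal)
fprs = per_sample_fprs(out_scores, tau)
print(tau, fprs)
```

The pooled FPR matches the nominal value by construction, while in this setup nearly all false positives come from the wide-scale sample. That is the kind of systematic mismatch the check is looking for.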

Figures

Figures reproduced from arXiv: 2605.25819 by Antti Honkela, Gauri Pradhan, Joonas Jälkö, Ossi Räisä.

**Figure 1.** Comparison of MIA evaluation strategies across varying M and FPRs for TabPFN models trained on UCI Adult (top), and UCI Credit (bottom) datasets. Naive concatenation consistently yields higher TPRs, overestimating average privacy risk relative to post-processed concatenation which produces lower and more stable estimates. At small M, Average TPR/Sample is noisy; as M increases, it converges toward Concate…

**Figure 2.** FPC correction can help recover the analytical σ. (a) depicts empirical σ with and without FPC. We observe that post-FPC σ_x closely tracks the expected analytical σ_x. In (b), we observe that FPC recovers the analytical σ_x but with some residual scatter, as depicted by the spread of σ_emp/σ_ana around √FPC. The plots use data from running the simulation described in Section 6.2 with M = 2048, N+ = 1000, d = 50…
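For readers unfamiliar with the finite population correction (FPC), the sketch below shows the textbook version: when `n` items are drawn without replacement from a population of `N`, the standard deviation of the sample mean shrinks by a factor of sqrt((N - n) / (N - 1)) relative to the usual sigma / sqrt(n). This is a generic illustration with made-up sizes, not the simulation from Section 6.2, and the paper's exact FPC definition is not shown on this page.

```python
import math
import random
import statistics

rng = random.Random(0)
N, n, trials = 200, 100, 10000  # hypothetical population size, sample size, repeats

population = [rng.gauss(0.0, 1.0) for _ in range(N)]
sigma = statistics.pstdev(population)  # population standard deviation (divisor N)

# Standard deviation of the sample mean under sampling without replacement.
means = [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]
empirical = statistics.pstdev(means)

naive = sigma / math.sqrt(n)       # formula that ignores the finite population
fpc = (N - n) / (N - 1)            # textbook finite population correction
corrected = naive * math.sqrt(fpc)

print(empirical, naive, corrected)
```

In this setup the empirical value should land close to `corrected` and clearly below `naive`, mirroring the ratio around √FPC described in the caption.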

Original abstract

Membership inference attacks (MIAs) are popular methods for empirically assessing the leakage of sensitive information in the training data through models or statistics learned from the data. The MIA vulnerability is often evaluated through false positive rate (FPR) and true positive rate (TPR) of a binary classifier that tries to predict whether a particular sample was in the training data. However, in order to reliably estimate the TPR especially for low FPR values, a lot of observations are needed, which in case of MIA translates to many target models, leading to large computational cost. To avoid excessive compute requirements, the MIA scores are often averaged over multiple individuals and multiple targeted models. We demonstrate two key weaknesses in this efficient MIA evaluation pipeline. First, we show that evaluating the TPR based on MIA scores concatenated across multiple individuals, commonly used to study vulnerabilities in the very low FPR regime, is not calibrated across the per-sample FPRs. This makes it unreliable as a tool for auditing differential privacy. To solve this, we propose a post-processing method to effectively calibrate the FPR across different samples. Second, we identify a finite population bias in the commonly used efficient likelihood-ratio attack (LiRA) implementation proposed by Carlini et al. 2022, leading to a positive bias in the per-sample vulnerability.
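The abstract's remark that low-FPR estimates need many observations can be made concrete with a binomial back-of-the-envelope calculation. This sketch assumes independent out-scores, which is a simplification, and the function names are illustrative.

```python
import math

def relative_std_error(fpr, n_out):
    """Relative standard error of an empirical FPR estimated from n_out out-scores."""
    return math.sqrt((1.0 - fpr) / (n_out * fpr))

def out_scores_needed(fpr, rel_error):
    """Smallest n_out (up to rounding) whose relative standard error is at most rel_error."""
    return math.ceil((1.0 - fpr) / (fpr * rel_error ** 2))

needed = out_scores_needed(1e-3, 0.1)
print(needed, relative_std_error(1e-3, needed))
```

At an FPR of 10^{-3} and a 10% relative error this comes to roughly 10^5 out-scores. That is far more than one sample can supply from a modest number of target models, and it is what motivates pooling across samples in the first place.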

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that concatenating MIA scores across samples miscalibrates low-FPR TPR estimates and flags a finite-population bias in efficient LiRA.

The main point is that pooling MIA scores across individuals to get TPR at low FPR does not stay calibrated to the per-sample rates, because the score distributions differ across samples. This directly affects how reliable the method is for auditing differential privacy. The paper also identifies a positive bias in the efficient LiRA setup from Carlini et al. due to finite population effects.

The work does a clear job laying out why the pooled ROC curve does not equal the average of the individual per-sample TPRs at the same nominal FPR. The proposed post-processing calibration is a practical step that could be adopted without changing the core attack. The bias point is stated precisely and follows from the way the reference models are constructed.

The softer part is that the abstract gives no numbers on how large these effects are in practice or how well the calibration holds up across models and datasets. Without those details it is hard to judge whether the problems change published conclusions by a lot or only a little. The paper rests on demonstrations rather than closed-form derivations, which is reasonable here but means the empirical checks need to be solid.

This is for researchers who run or rely on MIA-based audits in privacy-preserving ML. Anyone using low-FPR TPR or the efficient LiRA implementation will want to see the details. It is worth sending to peer review because the issues are specific, the logic is consistent, and a validated fix would be useful to the community.

Referee Report

2 major / 2 minor

Summary. The paper claims that concatenating MIA scores across multiple individuals to estimate TPR at low FPR is not calibrated across per-sample FPRs, rendering it unreliable for auditing differential privacy, and proposes a post-processing calibration method to address this. It also identifies a finite population bias in the efficient LiRA implementation that produces a positive bias in per-sample vulnerability estimates.

Significance. If the central claims hold, the work would strengthen empirical privacy evaluation practices by correcting miscalibration in pooled MIA evaluations used for low-FPR analysis and by flagging a bias in a widely adopted LiRA variant. The proposed calibration method represents a constructive, fixable contribution that could be adopted for more reliable DP auditing.

major comments (2)

[Section describing the concatenation weakness and calibration method] The claim that pooled TPR estimation via concatenation is uncalibrated across per-sample FPRs is load-bearing for the unreliability conclusion in DP auditing. The manuscript should include an explicit derivation or counter-example (e.g., in the section introducing the calibration method) showing how heterogeneous per-sample score distributions produce a pooled operating point at nominal low FPR that differs from the average per-sample TPR at that FPR.
[Section on LiRA finite population bias] The finite-population bias in the efficient LiRA implementation is presented as leading to positive bias in per-sample vulnerability. To support this as a practical concern, the paper should quantify the bias magnitude as a function of the number of target models (or provide the exact implementation detail from Carlini et al. 2022 that induces it) in the relevant experimental or theoretical section.
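A counter-example of the kind requested in the first major comment can be worked out in closed form. The sketch below assumes two samples with Gaussian in- and out-score distributions. The parameters are hypothetical and chosen only to make the mismatch visible, and none of this is taken from the manuscript.

```python
import math
from statistics import NormalDist

def upper_tail(x, mu, sigma):
    """P[N(mu, sigma^2) > x], via erfc to stay accurate in the far tail."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

# Hypothetical samples: (score std, in-score mean); out-scores have mean 0.
samples = [(1.0, 1.0), (3.0, 9.0)]
alpha = 1e-3

def pooled_fpr(tau):
    return sum(upper_tail(tau, 0.0, s) for s, _ in samples) / len(samples)

def pooled_tpr(tau):
    return sum(upper_tail(tau, m, s) for s, m in samples) / len(samples)

# Bisection for the single threshold whose pooled FPR equals alpha
# (pooled_fpr is decreasing in tau).
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if pooled_fpr(mid) > alpha:
        lo = mid
    else:
        hi = mid
tau_pool = 0.5 * (lo + hi)

# Per-sample thresholds s * z give every sample an FPR of exactly alpha.
z = NormalDist().inv_cdf(1.0 - alpha)
avg_per_sample_tpr = sum(upper_tail(s * z, m, s) for s, m in samples) / len(samples)

print(pooled_tpr(tau_pool), avg_per_sample_tpr)
```

With these parameters the pooled TPR comes out higher than the average per-sample TPR at the same nominal FPR. The direction and size of the gap depend on the assumed distributions, so the example shows only that the two quantities need not agree.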

minor comments (2)

[Abstract] The abstract states the two weaknesses and the calibration proposal but supplies no equations, experimental details, or error analysis; a one-sentence summary of the calibration method's effect on low-FPR TPR would improve the abstract.
Notation for MIA scores, per-sample FPR, and pooled ROC should be introduced consistently before the first use of the calibration procedure to avoid ambiguity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which will help strengthen the clarity and empirical grounding of our claims. We address each major point below and will incorporate the requested additions in the revised manuscript.

Referee: [Section describing the concatenation weakness and calibration method] The claim that pooled TPR estimation via concatenation is uncalibrated across per-sample FPRs is load-bearing for the unreliability conclusion in DP auditing. The manuscript should include an explicit derivation or counter-example (e.g., in the section introducing the calibration method) showing how heterogeneous per-sample score distributions produce a pooled operating point at nominal low FPR that differs from the average per-sample TPR at that FPR.

Authors: We agree that an explicit counter-example would make the miscalibration claim more transparent. In the revised manuscript we will add a short counter-example (with two heterogeneous score distributions) immediately before the calibration method, in the section introducing the calibration procedure. It will demonstrate that the pooled threshold at nominal FPR=10^{-3} yields a TPR that deviates from the average per-sample TPR at the same per-sample FPR. revision: yes
Referee: [Section on LiRA finite population bias] The finite-population bias in the efficient LiRA implementation is presented as leading to positive bias in per-sample vulnerability. To support this as a practical concern, the paper should quantify the bias magnitude as a function of the number of target models (or provide the exact implementation detail from Carlini et al. 2022 that induces it) in the relevant experimental or theoretical section.

Authors: We will add both the requested quantification and the precise implementation detail. In the revised version we include (i) the exact line from the Carlini et al. 2022 code that produces the finite-population bias (the reuse of the same shadow-model logits without leave-one-out adjustment) and (ii) a new plot and accompanying analysis showing bias magnitude versus number of target models (for 10, 50, 100, and 500 models) on the same datasets used in the paper. This will appear in the LiRA-bias section. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper critiques existing MIA evaluation pipelines by identifying two statistical issues: (1) uncalibrated TPR at low FPR when scores are concatenated across samples due to heterogeneous per-sample distributions, and (2) finite-population bias in the LiRA estimator from Carlini et al. 2022. These follow from direct analysis of ROC operating points and bias in likelihood ratios, without any fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The proposed post-processing calibration is an explicit correction derived from the identified mismatch, not a renaming or ansatz smuggled via prior work. The derivation chain is self-contained against external benchmarks and does not reduce claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced or required by the claims in the abstract; the work critiques statistical properties of existing MIA evaluation procedures.

Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages

[1] M. Aerni, J. Zhang, and F. Tramèr. Evaluations of Machine Learning Privacy Defenses are Misleading. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1271–1284, 2024.

[3] N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramèr. Membership Inference Attacks From First Principles. In 43rd IEEE Symposium on Security and Privacy (SP), pages 1897–1914, 2022.

[4] S. Garg, D. Tsipras, P. Liang, and G. Valiant. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS), 2022.

DOI: https://doi.org/10.24432/C5NC77.

[5] N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeister, and F. Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8044):319–326, 2025.

[6] A. Keinan, M. Shenfeld, and K. Ligett. How Well Can Differential Privacy Be Audited in One Run? In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025 (NeurIPS), 2025.

[7] Y. Liu, Z. Zhao, M. Backes, and Y. Zhang. Membership Inference Attacks by Exploiting Loss Trajectory. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 2085–2098, 2022.

[8] S. K. Murakonda and R. Shokri. ML Privacy Meter: Aiding Regulatory Compliance by Quantifying the Privacy Risks of Machine Learning. CoRR, abs/2007.09339, 2020.

[9] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.

[10] T. Steinke, M. Nasr, and M. Jagielski. Privacy Auditing with One (1) Training Run. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023 (NeurIPS), 2023.

[11] S. Zarifzadeh, P. Liu, and R. Shokri. Low-cost High-power Membership Inference Attacks. In Forty-first International Conference on Machine Learning (ICML 2024), 2024.

[12] Becker, B ... doi:10.24432/C5XW20, 1996.

A Appendix. A.1 MIA Evaluation With Oracle Access. In this setting, the attacker uses all 4095 models except the target model to compute per-sample statistics for S_x^out and S_x^in against M target models. It helps isolate the behavior of the MIA evaluation procedure from finite-M estimation error of the in- and out-distributions. In particular, it lets us s...
