arxiv: 2606.10217 · v1 · pith:USMXHT2Rnew · submitted 2026-06-08 · 💻 cs.LG · cs.CR

Alignment Defends LLMs from Property Inference Attacks

Pith reviewed 2026-06-27 17:03 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords property inference attacks · LLM alignment · DPO · GRPO · model defenses · confidentiality · fine-tuning

The pith

Alignment-based defenses mitigate property inference attacks on LLMs by reshaping output distributions after training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that post-training alignment can reduce the success of attacks that extract sensitive dataset-level properties from fine-tuned language models. Existing defenses modify the training data distribution, so they require access to the original data and retraining, which is impractical when the data is unavailable or the model is already deployed. Instead, the approach adapts Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) to steer outputs toward a chosen target property ratio. The reported experiments indicate that attack performance drops while model utility remains largely intact. This enables confidentiality protections without modifying the training data or retraining the model.

Core claim

By adapting DPO and GRPO frameworks, the model’s output distribution can be reshaped towards a target property ratio via post-training alignment, effectively mitigating property inference attacks without modifying the training data or requiring retraining.

What carries the argument

Adaptation of RLHF frameworks (DPO and GRPO) to construct preference pairs and rewards that enforce a target property ratio in outputs.
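
The preference-pair construction described above can be illustrated with a minimal sketch. The paper's actual construction is not given on this page, so everything here is an assumption: the function name `build_preference_pairs`, the dictionary keys, and the rule that a fraction `target_ratio` of prompts prefers the response exhibiting the property while the rest prefer the response that does not.

```python
import random


def build_preference_pairs(examples, target_ratio, seed=0):
    """Build DPO-style (prompt, chosen, rejected) records.

    Hypothetical input format: each example is a dict with the keys
    "prompt", "with_property" and "without_property".
    A fraction `target_ratio` of the prompts prefers the response that
    exhibits the property; the remaining prompts prefer the other one.
    """
    if not 0.0 <= target_ratio <= 1.0:
        raise ValueError("target_ratio must lie in [0, 1]")

    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)

    # Indices of the prompts whose preferred answer shows the property.
    n_prefer_property = round(target_ratio * len(examples))
    prefer_property = set(order[:n_prefer_property])

    pairs = []
    for i, ex in enumerate(examples):
        if i in prefer_property:
            chosen, rejected = ex["with_property"], ex["without_property"]
        else:
            chosen, rejected = ex["without_property"], ex["with_property"]
        pairs.append({"prompt": ex["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs
```

Note that this sketch needs a response with and without the property for every prompt, which is exactly the labeling requirement the referee report questions.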

If this is right

  • Property inference attacks achieve lower success rates after applying the defense.
  • Models maintain utility on standard tasks despite the alignment adjustments.
  • Defenses apply to already fine-tuned and deployed models without data access.
  • Both DPO and GRPO adaptations provide effective mitigation options.
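
For the GRPO option listed above, a rough sketch of how a ratio-seeking reward could feed group-relative advantages is shown here. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the GRPO formulation in DeepSeekMath; the reward function `ratio_reward` is purely hypothetical and is not taken from the paper, and the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def ratio_reward(has_property, target_ratio):
    """Hypothetical reward for one group of sampled responses.

    `has_property` holds one boolean per response. Responses that would
    move the group's property ratio toward `target_ratio` get reward 1.
    """
    group_ratio = sum(has_property) / len(has_property)
    if group_ratio < target_ratio:
        # Too few responses show the property: reward those that do.
        return [1.0 if h else 0.0 for h in has_property]
    if group_ratio > target_ratio:
        # Too many responses show the property: reward those that do not.
        return [0.0 if h else 1.0 for h in has_property]
    return [0.0] * len(has_property)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With `has_property = [True, True, True, False]` and a target of 0.5, only the last response receives a positive advantage, so the policy update would push the group ratio down toward the target.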

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar alignment strategies might apply to other inference attacks beyond property inference.
  • The method could extend to protecting against membership inference if target ratios are defined appropriately.
  • Choosing the target ratio might require domain knowledge but avoids revealing the sensitive property itself.

Load-bearing premise

A suitable target property ratio can be chosen and preference pairs or rewards constructed without knowledge of the actual sensitive property in the dataset.

What would settle it

An experiment in which, after the DPO or GRPO defense is applied, a property inference attack still achieves a success rate comparable to that on the undefended model.
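
A minimal sketch of such a check follows. It assumes a generation-based attacker that labels sampled outputs and estimates the property ratio as the labeled fraction; this attack model, the helper names, and the `tolerance` threshold are all illustrative assumptions, not the paper's evaluation protocol.

```python
def estimated_ratio(labels):
    """Attacker's estimate: fraction of sampled outputs showing the property."""
    return sum(labels) / len(labels)


def attack_error(labels, true_ratio):
    """Absolute gap between the attacker's estimate and the true ratio."""
    return abs(estimated_ratio(labels) - true_ratio)


def defense_fails(undefended_labels, defended_labels, true_ratio, tolerance=0.05):
    """True if the attack is about as accurate on the defended model.

    `tolerance` is a hypothetical slack for deciding that two attack
    errors are comparable.
    """
    err_before = attack_error(undefended_labels, true_ratio)
    err_after = attack_error(defended_labels, true_ratio)
    return err_after <= err_before + tolerance


# Toy labels: the undefended model leaks a ratio of 0.7, the defended one 0.5.
undefended = [True] * 7 + [False] * 3
defended = [True] * 5 + [False] * 5
print(defense_fails(undefended, defended, true_ratio=0.7))  # False
```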

Figures

Figures reproduced from arXiv: 2606.10217 by Chhavi Yadav, Kamalika Chaudhuri, Pengrun Huang, Ruihan Wu.

Figure 2: Effect of alignment on word-frequency distribution. After defense, word frequencies become less reflective of the underlying training distribution. Notably, DPO exhibits more abrupt changes in word frequency compared to GRPO, consistent with its stronger generalization to adversarial prompts.

Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted through property inference attacks, posing a confidentiality risk. Existing defenses against these attacks primarily operate by modifying the training data distribution and hence require access to the original data and retraining the model, limiting their applicability to settings where data is unavailable or models are already deployed. In this work, we propose alignment-based defenses for mitigating property inference attacks in LLMs. Our approach reshapes the model's output distribution towards a target property ratio via post-training alignment, without modifying the training data. In particular, we adapt two widely used RLHF frameworks--Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO)--as our defenses by constructing preference pairs and defining a specific reward function respectively. Through comprehensive experiments, we show that our alignment-based defenses effectively mitigate property inference attacks while maintaining a strong utility-confidentiality tradeoff.
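
For readers unfamiliar with the DPO objective the abstract refers to, here is a small sketch of the standard per-pair DPO loss, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]). The function name, argument names, and the default `beta=0.1` are illustrative choices; how the paper selects the chosen and rejected responses is not shown here.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (`logp_*`) and under the frozen reference model (`ref_logp_*`).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-probability relative to the rejected one.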

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that post-training alignment via adapted DPO (preference pairs) and GRPO (reward function) can reshape an LLM's output distribution to a chosen target property ratio, thereby mitigating property inference attacks on dataset-level sensitive properties while preserving a strong utility-confidentiality tradeoff, without requiring access to or modification of the original training data.

Significance. If the central mechanism can be realized without presupposing knowledge of the secret property, the result would be significant for practical deployment of LLMs: it offers a defense applicable to already-trained models, unlike prior data-distribution defenses that mandate retraining. The reuse of standard RLHF frameworks is a practical strength that could facilitate adoption.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (defense construction): The DPO adaptation constructs preference pairs differentiated by the target property ratio, and the GRPO adaptation defines a reward function that likewise requires property-specific labeling or generation of responses; both steps presuppose the defender possesses or can access the sensitive property to create the necessary data, which directly contradicts the threat model in which the property is unknown to the defender and is precisely the information the attack seeks to extract.
  2. [Experiments] Experiments section: The abstract asserts that comprehensive experiments show effective mitigation and a strong utility-confidentiality tradeoff, yet the high-level description provides no attack success rates, baseline comparisons, utility metrics, or details on how the target ratio was selected and validated; without these quantitative anchors the central empirical claim cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: Including one or two headline quantitative results (e.g., attack success rate reduction and utility delta) would strengthen the summary of the experimental findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where revisions to the manuscript are warranted.

  1. Referee: [Abstract and §3] Abstract and §3 (defense construction): The DPO adaptation constructs preference pairs differentiated by the target property ratio, and the GRPO adaptation defines a reward function that likewise requires property-specific labeling or generation of responses; both steps presuppose the defender possesses or can access the sensitive property to create the necessary data, which directly contradicts the threat model in which the property is unknown to the defender and is precisely the information the attack seeks to extract.

    Authors: The referee correctly notes that constructing the adapted DPO preference pairs and GRPO reward function requires the ability to label or generate responses according to the target property. This assumption is implicit in the current defense design. We will revise the abstract, threat model section, and §3 to state explicitly that the defender is assumed to have (or be able to obtain) sufficient access to the property to create the alignment data, for example when the defender wishes to enforce a specific ratio for a known sensitive attribute. This clarifies rather than contradicts the setting and removes any implication that the defense applies to completely unknown properties. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts that comprehensive experiments show effective mitigation and a strong utility-confidentiality tradeoff, yet the high-level description provides no attack success rates, baseline comparisons, utility metrics, or details on how the target ratio was selected and validated; without these quantitative anchors the central empirical claim cannot be assessed.

    Authors: We agree that the abstract and any high-level overview omit the specific quantitative results. The experiments section of the manuscript contains the requested details (attack success rates before/after defense, baseline comparisons, utility metrics such as downstream task accuracy and perplexity, and target-ratio selection via validation sweeps). To improve accessibility, we will expand the abstract with key numerical results and ensure the experiments section foregrounds these metrics with explicit tables and selection methodology. revision: yes
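
Of the utility metrics named in the response above, perplexity is the simplest to state precisely: the exponential of the negative mean token log-probability. The sketch below is a generic illustration; the function name and input format are assumptions, and it says nothing about the values the paper reports.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A utility comparison would then evaluate this on the same held-out text before and after the defense and report the difference.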

Circularity Check

0 steps flagged

Empirical defense paper with no derivation chain or self-referential predictions

This paper proposes an empirical defense method adapting DPO and GRPO for post-training alignment to mitigate property inference attacks. It reports experimental results on attack mitigation and utility tradeoffs without any mathematical derivation, first-principles predictions, fitted parameters presented as outputs, or load-bearing self-citations. The central claims rest on experimental outcomes rather than reducing to inputs by construction, satisfying the criteria for a self-contained empirical contribution with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of alignment for controlling dataset-property leakage; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5704 in / 991 out tokens · 22143 ms · 2026-06-27T17:03:36.627330+00:00 · methodology
