FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning
Pith reviewed 2026-05-08 08:52 UTC · model grok-4.3
The pith
FedAttr identifies which clients trained on watermarked documents in federated LLM fine-tuning by differencing paired secure-aggregation queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedAttr is a client-level attribution protocol for federated learning of large language models. It produces an unbiased estimator of each client's update via paired-subset differencing of secure aggregates, scores the estimates with differential watermark detection, and aggregates scores across rounds with the Stouffer method. The protocol is shown to bound mutual-information leakage by O(d*/N) per round while preserving federated-learning performance. Experiments report a 100% true-positive rate and a 0% false-positive rate for attribution, with 6.3% overhead relative to standard training time.
What carries the argument
The paired-subset-difference mechanism that estimates an individual client's update from the difference of two secure-aggregation results on complementary client subsets.
Load-bearing premise
The watermark detector still works when fed the noisy estimates obtained by subset differencing instead of clean individual client updates.
What would settle it
Run FedAttr on a controlled federation in which a known subset of clients receive watermarked documents and verify whether the protocol correctly flags exactly those clients across multiple random seeds and subset choices.
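The load-bearing mechanism invites a toy numerical check. The sketch below uses a hypothetical pairing rule (each non-target client joins exactly one of the two secure-aggregation subsets uniformly at random, so other clients cancel in expectation; the paper's actual subset-pairing rule may differ) to verify the unbiasedness claim on synthetic updates:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 32, 64                       # toy federation: clients x model dimension
updates = rng.normal(size=(N, d))   # hypothetical per-client updates
target = 0                          # client whose update the server estimates

def paired_subset_estimate(updates, target, rng):
    # Hypothetical construction: assign every non-target client to one of
    # the two aggregation subsets uniformly at random; the target always
    # joins the first. Then E[agg(S1) - agg(S2)] = u_target, because each
    # other client contributes +u_j or -u_j with equal probability.
    side = rng.integers(0, 2, size=updates.shape[0])  # 0 -> S1, 1 -> S2
    side[target] = 0
    return updates[side == 0].sum(axis=0) - updates[side == 1].sum(axis=0)

# A single estimate is noisy (its variance is the summed variance of the
# other clients' updates); averaging many paired queries recovers the
# target's update, which is exactly the unbiasedness claim.
est = np.mean([paired_subset_estimate(updates, target, rng)
               for _ in range(10000)], axis=0)
print(np.abs(est - updates[target]).max())  # shrinks as repeats grow
```

The same simulation makes the referee's later variance concern visible: any single differenced estimate deviates substantially from the true update, and only the expectation is exact.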
Figures
Original abstract
Watermark radioactivity testing type of methods can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing works have proved their effectiveness in centralized LLM fine-tuning. However, this type of method faces several challenges and remains underexplored in federated learning (FL), a widely-applied paradigm for fine-tuning LLMs collaboratively on private data across different users. FL mainly ensures privacy through secure aggregation (SA), which allows the server to aggregate updates while keeping clients' updates private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client's update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer method. We theoretically show that FedAttr produces an unbiased estimator of each client's update with bounded mutual information leakage (i.e., $O(d^*/N)$ per-round update). Moreover, FedAttr empirically achieves 100% TPR and 0% FPR, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents FedAttr, a protocol for privacy-preserving client-level attribution in federated LLM fine-tuning. It estimates each client's update by differencing two secure-aggregation queries using a paired-subset mechanism, scores the estimates with a watermark radioactivity detector via differential scoring, and aggregates the scores across multiple rounds using the Stouffer method. The authors claim that this yields an unbiased estimator of client updates with mutual-information leakage bounded by O(d*/N) per round, and that it achieves a 100% true-positive rate and a 0% false-positive rate in experiments, outperforming baselines with only 6.3% overhead.
Significance. Should the central claims be substantiated, this contribution would be significant for enabling attribution of data usage in federated settings without violating privacy, which is crucial for LLM fine-tuning on sensitive or copyrighted data. The approach bridges watermarking techniques with secure aggregation in FL, potentially setting a new standard for accountable collaborative training.
major comments (2)
- [Attribution pipeline and empirical evaluation] The attribution pipeline (described in the abstract and §3): The empirical claim of 100% TPR and 0% FPR depends on the watermark detector remaining effective on the noisy estimates obtained via paired-subset differencing rather than clean per-client updates. While the estimator is unbiased in expectation, the variance injected by other clients' updates (scaled by subset inclusion probabilities) is not shown to preserve detection power; no targeted analysis or ablation confirms that the signal survives this noise, which is load-bearing for the perfect-detection result.
- [Theoretical analysis] Theoretical analysis (abstract): The O(d*/N) mutual-information leakage bound is asserted to follow from the protocol construction via standard arguments, but without an explicit derivation, proof sketch, or statement of assumptions (e.g., on subset selection probabilities or the effect of the differencing noise), it cannot be verified whether the bound is tight or independent of protocol parameters.
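The referee's variance concern can be made concrete with a small Monte Carlo. Everything below is a stand-in (a linear watermark-score detector, Gaussian updates, a random-partition differencing scheme), so it only illustrates how subset differencing inflates detector noise, not the paper's actual watermark detector:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, rounds = 32, 256, 1500
w = rng.normal(size=d)
w /= np.linalg.norm(w)  # hypothetical watermark direction (unit vector)

def diff_estimate(updates, target, rng):
    # Toy paired-subset differencing: every other client lands in one of
    # the two aggregates at random, so it cancels only in expectation.
    side = rng.integers(0, 2, size=updates.shape[0])
    side[target] = 0
    return updates[side == 0].sum(axis=0) - updates[side == 1].sum(axis=0)

clean_scores, noisy_scores = [], []
for _ in range(rounds):
    updates = rng.normal(size=(N, d))              # fresh unit-variance updates
    clean_scores.append(updates[0] @ w)            # detector score on raw update
    noisy_scores.append(diff_estimate(updates, 0, rng) @ w)  # after differencing

# Differencing keeps the score unbiased but inflates its standard
# deviation by roughly sqrt(N - 1), so per-round detection power drops
# and multi-round combination has to make up the difference.
print(np.std(clean_scores), np.std(noisy_scores))
```

Under these toy assumptions the noise inflation is about sqrt(N - 1), which quantifies how many rounds of Stouffer combination would be needed to recover the power of a clean per-client score.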
minor comments (2)
- [Abstract] Abstract: The opening phrase 'Watermark radioactivity testing type of methods' is grammatically awkward and should be revised for clarity (e.g., 'Watermark radioactivity testing methods').
- [Experimental results] The reported 6.3% overhead is given relative to FL training time, but the measurement basis (wall-clock time, communication volume, or FLOPs) is not specified, making it difficult to assess practicality.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify key aspects of the attribution pipeline and theoretical analysis. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.
Point-by-point responses
-
Referee: [Attribution pipeline and empirical evaluation] The attribution pipeline (described in the abstract and §3): The empirical claim of 100% TPR and 0% FPR depends on the watermark detector remaining effective on the noisy estimates obtained via paired-subset differencing rather than clean per-client updates. While the estimator is unbiased in expectation, the variance injected by other clients' updates (scaled by subset inclusion probabilities) is not shown to preserve detection power; no targeted analysis or ablation confirms that the signal survives this noise, which is load-bearing for the perfect-detection result.
Authors: The reported 100% TPR and 0% FPR results are obtained by running the full FedAttr protocol, which includes the paired-subset differencing step to produce the noisy per-client estimates before applying the watermark detector and Stouffer combination. This provides direct empirical evidence that detection power is preserved under the noise levels present in the evaluated configurations. We agree, however, that a more targeted analysis would strengthen the claim. In the revision we will add an ablation subsection that isolates the effect of differencing noise, reporting signal-to-noise ratios and detection curves as functions of subset size and client count. revision: yes
-
Referee: [Theoretical analysis] Theoretical analysis (abstract): The O(d*/N) mutual-information leakage bound is asserted to follow from the protocol construction via standard arguments, but without an explicit derivation, proof sketch, or statement of assumptions (e.g., on subset selection probabilities or the effect of the differencing noise), it cannot be verified whether the bound is tight or independent of protocol parameters.
Authors: We concur that an explicit derivation would make the bound easier to verify. The O(d*/N) bound arises because each client's update appears in the differenced aggregate with probability scaling as O(1/N) under uniform random subset selection, and the resulting noisy linear combination leaks bounded mutual information by standard information-theoretic arguments. In the revised manuscript we will include a concise proof sketch in the appendix that states the assumptions (uniform random subset selection without replacement, round independence) and derives the bound step by step. revision: yes
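For concreteness, one plausible shape of that derivation (a reconstruction from the rebuttal's outline, not the paper's actual proof; it treats the N - 1 other clients' contributions as an effective Gaussian noise channel over the d* watermark-relevant coordinates, with a common per-coordinate update variance as a simplifying assumption):

```latex
% D: released paired difference; u_j: target client's update restricted
% to the d^* watermark-relevant coordinates; per-coordinate update
% variance \sigma^2 for all clients (simplifying assumption).
\begin{align*}
D &= u_j + \textstyle\sum_{k \neq j} \varepsilon_k u_k,
   \qquad \varepsilon_k \in \{-1,+1\} \text{ i.i.d.\ uniform} \\
I(u_j; D)
  &\le \frac{1}{2} \sum_{m=1}^{d^*}
     \log\!\left(1 + \frac{\sigma^2}{(N-1)\,\sigma^2}\right)
   && \text{(Gaussian channel capacity; interference treated as noise)} \\
  &\le \frac{d^*}{2(N-1)} = O\!\left(\frac{d^*}{N}\right)
   && \text{(using } \log(1+x) \le x\text{)}
\end{align*}
```

Whether the bound is tight then depends on how far the interference term is from Gaussian, which is presumably where the round-independence assumption enters.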
Circularity Check
No circularity in derivation chain
Full rationale
The paper's core theoretical claim of an unbiased estimator with MI leakage bound O(d*/N) is derived directly from the paired-subset-difference protocol using standard information-theoretic arguments on the secure aggregation mechanism. This does not reduce to a fitted parameter, self-citation, or definitional equivalence. The empirical TPR/FPR results are presented as experimental outcomes on the noisy estimates rather than predictions forced by construction. No load-bearing step matches any enumerated circularity pattern, and the derivation remains self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Watermark detector remains reliable on differenced estimates rather than raw client updates
- standard math Stouffer method correctly aggregates independent p-values across rounds
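The second ledger entry is easy to make concrete. A minimal sketch of Stouffer's method using only the Python standard library (the per-round p-values below are invented for illustration):

```python
from statistics import NormalDist

nd = NormalDist()

def stouffer(p_values):
    """Combine independent one-sided p-values via Stouffer's method."""
    z = [nd.inv_cdf(1 - p) for p in p_values]  # each p-value -> z-score
    z_comb = sum(z) / len(z) ** 0.5            # combined z, unit variance
    return 1 - nd.cdf(z_comb)                  # back to a combined p-value

# Per-round evidence that is individually weak (p = 0.1 per round)
# becomes strong when combined across ten rounds, while null rounds
# (p = 0.5) stay null.
print(stouffer([0.1] * 10), stouffer([0.5] * 10))
```

This is why the protocol can tolerate noisy per-round scores: the combined z grows with the square root of the number of rounds.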