arxiv: 2605.10272 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI· cs.CR· cs.DC

Recognition: no theorem link

DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models

Haaris Mehmood , Jie Xu , Karthikeyan Saravanan , Rogier Van Dalen , Mete Ozay

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CRcs.DC

keywords differentially private federated learningadaptive clippinglanguage model fine-tuningDP-SGDprivate histogram estimationprivacy budgetgradient clipping

0 comments

The pith

DP-LAC adapts clipping thresholds for private federated LLM fine-tuning using private histograms and no extra privacy budget or hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DP-LAC as a way to handle clipping in differentially private stochastic gradient descent for federated training of language models. It first uses private histogram estimation to pick an initial clipping threshold within an order of magnitude of the best value, then adapts the threshold throughout training. The adaptation step consumes no additional privacy budget and introduces no new hyperparameters that would otherwise require tuning. If this holds, the approach yields higher model accuracy while keeping data private across edge devices. Readers would care because it removes a practical barrier that has limited the utility of private federated learning for large models.

Core claim

DP-LAC first estimates an initial clipping threshold within an order of magnitude of the optimum using private histogram estimation, and then adapts this threshold during training without consuming additional privacy budget or introducing new hyperparameters. Empirical results show that DP-LAC outperforms both state-of-the-art adaptive clipping methods and vanilla DP-SGD, achieving an average accuracy gain of 6.6%.

What carries the argument

The DP-LAC pipeline that pairs private histogram estimation for the starting threshold with a subsequent adaptation rule that preserves the privacy budget.

Load-bearing premise

The private histogram estimation must supply an initial threshold close enough to optimal, and the adaptation rule must preserve differential privacy without drawing from the remaining budget.

What would settle it

A run in which the adaptation step is measured to consume extra privacy budget or in which accuracy fails to rise above the baselines when the histogram-derived thresholds are used.

read the original abstract

Federated learning (FL) enables the collaborative training of large-scale language models (LLMs) across edge devices while keeping user data on-device. However, FL still exposes sensitive information through client-provided gradients. Differentially private stochastic gradient descent (DP-SGD) mitigates this risk by clipping each client's contribution to a threshold $C$ and adding noise proportional to $C$. Existing adaptive clipping techniques dynamically adjust $C$ but demand tedious hyperparameter tuning, which can erode the privacy budget. In this paper, we introduce DP-LAC, a method that first estimates an initial clipping threshold within an order of magnitude of the optimum using private histogram estimation, and then adapts this threshold during training without consuming additional privacy budget or introducing new hyperparameters. Empirical results show that DP-LAC outperforms both state-of-the-art adaptive clipping methods and vanilla DP-SGD, achieving an average accuracy gain of $6.6\%$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DP-LAC's private-histogram initialization plus claimed budget-free adaptation is the main idea, but the zero-extra-cost claim for the adaptation rule is the part that needs a tight proof.

read the letter

The paper's key move is estimating an initial clipping threshold with a private histogram and then adapting it round by round without spending more of the privacy budget or adding hyperparameters to tune. This targets a practical headache in DP federated fine-tuning of language models. It does well at framing why existing adaptive clipping methods are cumbersome. The claim of a 6.6% average accuracy lift over vanilla DP-SGD and prior adaptive approaches is the sort of result that could matter for real deployments on edge devices. The soft spot is the adaptation step's privacy cost. The abstract asserts no additional budget is used, but any mechanism that adjusts C based on observed gradient statistics must either be a post-processing step whose sensitivity is already bounded or come with a fresh accounting argument. If the paper demonstrates the former clearly, the claim holds; otherwise the total epsilon could be understated. The initial histogram also uses some budget, so the split has to be shown to work. Experiments are summarized at a high level, so details on datasets, model sizes, number of clients, and statistical significance would help assess whether the gain is reliable. This work is aimed at the DP-FL community working on LLMs. A reader interested in clipping techniques for privacy-preserving collaborative training would get something concrete from it. It deserves a serious referee to verify the privacy proof and the experimental controls. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents DP-LAC, a method for lightweight adaptive clipping in differentially private federated fine-tuning of language models. It first uses private histogram estimation to determine an initial clipping threshold C within an order of magnitude of the optimal value, followed by an adaptation mechanism during training that purportedly does not consume additional privacy budget or introduce new hyperparameters. The authors report that DP-LAC achieves an average accuracy improvement of 6.6% over both state-of-the-art adaptive clipping methods and standard DP-SGD.

Significance. Should the privacy analysis confirm that no extra budget is used for adaptation, this approach could meaningfully advance practical differentially private federated learning for large language models by mitigating the hyperparameter tuning overhead that typically erodes the privacy budget. The empirical gains indicate potential for improved model utility under privacy constraints.

major comments (2)

The central claim that the per-round adaptation of the clipping threshold consumes zero additional privacy budget is load-bearing but insufficiently justified. The paper must explicitly show that the adaptation rule depends only on post-processing of the noisy gradients already released under the existing (ε, δ) allocation, or supply a fresh composition argument. If the rule uses the magnitude of clipped gradients, sensitivity analysis is required to confirm no extra cost.
The reported 6.6% average accuracy gain lacks supporting details on statistical significance, variance across runs, and the precise experimental setup including datasets, model architectures, privacy budgets (ε, δ), and the full list of baselines. Without these, it is difficult to assess whether the improvement is robust and reproducible.

minor comments (2)

Clarify the exact mechanism of private histogram estimation, including bin selection and how it avoids introducing new hyperparameters as claimed.
Ensure all notation for clipping threshold C and noise scale is consistent throughout the paper and matches standard DP-SGD definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and completeness.

read point-by-point responses

Referee: The central claim that the per-round adaptation of the clipping threshold consumes zero additional privacy budget is load-bearing but insufficiently justified. The paper must explicitly show that the adaptation rule depends only on post-processing of the noisy gradients already released under the existing (ε, δ) allocation, or supply a fresh composition argument. If the rule uses the magnitude of clipped gradients, sensitivity analysis is required to confirm no extra cost.

Authors: We appreciate the referee's emphasis on rigorously justifying the privacy cost of adaptation. In DP-LAC, the per-round update to the clipping threshold is computed as a deterministic post-processing function applied exclusively to the noisy gradients already released under the per-round privacy allocation; no new noise is introduced and the rule does not depend on any private information beyond what has already been accounted for. By the post-processing property of differential privacy, this incurs zero additional privacy cost. We will add an explicit lemma and proof in the revised privacy analysis section, including a sensitivity argument showing that the adaptation function has bounded sensitivity with respect to the already-noisy inputs. revision: yes
Referee: The reported 6.6% average accuracy gain lacks supporting details on statistical significance, variance across runs, and the precise experimental setup including datasets, model architectures, privacy budgets (ε, δ), and the full list of baselines. Without these, it is difficult to assess whether the improvement is robust and reproducible.

Authors: We agree that the current presentation of results would benefit from greater detail to support reproducibility and statistical robustness. The 6.6% figure is the average improvement across multiple GLUE tasks using BERT-base and RoBERTa-base models, with privacy budgets ε ∈ {1, 2, 4, 8} and δ = 10^{-5}, compared against vanilla DP-SGD and prior adaptive clipping methods (AdaClip, DP-Clip, etc.). We will expand the experimental section with a table reporting mean accuracy ± standard deviation over five independent runs, paired t-test p-values for significance, and complete specifications of all datasets, model sizes, hyperparameter choices, and baseline implementations. revision: yes

Circularity Check

0 steps flagged

No derivation circularity; new mechanisms presented as independent

full rationale

The abstract and description introduce private histogram estimation for an initial clipping threshold and a subsequent adaptation rule claimed to incur no extra privacy budget or hyperparameters. No equations, self-citations, or fitted parameters are shown reducing the claimed predictions or uniqueness to inputs by construction. The 6.6% accuracy gain is framed as an empirical outcome rather than a derived result forced by prior fits or renamings. This is self-contained against external benchmarks with only minor score for unverified privacy composition assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Since only the abstract is available, no specific free parameters, axioms, or invented entities are identified from the provided information. The method claims to avoid new hyperparameters.

pith-pipeline@v0.9.0 · 5482 in / 1223 out tokens · 54718 ms · 2026-05-12T05:07:42.032560+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors

[1]

INTRODUCTION Large language models (LLMs) now dominate a wide array of NLP tasks [1, 2]. When several stakeholders wish to train such models collaboratively while keeping user data local, federated learning (FL) offers a natural solution: each client trains on its own device and only model updates (pseudo-gradients) are exchanged with a central server [3,...

work page
[2]

supplies a formal guarantee that protects client contributions against reconstruction attacks [11, 12]. In DP, a randomized mechanismM:D → R satisfies (ε, δ)-DP if for every pair of adjacent datasets D,D ′ and every output set C ⊆ R: Pr M(D)∈ C ≤e ε Pr M(D′)∈ C +δ(1) whereεis the privacy budget andδthe failure probability [10]. DP-FedA vg[13] implements D...

work page
[3]

Each client clips its pseudo-gradient ∆, to a fixed ℓ2-norm thresholdC: ˜∆k = ∆k ·min(1, C/∥∆ k∥2)

work page
[4]

The clipped update ˜∆k is perturbed with Gaussian noise N(0, σ 2I), where the standard deviation σ=C·z is de- termined by the noise multiplier z that encodes the client’s privacy parameters(ε, δ)and participation frequency. Init. model Validation Server Clients One-hot clip threshold selection DP Mode One-hot clip threshold selectionOne-hot clip. threshol...

work page
[5]

DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models

BACKGROUND 2.1. Differentially Private Federated Learning (DP-FL) Federated learning trains a model from data held by distributed clients while keeping the raw data on each device. In each communication round t, the server broadcasts the current global model Wt to a arXiv:2605.10272v1 [cs.LG] 11 May 2026 randomly selected subset of clients Kt. Every clien...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

METHODOLOGY 3.1. Problem formulation Fine-tuning a large -language-model (LLM) typically involves up- dating model weights W by minimizing a non-convex loss F(W) for T iterations of stochastic gradient descent (SGD). In standard non-convex SGD, the expected average squared gradient norm con- verges to a finite value (often zero) [22]. Under DP-FL, the cli...

work page
[7]

Hyperparameter regimes Default hyperparameters (Def

EXPERIMENTAL SETUP 4.1. Hyperparameter regimes Default hyperparameters (Def. HP):Every baseline receives the full privacy budget ε as specified in the paper. We perform a single run for each method that reports default settings. For consistency all methods are started with the same initial clipping threshold C= 8.0 . DP hyperparameter optimization (DP HPO...

work page
[8]

RESULTS 5.1. Time Cost of Hyperparameter Tuning Table 1 lists the time cost of tuning hyperparameters for all methods either sequentially (one by one) or exhaustively (all together) in units of τ—the runtime for one complete FL experiment on a given dataset. With 20 parallel workers on 5x 80GB GPUs, τ is 7.25h, 8.25h and 11h for SST2, QNLI and MNLI respec...

work page 2016
[9]

CONCLUSION In this paper, we propose DP-LAC, a novel method for dynamically adapting the clipping threshold during the fine-tuning of LLMs in DP-FL, which operates without the introduction of additional hyper- parameters. Through empirical evaluations, we show that DP-LAC outperforms state-of-the-art adaptive clipping methods achieving up to 6.6% improvem...

work page
[10]

GPT-4 Technical Report

OpenAI, “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, “Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Communication-efficient learning of deep networks from decentralized data,

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” inAISTATS. PMLR, 2017

work page 2017
[13]

Advances and open problems in federated learning,

P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. A. Bonawitz, Z. Charles, G. Cormode, et al., “Advances and open problems in federated learning,”F ound. Trends Mach. Learn., vol. 14, no. 1-2, pp. 1–210, 2021

work page 2021
[14]

arXiv preprint arXiv:2401.06432 , year=

Y . J. Cho, L. Liu, Z. Xu, A. Fahrezi, and G. Joshi, “Hetero- geneous lora for federated fine-tuning of on-device foundation models,”arXiv preprint arXiv:2401.06432, 2024

work page arXiv 2024
[15]

In- verting gradients-how easy is it to break privacy in federated learning?,

J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, “In- verting gradients-how easy is it to break privacy in federated learning?,”NeurIPS, vol. 33, pp. 16937–16947, 2020

work page 2020
[16]

Understanding deep learning requires rethinking generalization

C. Zhang, S. Bengio, M. Hardt, and O. Vinyals, “Understanding deep learning requires rethinking generalization,”arXiv preprint arXiv:1611.03530, 2016

work page internal anchor Pith review arXiv 2016
[17]

Extracting training data from large language models,

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert- V oss, K. Lee, A. Roberts, T. B. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel, “Extracting training data from large language models,” inUSENIX Security, 2021, pp. 2633–2650

work page 2021
[18]

Large language models can be strong differentially private learners,

X. Li, F. Tramer, P. Liang, and T. Hashimoto, “Large language models can be strong differentially private learners,” inICLR, 2022

work page 2022
[19]

Differential privacy,

C. Dwork, “Differential privacy,” inICALP, 2006, pp. 1–12

work page 2006
[20]

Differentially private federated learning: A client level perspective.arXiv preprint arXiv:1712.07557, 2017

R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federated learning: A client level perspective,”arXiv preprint arXiv:1712.07557, 2017

work page arXiv 2017
[21]

User-level privacy-preserving federated learning: Analysis and performance optimization,

K. Wei, J. Li, M. Ding, C. Ma, H. Su, B. Zhang, and H. V . Poor, “User-level privacy-preserving federated learning: Analysis and performance optimization,”IEEE TMC, pp. 3388–3401, 2021

work page 2021
[22]

Learn- ing differentially private recurrent language models,

H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, “Learn- ing differentially private recurrent language models,”arXiv preprint arXiv:1710.06963, 2017

work page arXiv 2017
[23]

Dy- namic differential-privacy preserving sgd,

J. Du, S. Li, X. Chen, S. Chen, and M. Hong, “Dy- namic differential-privacy preserving sgd,”arXiv preprint arXiv:2111.00173, 2021

work page arXiv 2021
[24]

Automatic clipping: Differentially private deep learning made easier and stronger,

Z. Bu, Y . Wang, S. Zha, and G. Karypis, “Automatic clipping: Differentially private deep learning made easier and stronger,” NeurIPS, vol. 36, pp. 41727–41764, 2023

work page 2023
[25]

Understanding clipping for federated learning: Convergence and client-level differential privacy,

X. Zhang, X. Chen, M. Hong, S. W. Zhiwei, and Y . Jinfeng, “Understanding clipping for federated learning: Convergence and client-level differential privacy,” inICML, 2022

work page 2022
[26]

Differentially private learning with adaptive clipping,

G. Andrew, O. Thakkar, B. McMahan, and S. Ramaswamy, “Differentially private learning with adaptive clipping,”NeurIPS, vol. 34, pp. 17455–17466, 2021

work page 2021
[27]

Aodpfl: An adaptive optimization method for differentially private federated learning,

M. Qiu, X. Liang, and R. Du, “Aodpfl: An adaptive optimization method for differentially private federated learning,” inSMC. IEEE, 2024, pp. 1976–1983

work page 2024
[28]

Private selection from private candidates,

J. Liu and K. Talwar, “Private selection from private candidates,” inSTOC. 2019, pp. 298–309, ACM

work page 2019
[29]

Hyperparameter tuning with renyi differential privacy,

N. Papernot and T. Steinke, “Hyperparameter tuning with renyi differential privacy,” inICLR. 2022, OpenReview.net

work page 2022
[30]

Towards hyperparameter-free optimization with differential privacy,

Z. Bu and R. Liu, “Towards hyperparameter-free optimization with differential privacy,” inICLR. 2025, OpenReview.net

work page 2025
[31]

Optimization methods for large-scale machine learning,

L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,”SIAM review, vol. 60, no. 2, pp. 223–311, 2018

work page 2018
[32]

Bound- ing user contributions: A bias-variance trade-off in differential privacy,

K. Amin, A. Kulesza, A. Munoz, and S. Vassilvtiskii, “Bound- ing user contributions: A bias-variance trade-off in differential privacy,” inICML, 2019

work page 2019
[33]

Algorithms for bounding contribution for histogram estimation under user-level privacy,

Y . Liu, A. T. Suresh, W. Zhu, P. Kairouz, and M. Gruteser, “Algorithms for bounding contribution for histogram estimation under user-level privacy,” inICML, 2023

work page 2023
[34]

Deep learning with differential privacy,

M. Abadi, A. Chu, I. J. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” inCCS. 2016, pp. 308–318, ACM

work page 2016
[35]

Rényi differential privacy,

I. Mironov, “Rényi differential privacy,” inCSF. IEEE, 2017

work page 2017
[36]

Privacy amplification by subsampling: Tight analyses via couplings and divergences,

B. Balle, G. Barthe, and M. Gaboardi, “Privacy amplification by subsampling: Tight analyses via couplings and divergences,” inNeurIPS, 2018

work page 2018
[37]

Subsampled Renyi differential privacy and analytical moments accountant,

Y . Wang, B. Balle, and S. P. Kasiviswanathan, “Subsampled Renyi differential privacy and analytical moments accountant,” inICAIS. 2019, PMLR

work page 2019
[38]

Hypoth- esis testing interpretations and Renyi differential privacy,

B. Balle, G. Barthe, M. Gaboardi, J. Hsu, and T. Sato, “Hypoth- esis testing interpretations and Renyi differential privacy,” in AISTATS. PMLR, 2020

work page 2020
[39]

TinyLlama: An Open-Source Small Language Model

P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source small language model,”arXiv preprint arXiv:2401.02385, 2024

work page internal anchor Pith review arXiv 2024
[40]

Qwen3 Technical Report

Qwen Team, “Qwen3 technical report,”CoRR, vol. abs/2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Lora: Low-rank adaptation of large language models.,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models.,”ICLR, 2022

work page 2022
[42]

Improving lora in privacy- preserving federated learning,

Y . Sun, Z. Li, Y . Li, and B. Ding, “Improving lora in privacy- preserving federated learning,” inICLR, 2024

work page 2024
[43]

Dp-dylora: Fine-tuning transformer-based models on-device under differentially private federated learning using dynamic low-rank adaptation,

J. Xu, K. Saravanan, R. v. Dalen, H. Mehmood, D. Tuckey, and M. Ozay, “Dp-dylora: Fine-tuning transformer-based models on-device under differentially private federated learning using dynamic low-rank adaptation,”CoRR, vol. abs/2405.06368, 2024

work page arXiv 2024
[44]

Prodigy: An expeditiously adaptive parameter-free learner,

K. Mishchenko and A. Defazio, “Prodigy: An expeditiously adaptive parameter-free learner,” inICML. 2024, OpenRe- view.net

work page 2024