pith. machine review for the scientific record.

arxiv: 2604.06833 · v1 · submitted 2026-04-08 · 💻 cs.CR · cs.LG

Recognition: unknown

FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

Hideya Ochiai, Jiawei Chen, Shunan Zhu, Yonghao Yu

Pith reviewed 2026-05-10 17:17 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords federated learning · small language models · safety alignment · data sanitization · knowledge distillation · unintended data poisoning

The pith

On-device sanitization with distilled classifiers lets federated SLM alignment preserve safety despite toxic user data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FedDetox to address unintended data poisoning in federated learning for small language models. It distills safety knowledge from large aligned models into lightweight classifiers that run directly on edge devices. During the federated alignment process these classifiers identify unsafe samples at the client and replace them with refusal templates, converting potential poisons into safety training signals. Experiments show the resulting global models achieve safety levels comparable to centralized baselines while retaining general utility.
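To make the mechanism concrete, here is a minimal sketch of the client-side sanitization step as the review describes it. Every name here (the classifier interface, the refusal wording, the 0.5 threshold) is an illustrative assumption, not the paper's API.

```python
# Hypothetical sketch of FedDetox-style on-device sanitization.
# Classifier interface, threshold, and template text are assumptions.

REFUSAL_TEMPLATE = "I can't help with that request."  # assumed wording

def sanitize_client_data(samples, classifier, threshold=0.5):
    """Replace samples flagged unsafe with (prompt, refusal) pairs,
    turning would-be poisons into safety training signals."""
    sanitized = []
    for prompt, response in samples:
        p_unsafe = classifier.score(prompt, response)  # assumed interface
        if p_unsafe >= threshold:
            # Keep the prompt, swap in a refusal as the target response.
            sanitized.append((prompt, REFUSAL_TEMPLATE))
        else:
            sanitized.append((prompt, response))
    return sanitized
```

Each client would run something like this before its local alignment step; only the resulting model update, never the raw data, leaves the device.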

Core claim

The central claim is that distilling safety-alignment capability from large teacher models into lightweight student classifiers enables on-device detection of unsafe samples and their replacement with refusal templates. This transforms potential unintended poisons into positive safety signals during federated human preference alignment for SLMs, preserving model safety at levels comparable to centralized baselines.

What carries the argument

Lightweight student classifiers from knowledge distillation that run on resource-constrained devices to identify unsafe samples and replace them with refusal templates before they enter federated training.
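A hedged sketch of what that distillation step might look like: a large safety-aligned teacher (e.g., a Llama Guard-style model) labels a mixed benign/toxic corpus, and a small classifier is fit to those labels. The student backbone, hard-label objective, and training details below are assumptions for illustration, not the paper's recipe.

```python
# Illustrative hard-label distillation of a lightweight safety classifier.
# The paper's actual teacher, student architecture, and data mix may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
student = AutoModelForSequenceClassification.from_pretrained(
    "google/mobilebert-uncased", num_labels=2)  # classes: safe / unsafe

def distill_step(batch_texts, teacher_labels, optimizer):
    """One training step: fit the student to the teacher's safety labels."""
    enc = tokenizer(batch_texts, padding=True, truncation=True,
                    return_tensors="pt")
    out = student(**enc, labels=torch.tensor(teacher_labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```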

If this is right

  • Federated SLM training can incorporate real private user data for alignment without safety degradation from toxic content.
  • General model capabilities on standard tasks remain intact after the sanitization step.
  • Safety enforcement happens locally at each client rather than after data aggregation at the server.
  • Potential poisoning is neutralized at the data source by turning unsafe examples into refusal signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Local sanitization reduces the privacy exposure that would occur if raw client data were sent to a central filter.
  • The same distillation technique could transfer other complex capabilities beyond safety to devices for federated tasks.

Load-bearing premise

The distilled lightweight classifiers must identify unsafe samples accurately enough on edge devices to stop toxic data from reaching the global model.

What would settle it

A controlled experiment in which the student classifiers miss a measurable fraction of unsafe samples and the resulting federated model exhibits lower safety metrics than a centralized baseline would falsify the claim.
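The leakage side of that experiment is easy to instrument. The harness below is an assumed protocol, not the paper's: simulate a detector with degraded recall, measure how much toxic data survives sanitization, then re-run the safety evaluation at each recall level.

```python
# Assumed falsification harness: how much poison leaks past a detector
# with a given recall? (Downstream safety metrics measured separately.)
import random

def leaked_toxic_fraction(n_samples, detector_recall, toxic_rate):
    """Simulate a detector missing (1 - recall) of unsafe samples;
    return the share of training data that stays toxic after sanitization."""
    leaked = 0
    for _ in range(n_samples):
        is_toxic = random.random() < toxic_rate
        caught = is_toxic and random.random() < detector_recall
        leaked += is_toxic and not caught
    return leaked / n_samples

# e.g., 20% toxic data at 90% recall leaves ~2% poisons in training;
# sweeping recall and re-measuring safety would test the claim directly.
```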

Figures

Figures reproduced from arXiv: 2604.06833 by Hideya Ochiai, Jiawei Chen, Shunan Zhu, Yonghao Yu.

Figure 1. An illustrative example of unintended data poisoning, where daily …
Figure 2. Problem Definition: The risk of Unintended Data Poisoning in Federated Alignment. Left: In centralized settings, raw data is aggregated and sanitized via a central Safety Filter before training, ensuring a safe model. Right: In federated settings, raw data remains local and private. “Unaware” clients, possessing a mixture of benign and toxic data like private chat history, perform local tuning without sani…
Figure 3. Overview of the knowledge distillation pipeline for the lightweight Guardian. We construct a mixed dataset of benign and toxic samples to train the …
Figure 4. Hazard category distribution of the distilled dataset. The dataset is …
Figure 5. Impact of poisoning and defense efficacy across models.
Figure 6. Ablation Study: Effectiveness of sanitization strategies. Simply …
Original abstract

As high quality public data becomes scarce, Federated Learning (FL) provides a vital pathway to leverage valuable private user data while preserving privacy. However, real-world client data often contains toxic or unsafe information. This leads to a critical issue we define as unintended data poisoning, which can severely damage the safety alignment of global models during federated alignment. To address this, we propose FedDetox, a robust framework tailored for Small Language Models (SLMs) on resource-constrained edge devices. We first employ knowledge distillation to transfer sophisticated safety alignment capabilities from large scale safety aligned teacher models into light weight student classifiers suitable for resource constrained edge devices. Specifically, during federated learning for human preference alignment, the edge client identifies unsafe samples at the source and replaces them with refusal templates, effectively transforming potential poisons into positive safety signals. Experiments demonstrate that our approach preserves model safety at a level comparable to centralized baselines without compromising general utility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes FedDetox, a framework for federated alignment of small language models (SLMs) on resource-constrained edge devices. It applies knowledge distillation to transfer safety capabilities from large teacher models to lightweight student classifiers, which run on-device to detect unsafe samples in client data during federated human preference alignment; detected unsafe samples are replaced with refusal templates to convert potential poisons into positive safety signals. The central claim is that this on-device sanitization preserves model safety at levels comparable to centralized baselines without compromising general utility.

Significance. If the empirical claims hold, the work would provide a practical mechanism for mitigating unintended data poisoning in federated SLM alignment while respecting device constraints and privacy, addressing a growing concern as public data scarcity pushes reliance on private client data.

major comments (2)
  1. [Abstract] The assertion that 'Experiments demonstrate that our approach preserves model safety at a level comparable to centralized baselines without compromising general utility' supplies no datasets, metrics (e.g., safety scores such as toxicity rates or refusal rates), baselines, or ablation results, rendering the central empirical claim unverifiable from the manuscript.
  2. [Method] Method description (on-device sanitization via student classifiers): the safety-preservation argument rests on the assumption that the distilled lightweight student classifiers achieve sufficient recall on unsafe samples under realistic client distributions; no precision/recall/F1 numbers, comparison to the teacher, or analysis of distillation-induced coverage loss on nuanced toxicity are reported, leaving the load-bearing detection step unsubstantiated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's feedback on improving the verifiability of our claims in the abstract and the substantiation of the on-device sanitization method. We provide point-by-point responses below and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'Experiments demonstrate that our approach preserves model safety at a level comparable to centralized baselines without compromising general utility' supplies no datasets, metrics (e.g., safety scores such as toxicity rates or refusal rates), baselines, or ablation results, rendering the central empirical claim unverifiable from the manuscript.

    Authors: We agree that the abstract, constrained by length, omits specific details on datasets, metrics, baselines, and ablations. The full manuscript's experimental section reports these elements, including evaluations on safety benchmarks with toxicity and refusal rates alongside centralized comparisons. To enhance verifiability directly from the abstract, we will revise it to incorporate key quantitative results and references to the supporting experiments. revision: yes

  2. Referee: [Method] Method description (on-device sanitization via student classifiers): the safety-preservation argument rests on the assumption that the distilled lightweight student classifiers achieve sufficient recall on unsafe samples under realistic client distributions; no precision/recall/F1 numbers, comparison to the teacher, or analysis of distillation-induced coverage loss on nuanced toxicity are reported, leaving the load-bearing detection step unsubstantiated.

    Authors: We concur that explicit metrics for the student classifiers are needed to substantiate the detection performance. The method section outlines the distillation process, but we will add a dedicated analysis in the revised manuscript reporting precision, recall, and F1 scores on unsafe sample detection, comparisons to the teacher model, and evaluation of any coverage loss for nuanced toxicity under client-like distributions. revision: yes
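If the authors add the promised analysis, it could look like the following hedged sketch: score student and teacher predictions against ground-truth labels on a held-out, client-like split. The function name and inputs are placeholders, not the paper's evaluation code.

```python
# Assumed evaluation sketch for the detection step: precision/recall/F1
# on the unsafe class (label 1), student vs. teacher, to expose any
# distillation-induced coverage loss.
from sklearn.metrics import precision_recall_fscore_support

def report_detector_quality(y_true, student_pred, teacher_pred):
    """Compare student and teacher detectors on the same held-out labels."""
    for name, pred in [("student", student_pred), ("teacher", teacher_pred)]:
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, pred, average="binary", pos_label=1)
        print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```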

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential claims

Full rationale

The paper describes an applied system (knowledge distillation to on-device classifiers, replacement of detected unsafe samples with refusal templates during federated alignment) and reports experimental comparisons to centralized baselines. No equations, parameter-fitting steps, predictions derived from fitted values, or first-principles derivations appear in the abstract or described content. The central claim rests on empirical outcomes rather than any chain that reduces to its own inputs by construction. Self-citations, if present in the full text, are not load-bearing for any definitional or predictive step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond standard federated-learning and distillation assumptions; the framework itself is the primary contribution.

axioms (2)
  • domain assumption Knowledge distillation from large safety-aligned models transfers usable safety detection capability to lightweight classifiers suitable for edge devices.
    Invoked in the description of the student classifier creation step.
  • domain assumption On-device identification of unsafe samples is accurate enough that replacement with refusal templates converts poisons into positive signals without introducing new harms.
    Core mechanism of the sanitization process.

pith-pipeline@v0.9.0 · 5464 in / 1166 out tokens · 33477 ms · 2026-05-10T17:17:27.831172+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 21 canonical work pages · 8 internal anchors

  1. P. Villalobos, A. Ho, J. Sevilla, T. Besiroglu, L. Heim, and M. Hobbhahn, “Position: Will we run out of data? Limits of LLM scaling based on human-generated data,” in Proc. of the 41st International Conference on Machine Learning (ICML), 2024, pp. 5170–5192.
  2. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics (AISTATS). PMLR, 2017, pp. 1273–1282.
  3. C. V. Nguyen, X. Shen, R. Aponte, Y. Xia, S. Basu, Z. Hu, J. Chen et al., “A survey of small language models,” arXiv preprint arXiv:2410.20011, 2024.
  4. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” arXiv preprint arXiv:2203.02155, 2022.
  5. H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa, “Llama Guard: LLM-based input-output safeguard for human-AI conversations,” arXiv preprint arXiv:2312.06674, 2023.
  6. J. Zhang, S. Vahidian, M. Kuo, C. Li, R. Zhang, G. Wang, and Y. Chen, “Towards building the federated GPT: Federated instruction tuning,” arXiv preprint arXiv:2305.05644, 2023.
  7. Y. Sun, Z. Li, Y. Li, and B. Ding, “Improving LoRA in privacy-preserving federated learning,” arXiv preprint arXiv:2403.12313, 2024.
  8. J. Bian, L. Wang, L. Zhang, and J. Xu, “LoRA-Fair: Federated LoRA fine-tuning with aggregation and initialization refinement,” arXiv preprint arXiv:2411.14961, 2024.
  9. M. Srewa, T. Zhao, and S. Elmalaki, “PluralLLM: Pluralistic alignment in LLMs via federated learning,” arXiv preprint arXiv:2503.09925, 2025.
  10. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 53728–53741.
  11. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  12. J. Shi, W. Wan, S. Hu, J. Lu, and L. Y. Zhang, “Challenges and approaches for mitigating byzantine attacks in federated learning,” in IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2022, pp. 139–146.
  13. P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Byzantine-tolerant machine learning,” arXiv preprint arXiv:1703.02757, 2017.
  14. T. D. Nguyen, T. Nguyen, P. L. Nguyen, H. H. Pham, K. D. Doan, and K.-S. Wong, “Backdoor attacks and defenses in federated learning: Survey, challenges and future research directions,” Engineering Applications of Artificial Intelligence, vol. 127, p. 107166, 2024.
  15. H. Zhao, J. Hu, and G. Liu, “Revisiting backdoor threat in federated instruction tuning from a signal aggregation perspective,” arXiv preprint arXiv:2602.15671, 2026.
  16. R. Ye, J. Chai, X. Liu, Y. Yang, Y. Wang, and S. Chen, “Emerging safety attack and defense in federated instruction tuning of large language models,” arXiv preprint arXiv:2406.10630, 2024.
  17. P. Pathmanathan, S. Chakraborty, X. Liu, Y. Liang, and F. Huang, “Is poisoning a real threat to LLM alignment? Maybe more so than you think,” arXiv preprint arXiv:2406.12091, 2024.
  18. T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng, “A holistic approach to undesired content detection in the real world,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 12, 2023, pp. 15009–15018.
  19. A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2025.
  20. S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah, “Orca: Progressive learning from complex explanation traces of GPT-4,” arXiv preprint arXiv:2306.02707, 2023.
  21. D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” arXiv preprint arXiv:2209.07858, 2022.
  22. T. Xie, X. Qi, Y. Zeng, Y. Huang, U. M. Sehwag, K. Huang, L. He, B. Wei, D. Li, Y. Sheng et al., “SORRY-Bench: Systematically evaluating large language model safety refusal,” in Proc. of the International Conference on Learning Representations (ICLR), 2025.
  23. A. Dubey, A. Jauhri, A. Pandey, A. Keshvamurthy et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
  24. Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou, “MobileBERT: A compact task-agnostic BERT for resource-limited devices,” in Proc. of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 2158–2170.
  25. J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. Qiu, B. Li, and Y. Yang, “PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference,” arXiv preprint arXiv:2406.15513, 2024.
  26. Z. Lin, Z. Wang, Y. Tong, Y. Wang, Y. Guo, Y. Wang, and J. Shang, “ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation,” arXiv preprint arXiv:2310.17389, 2023.
  27. A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023.
  28. X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, “‘Do anything now’: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” arXiv preprint arXiv:2308.03825, 2024.
  29. A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi, “Tree of attacks: Jailbreaking black-box LLMs automatically,” in Advances in Neural Information Processing Systems (NeurIPS), 2024.
  30. P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “XSTest: A test suite for identifying exaggerated safety behaviours in large language models,” in Proc. of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024, pp. 5377–5400.
  31. Y. Lin, H. Lin, W. Xiong, S. Diao, J. Liu, J. Zhang, R. Pan et al., “Mitigating the alignment tax of RLHF,” in Proc. of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 580–606.
  32. D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in Proc. of the International Conference on Learning Representations (ICLR), 2021.
  33. K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
  34. S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proc. of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 3214–3252.