
arxiv: 2605.22373 · v2 · pith:7G47O2GGnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL

Boundary-targeted Membership Inference Attacks on Safety Classifiers

Pith reviewed 2026-05-25 05:41 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords membership inference attacks · safety classifiers · privacy leakage · mental health conversations · boundary examples · machine learning privacy · emotional support detection

The pith

Safety classifiers leak training data when attacked on low-confidence boundary examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Safety classifiers trained on sensitive conversations about distress and mental health can reveal which examples were used in training. The work tests the idea that examples where the model shows least confidence are especially revealing because the classifier falls back on memorization to resolve them. A new selection method that targets these boundary cases recovers 19 percent of flagged distress conversations at a 5 percent false-positive rate, which is 3.5 times the rate achieved by prior membership inference techniques alone. The authors also show that simple content filtering does not hide these examples and that noise-based defenses can reduce the leak.

Core claim

The paper claims that a boundary-targeted selection strategy, which prioritizes low-confidence examples, amplifies the membership signal enough to let an adversary recover 19 percent of the conversations a safety classifier flagged as indicating user distress, at a 5 percent false-positive rate. This holds for a fine-tuned classifier that detects users who may require emotional support, and the improvement is 3.5 times over state-of-the-art membership inference methods alone. The authors further characterize these boundary examples and report that content-based filtering fails to protect them while existing noise strategies reduce their susceptibility.
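The two headline numbers can be combined to estimate how well the prior methods did on their own. The sketch below is only arithmetic on the figures quoted above, and it assumes that "3.5 times" is a ratio of true-positive rates measured at the same 5 percent false-positive rate; the summary does not state the baseline figure directly, and the variable names are illustrative.

```python
# Figures quoted in the summary above.
boundary_tpr = 0.19   # recovery rate of the boundary-targeted attack
fpr = 0.05            # shared false-positive operating point
lift = 3.5            # reported improvement over prior MIAs

# Assumption: "3.5 times" is a ratio of true-positive rates at the same FPR.
implied_baseline_tpr = boundary_tpr / lift

# An attack that guesses at random has a TPR equal to its FPR.
chance_tpr = fpr

print(f"implied baseline TPR: {implied_baseline_tpr:.3f}")
print(f"chance TPR at this FPR: {chance_tpr:.3f}")
```

Under that assumption the implied baseline is about 5.4 percent, only slightly above the 5 percent a random guesser would reach at this operating point, which would mean most of the reported leakage shows up only after boundary targeting.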

What carries the argument

boundary-targeted selection strategy that identifies low-confidence examples to amplify membership signals
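A minimal sketch of what such a selection step and its evaluation metric might look like, assuming the adversary has a true-label confidence and a membership score per candidate example. The function names, the 10 percent default fraction, and the thresholding rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_boundary(true_label_conf, fraction=0.10):
    """Return indices of the lowest-confidence examples (the 'boundary' set)."""
    conf = np.asarray(true_label_conf, dtype=float)
    k = max(1, int(np.ceil(fraction * conf.size)))
    return np.argsort(conf, kind="stable")[:k]

def tpr_at_fpr(scores, is_member, target_fpr=0.05):
    """True-positive rate when at most target_fpr of non-members are flagged."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    non_member = np.sort(scores[~is_member])[::-1]          # descending
    allowed = int(np.floor(target_fpr * non_member.size + 1e-9))
    # Flag scores strictly above the (allowed+1)-th highest non-member score,
    # so no more than `allowed` non-members can be flagged.
    threshold = non_member[allowed] if allowed < non_member.size else -np.inf
    return float((scores[is_member] > threshold).mean())

# Hypothetical usage: attack only the boundary subset.
conf = np.array([0.9, 0.1, 0.5, 0.2, 0.95, 0.8, 0.3, 0.6, 0.7, 0.99])
boundary_idx = select_boundary(conf, fraction=0.2)
```

Restricting the attack to `boundary_idx` and then reporting `tpr_at_fpr` on that subset is one way to read "prioritizes low-confidence examples"; the paper may rank or weight examples differently.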

If this is right

  • Content-based filtering leaves boundary examples exposed to membership inference.
  • Noise addition strategies reduce the leakage from low-confidence examples.
  • The attack succeeds because of localized generalization failures: the model memorizes ambiguous training cases instead of learning a rule that covers them.
  • The 19 percent recovery rate at 5 percent false-positive rate applies specifically to emotional-support safety classifiers.
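The noise-based defense in the list above is described in the Figure 4 caption as Laplace output perturbation. A minimal sketch of that idea follows, assuming the noise is added to the released class probabilities; where exactly the paper injects the noise and which scale it uses are not stated here, so `laplace_perturb` and its parameters are hypothetical.

```python
import numpy as np

def laplace_perturb(probs, scale, rng):
    """Add Laplace noise to class probabilities, then clip and renormalize."""
    probs = np.asarray(probs, dtype=float)
    noisy = probs + rng.laplace(loc=0.0, scale=scale, size=probs.shape)
    noisy = np.clip(noisy, 1e-12, None)              # keep values positive
    return noisy / noisy.sum(axis=-1, keepdims=True)  # rows sum to 1 again

rng = np.random.default_rng(0)
clean = np.array([[0.55, 0.45],    # low-confidence (boundary) prediction
                  [0.98, 0.02]])   # high-confidence prediction
released = laplace_perturb(clean, scale=0.1, rng=rng)
```

A larger `scale` hides more of the confidence signal the attack relies on but also flips more boundary predictions, which is the privacy-utility trade-off the figure plots.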

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar low-confidence targeting may expose training data in other moderation or safety models beyond emotional support detection.
  • Auditing or removing ambiguous examples from training sets could shrink the attack surface.
  • Uncertainty estimates themselves may become a new privacy signal if released or observable by adversaries.

Load-bearing premise

Low-confidence examples mark places where the model relies on memorization instead of generalization.

What would settle it

An experiment in which attacks on high-confidence examples recover as many or more training samples as the low-confidence boundary attack would falsify the central claim.
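One way to run that comparison is to compute a membership-inference AUC separately on the lowest- and highest-confidence slices of the same candidate pool. The sketch below assumes per-example attack scores, membership labels, and true-label confidences are available; the helper names and the slice fraction are illustrative.

```python
import numpy as np

def mi_auc(scores, is_member):
    """Chance that a random member outscores a random non-member (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    pos, neg = scores[is_member], scores[~is_member]
    if pos.size == 0 or neg.size == 0:
        return float("nan")                      # AUC undefined for one class
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def auc_by_confidence(scores, is_member, true_label_conf, fraction=0.10):
    """MI-AUC on the lowest- and highest-confidence slices of equal size."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    order = np.argsort(np.asarray(true_label_conf, dtype=float), kind="stable")
    k = max(1, int(np.ceil(fraction * order.size)))
    low, high = order[:k], order[-k:]
    return {"low": mi_auc(scores[low], is_member[low]),
            "high": mi_auc(scores[high], is_member[high])}
```

If `auc_by_confidence(...)["high"]` came out at or above `["low"]` on the paper's data, the boundary hypothesis would be in trouble; the reverse is what the central claim predicts.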

Figures

Figures reproduced from arXiv: 2605.22373 by Adam Perer, Alexander Goldberg, Anthony Hughes, Nikolaos Aletras, Niloofar Mireshghallah, Prince Jha.

Figure 1: Overview of the threat model. An LLM provider deploys a safety classifier …
Figure 2: MIA performance on LiRA and boundary-targeted LiRA across model scales and training …
Figure 3: (Left) Boundary-targeted LiRA MI-AUC as a function of the classifier’s true-label confidence P_S(y_i | x_i), binned into deciles. Each line corresponds to a model. Error bars span the min and max across the two classifiers. (Right) Boundary-targeted LiRA MI-AUC across the harm categories assigned to BeaverTails. Bars show the mean MI-AUC averaged over all training regimes for each model. (Both) A dashed grey …
Figure 4: (Left) t-SNE projection of the fine-tuned classifier’s hidden-state representations (Llama-3.2-1-8B-IT under full fine-tuning on single-turn data). Red triangles denote boundary members (training set), blue circles denote boundary non-members, and grey points denote randomly sampled non-boundary examples. (Right) Privacy–utility trade-off under Laplace output perturbation. Each curve traces a single model …
Original abstract

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users who interact with large language models. Despite their necessity, these models are trained on sensitive datasets, including discussions of self-harm and mental health, raising important yet poorly understood privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer whether a given example was used to train a model. In this work, we hypothesize that the examples on which the classifier is least confident are especially informative for an adversary inferring membership. Low confidence reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples, which amplify the signal of an example's membership in the training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned to detect users who may require emotional support. This is 3.5 times more than attacking with state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, while existing noise strategies can effectively mitigate the susceptibility of these examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a boundary-targeted membership inference attack on safety classifiers for detecting user distress or need for emotional support. It hypothesizes that low-confidence predictions reflect localized memorization failures that can be exploited by selecting boundary examples. The central empirical claim is that this strategy recovers 19% of flagged conversations at a 5% false-positive rate on a fine-tuned classifier, achieving a 3.5× improvement over state-of-the-art MIA methods alone. The work also characterizes boundary examples and evaluates mitigation via content filtering (ineffective) and noise addition (effective).

Significance. If the reported lift holds after proper controls, the result would indicate that standard MIAs underestimate privacy leakage in safety classifiers trained on subjective, sensitive mental-health data, particularly near decision boundaries. This could motivate targeted regularization or auditing for such models. The paper does not ship machine-checked proofs or parameter-free derivations.

major comments (2)
  1. [Abstract] Abstract: The 19% recovery at 5% FPR and 3.5× lift are stated without any description of the experimental setup, dataset splits, baseline MIA implementations, or how the low-confidence subset was constructed, making it impossible to verify whether the performance supports the memorization hypothesis or reduces to a task-specific heuristic.
  2. [Abstract] Abstract and § (method description): The central hypothesis—that low-confidence examples indicate localized memorization failures rather than intrinsic label ambiguity or class overlap in distress detection—is not isolated by any control experiment comparing member enrichment in low- vs. high-confidence subsets while holding task properties fixed; without this, the 3.5× gain cannot be attributed to amplified membership signal.
minor comments (1)
  1. [Abstract] The abstract uses 'conversations a safety classifier flagged' without clarifying whether this refers to the training set or a held-out set, which affects interpretation of the recovery rate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need to better isolate the memorization hypothesis. We address each major comment below and indicate planned revisions.

  1. Referee: [Abstract] Abstract: The 19% recovery at 5% FPR and 3.5× lift are stated without any description of the experimental setup, dataset splits, baseline MIA implementations, or how the low-confidence subset was constructed, making it impossible to verify whether the performance supports the memorization hypothesis or reduces to a task-specific heuristic.

    Authors: We agree the abstract is too high-level for a result of this nature. In the revision we will expand it to briefly state the dataset (distress-flagged conversations), the fine-tuning setup, how the low-confidence boundary subset is selected (bottom 10% confidence on the training distribution), the exact baseline MIA implementations (LiRA and LOSS), and the 5% FPR operating point. Full experimental details remain in Sections 3–4. revision: yes

  2. Referee: [Abstract] Abstract and § (method description): The central hypothesis—that low-confidence examples indicate localized memorization failures rather than intrinsic label ambiguity or class overlap in distress detection—is not isolated by any control experiment comparing member enrichment in low- vs. high-confidence subsets while holding task properties fixed; without this, the 3.5× gain cannot be attributed to amplified membership signal.

    Authors: The existing experiments already compare the boundary-targeted attack against standard MIAs on the identical classifier and data distribution, isolating the contribution of the low-confidence selection. Nevertheless, we acknowledge that an explicit low- vs. high-confidence member-enrichment control (holding label distribution and task fixed) would further strengthen the claim. We will add this control experiment in the revision, reporting membership inference AUC on both subsets. revision: yes
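The two baselines named in the first response, LOSS and LiRA, can be sketched as scoring functions. This is a simplified illustration: the LOSS score is the negative cross-entropy on the true label, and the LiRA score is a Gaussian log-likelihood ratio over logit-scaled confidences from shadow models trained with ("in") and without ("out") the target example. The function names are hypothetical, and how the paper trains or counts its shadow models is not described on this page.

```python
import numpy as np

def loss_score(true_label_prob):
    """LOSS baseline: negative cross-entropy; higher suggests membership."""
    p = np.clip(np.asarray(true_label_prob, dtype=float), 1e-12, 1.0)
    return np.log(p)

def logit_scale(p):
    """Map a confidence in (0, 1) to log(p / (1 - p))."""
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1.0 - 1e-6)
    return np.log(p) - np.log1p(-p)

def _gauss_logpdf(x, mean, std):
    return -0.5 * np.log(2.0 * np.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def lira_score(target_prob, in_probs, out_probs):
    """Log-likelihood ratio of the target confidence under 'in' vs 'out' shadows."""
    phi = logit_scale(target_prob)
    phi_in, phi_out = logit_scale(in_probs), logit_scale(out_probs)
    # Small constant keeps the ratio finite when shadows agree exactly.
    std_in, std_out = phi_in.std() + 1e-6, phi_out.std() + 1e-6
    return float(_gauss_logpdf(phi, phi_in.mean(), std_in)
                 - _gauss_logpdf(phi, phi_out.mean(), std_out))

# Hypothetical shadow-model confidences for a single example.
in_probs = [0.95, 0.97, 0.96, 0.98]
out_probs = [0.60, 0.55, 0.65, 0.50]
```

A positive `lira_score` means the target model's confidence looks more like the shadows that saw the example; a boundary-targeted variant would apply the same score only to the low-confidence subset.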

Circularity Check

0 steps flagged

No circularity: empirical attack success is measured, not derived by construction


The paper presents a hypothesis about low-confidence examples and reports an empirical attack result (19% recovery at 5% FPR, 3.5× over SOTA MIA baselines) obtained by running the proposed boundary-targeted selection on a fine-tuned safety classifier. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The performance number is a direct experimental measurement on held-out data rather than a quantity that reduces to the hypothesis or to any input by definition. The interpretive claim that low confidence signals memorization is an assumption, not a load-bearing derivation step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; all fields left empty due to lack of detail.


