arxiv: 2511.22099 · v5 · submitted 2025-11-27 · 💻 cs.LG · cs.AI

Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs

Pith reviewed 2026-05-17 04:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords low-rank factorization · large language models · privacy · adversarial robustness · ethics · fairness · model compression · trustworthiness

The pith

Low-rank factorization preserves LLM training data privacy but weakens personally identifiable information safeguards while increasing adversarial robustness and decreasing fairness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the impact of low-rank factorization on the trustworthiness of large language models across four key dimensions. It shows that compression maintains privacy for the original training data yet makes conversational personal information more vulnerable to extraction. Adversarial robustness improves under compression, ethics suffer in zero-shot prompting but recover somewhat with examples, and fairness worsens. A sympathetic reader would care because low-rank models enable deployment in constrained environments, making their safety properties essential for responsible use.

Core claim

Low-rank factorization compresses LLMs to reduce their computation and memory consumption while maintaining accuracy. The study finds that this compression preserves training data privacy but weakens protection of personally identifiable information (PII) in conversations. Adversarial robustness generally increases with compression. Ethics degrades in zero-shot prompting but partially recovers in few-shot prompting. Fairness declines under compression. The analysis also covers the effects of model scale and fine-tuning, and uses gradient-based attribution to identify which layers contribute most to adversarial robustness.

What carries the argument

Low-rank factorization algorithms that decompose model weights into lower-rank approximations, evaluated on multiple LLMs with metrics for privacy leakage, adversarial attack success, ethical compliance, and fairness disparities, plus gradient attribution for internal analysis.
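
To make "decompose model weights into lower-rank approximations" concrete, here is a minimal sketch using plain truncated SVD on a single weight matrix. It is an illustration only: the function name `low_rank_factorize`, the matrix shape, and the rank are assumptions chosen for the example, and it does not reproduce FWSVD or Basel, the other methods named in the paper's figures.

```python
import numpy as np

def low_rank_factorize(weight: np.ndarray, rank: int):
    """Return factors (A, B) such that A @ B approximates `weight` at the given rank."""
    # Thin SVD: weight = U @ diag(S) @ Vt
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Keep the top-`rank` singular directions; fold singular values into A.
    a = u[:, :rank] * s[:rank]   # shape (m, rank)
    b = vt[:rank, :]             # shape (rank, n)
    return a, b

# Hypothetical 64 x 48 "layer weight", compressed to rank 8.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))
a, b = low_rank_factorize(w, rank=8)

rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
params_before = w.size           # m * n
params_after = a.size + b.size   # rank * (m + n)
print(f"params {params_before} -> {params_after}, relative error {rel_err:.3f}")
```

The layer then computes `x @ A @ B` instead of `x @ W`, which is where the memory and throughput savings come from when the rank is small relative to the matrix dimensions.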

If this is right

  • Low-rank compressed models can be used in settings where resistance to adversarial inputs is important.
  • Additional measures may be needed to protect personal information in dialogues with compressed models.
  • Few-shot prompting can be applied to improve ethical outputs in compressed LLMs.
  • Fairness testing becomes more critical when applying low-rank factorization to models.
  • Model scale and fine-tuning can be adjusted to influence trustworthiness outcomes.
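
As a rough illustration of the kind of fairness check the fourth point calls for, the sketch below computes a demographic parity gap for a base model and a compressed model on the same inputs. The metric choice, the function name `demographic_parity_difference`, and the toy predictions are all assumptions; this page does not state which fairness metric the paper uses.

```python
def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = {}
    for group in set(groups):
        members = [p for p, g in zip(predictions, groups) if g == group]
        rates[group] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())

# Hypothetical binary predictions (1 = positive outcome) for two groups.
groups = ["a"] * 4 + ["b"] * 4
base_preds = [1, 0, 1, 0, 1, 0, 1, 0]        # both groups: 50% positive
compressed_preds = [1, 1, 1, 0, 0, 0, 1, 0]  # group a: 75%, group b: 25%

base_gap = demographic_parity_difference(base_preds, groups)
compressed_gap = demographic_parity_difference(compressed_preds, groups)
print(base_gap, compressed_gap)
```

A larger gap after compression would be one signal of the fairness decline described above; a real audit would use the benchmark and metric from the paper itself.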

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These results suggest that low-rank methods could be combined with privacy-enhancing techniques to balance the trade-offs observed.
  • Similar studies on other compression techniques like quantization might reveal comparable patterns in trustworthiness.
  • Deployment guidelines for resource-constrained LLMs should incorporate specific checks for PII leakage and fairness.

Load-bearing premise

That the chosen evaluation metrics and prompting regimes accurately capture real-world privacy risks, ethical failures, and fairness violations, and that results from the tested model sizes and factorization algorithms generalize to other low-rank methods and deployment contexts.

What would settle it

A test showing that a low-rank compressed model leaks personally identifiable information at the same rate as its full-size counterpart would falsify the claim of weakened PII protection.
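
One way such a test could be framed is a two-proportion z-test on leak counts from the full-size and compressed models under identical prompts. This is a sketch under assumptions: the function name, the counts, and the choice of a pooled z-test are illustrative and are not taken from the paper.

```python
import math

def two_proportion_z_test(leaks_a, n_a, leaks_b, n_b):
    """Two-sided z-test for a difference between two leak rates."""
    p_a = leaks_a / n_a
    p_b = leaks_b / n_b
    pooled = (leaks_a + leaks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no leaks (or all leaks) in both samples: no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 10/100 prompts leak PII for the base model, 30/100 for the compressed one.
z, p_value = two_proportion_z_test(10, 100, 30, 100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A large p-value on a well-powered sample would be the outcome that undercuts the weakened-PII-protection claim; a small one with a higher compressed-model rate would support it.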

Figures

Figures reproduced from arXiv: 2511.22099 by Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li.

Figure 1: Illustrative examples of prompts and responses generated by the LLaMA2-13B model across three perspectives: (1) ...
Figure 3: Interaction between the black-box adversary, honest ...
Figure 2: Throughput and memory of LLaMA-2 7B and its low-rank compressed models using SVD, FWSVD, and Basel. Left ...
Figure 4: Training leakage of models across different context ...
Figure 5: Adversarial robustness of Base 7B, 13B and Base 13B ...
Figure 6: Leak rate and rejection rate of 7B and 13B Base and ...
Figure 8: Adversarial robustness of Base/Chat 7B and 13B ...
Figure 9: Model size vs. adversarial robustness accuracy for Base ...
Figure 7: Leak rate of personally identifiable information (PII) ...
Original abstract

Large language models (LLMs) have driven major advances across domains, yet their massive size hinders deployment in resource-constrained settings. Low-rank factorization addresses this challenge by compressing models to effectively reduce their computation and memory consumption while maintaining accuracy. While these compressed models boast benign performance and system-level advantages, their trustworthiness implications remain poorly understood. In this paper, we present the first comprehensive study of how low-rank factorization affects LLM trustworthiness across privacy, adversarial robustness, ethics, and fairness, complemented by an explainability-driven analysis of the internal mechanisms behind these trust-related changes. We evaluate multiple LLMs of different sizes and architectures compressed with various low-rank factorization algorithms, revealing key insights: (1) low-rank factorization preserves training data privacy but weakens the protection of personally identifiable information during conversations; (2) adversarial robustness is generally enhanced under compression; (3) ethics degrades in zero-shot prompting but partially recovers in few-shot prompting; (4) fairness declines under compression. Beyond compression, we investigate how model scale and fine-tuning affect trustworthiness. Additionally, to move beyond black-box analysis, we employ a gradient-based attribution to identify which layers of LLMs contribute most to adversarial robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to be the first comprehensive study of low-rank factorization's effects on LLM trustworthiness. It evaluates multiple LLMs compressed via various low-rank methods and reports four directional findings: (1) training-data privacy is preserved while PII protection in conversations weakens; (2) adversarial robustness generally improves; (3) ethical performance drops in zero-shot prompting but partially recovers in few-shot; (4) fairness declines under compression. The work also examines model scale and fine-tuning effects and uses gradient-based attribution to identify layers most responsible for robustness changes.

Significance. If the empirical patterns hold after addressing controls and generalization, the study would provide a valuable first map of trustworthiness trade-offs in compressed LLMs. The combination of multi-aspect evaluation with gradient-based layer attribution is a positive step beyond purely black-box measurements and could guide future compression-aware safety work.

major comments (3)
  1. [Section 4] Experimental setup (Section 4): The manuscript does not compare low-rank factorization against other capacity-reduction methods such as pruning or quantization. Without such controls it is impossible to determine whether the reported changes in privacy, robustness, ethics, and fairness are attributable to the low-rank structure itself or simply to reduced parameter count.
  2. [gradient-based attribution section] Gradient-based attribution analysis: No variance across random seeds or statistical significance tests are reported for the layer-attribution results. This weakens the claim that specific layers can be identified as primary contributors to the observed robustness gains.
  3. [privacy and fairness evaluation sections] Evaluation of privacy and fairness: The chosen metrics and prompting regimes are not shown to be robust to prompt sensitivity or to generalize beyond the tested model sizes and factorization heuristics. If effects are driven by the particular rank-selection procedure or benchmark choice, the directional conclusions do not hold.
minor comments (2)
  1. [methods] Clarify the exact low-rank dimension selection heuristic and report the resulting compression ratios for each model in a table.
  2. [related work] Add missing references to prior work on compression and trustworthiness (e.g., studies on quantized or pruned models).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the changes we will make in the revised version.

Point-by-point responses
  1. Referee: [Section 4] Experimental setup (Section 4): The manuscript does not compare low-rank factorization against other capacity-reduction methods such as pruning or quantization. Without such controls it is impossible to determine whether the reported changes in privacy, robustness, ethics, and fairness are attributable to the low-rank structure itself or simply to reduced parameter count.

    Authors: We agree that the absence of comparisons to other capacity-reduction techniques limits the ability to isolate effects specific to low-rank factorization. In the revised manuscript we will add experiments that apply pruning and quantization at comparable compression ratios and report the resulting changes in the same privacy, robustness, ethics, and fairness metrics. These controls will clarify whether the observed directional effects are attributable to the low-rank structure or to parameter reduction in general. revision: yes

  2. Referee: [gradient-based attribution section] Gradient-based attribution analysis: No variance across random seeds or statistical significance tests are reported for the layer-attribution results. This weakens the claim that specific layers can be identified as primary contributors to the observed robustness gains.

    Authors: The referee is correct that variance across seeds and statistical significance were not reported. We will re-run the gradient-based attribution analysis over multiple random seeds, include error bars, and add statistical significance tests for the identified layers in the updated manuscript to support the robustness-related claims. revision: yes

  3. Referee: [privacy and fairness evaluation sections] Evaluation of privacy and fairness: The chosen metrics and prompting regimes are not shown to be robust to prompt sensitivity or to generalize beyond the tested model sizes and factorization heuristics. If effects are driven by the particular rank-selection procedure or benchmark choice, the directional conclusions do not hold.

    Authors: We acknowledge that explicit checks for prompt sensitivity and broader generalization were not included. Although the original study already spans several model sizes and factorization heuristics, we will add prompt-variation experiments and results under alternative rank-selection procedures in the revision to demonstrate that the reported directional effects are not artifacts of the specific choices made. revision: partial
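
The seed-variance reporting discussed in the second response could be summarized along these lines. This is a minimal sketch: the function name `summarize_layer_attributions` and the score matrix are hypothetical, and it covers only the aggregation across seeds, not the gradient-based attribution itself.

```python
import numpy as np

def summarize_layer_attributions(scores: np.ndarray):
    """scores has shape (num_seeds, num_layers); returns per-layer mean, std, and standard error."""
    mean = scores.mean(axis=0)
    std = scores.std(axis=0, ddof=1)           # sample standard deviation across seeds
    stderr = std / np.sqrt(scores.shape[0])    # standard error of the mean
    return mean, std, stderr

# Hypothetical attribution scores: 3 seeds (rows) x 3 layers (columns).
scores = np.array([
    [0.10, 0.40, 0.20],
    [0.12, 0.38, 0.22],
    [0.08, 0.42, 0.18],
])
mean, std, stderr = summarize_layer_attributions(scores)
top_layer = int(np.argmax(mean))

for layer, (m, s) in enumerate(zip(mean, std)):
    print(f"layer {layer}: {m:.3f} +/- {s:.3f}")
```

If the top layer's interval overlaps heavily with the others, the claim that it is the primary contributor to robustness would be weak, which is the concern the referee raises.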

Circularity Check

0 steps flagged

No circularity: empirical measurements on compressed models

Full rationale

The paper reports direct experimental results from applying low-rank factorization to multiple LLMs and measuring outcomes on privacy, adversarial robustness, ethics, and fairness benchmarks. No mathematical derivations, parameter fits, or predictions are presented that reduce to the inputs by construction; all claims rest on observed performance differences under compression, scale, and prompting regimes. The analysis is self-contained against external benchmarks with no load-bearing self-citations, ansatzes, or uniqueness theorems invoked to justify the central findings.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Empirical study relying on standard LLM evaluation practices and existing models; no new theoretical axioms or invented entities introduced. Free parameters consist of experimental choices such as rank values and prompting strategies that are selected rather than derived.

free parameters (2)
  • low-rank dimension / compression ratio
    Selected per model and experiment to achieve target size reduction; directly influences measured trustworthiness changes.
  • prompting regime (zero-shot vs few-shot)
    Chosen to demonstrate partial recovery in ethics; affects reported outcomes.
axioms (1)
  • [domain assumption] Standard benchmarks and metrics for privacy leakage, adversarial robustness, ethics, and fairness accurately reflect the intended trustworthiness properties.
    Invoked implicitly when interpreting directional changes as meaningful for real-world trust.
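
The link between the first free parameter's two names, rank and compression ratio, is simple arithmetic: an m × n matrix stored as two rank-r factors holds r(m + n) parameters instead of mn. The sketch below turns a target parameter budget into a rank. The helper names and the 4096 × 4096 shape are illustrative assumptions, and the paper's actual rank-selection heuristic is not described on this page.

```python
def factorized_params(m: int, n: int, rank: int) -> int:
    """Parameters in an (m x rank) factor plus a (rank x n) factor."""
    return rank * (m + n)

def rank_for_budget(m: int, n: int, keep_fraction: float) -> int:
    """Largest rank whose factors use at most `keep_fraction` of the original m * n parameters."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    rank = int(keep_fraction * m * n / (m + n))
    # Clamp to a valid rank for an m x n matrix.
    return max(1, min(rank, min(m, n)))

# Illustrative square layer: keep half of the original parameters.
m, n = 4096, 4096
rank = rank_for_budget(m, n, keep_fraction=0.5)
print(rank, factorized_params(m, n, rank), m * n)
```

Note the break-even point: once the rank exceeds mn / (m + n), the two factors hold more parameters than the original matrix, so factorization no longer compresses anything.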

pith-pipeline@v0.9.0 · 5521 in / 1237 out tokens · 36292 ms · 2026-05-17T04:29:03.731108+00:00 · methodology

