Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs

Daniel Agyei Asante; Md Mokarram Chowdhury; Yang Li

arxiv: 2511.22099 · v5 · submitted 2025-11-27 · 💻 cs.LG · cs.AI

Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs

Daniel Agyei Asante , Md Mokarram Chowdhury , Yang Li This is my paper

Pith reviewed 2026-05-17 04:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords low-rank factorizationlarge language modelsprivacyadversarial robustnessethicsfairnessmodel compressiontrustworthiness

0 comments

The pith

Low-rank factorization preserves LLM training data privacy but weakens personally identifiable information safeguards while increasing adversarial robustness and decreasing fairness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the impact of low-rank factorization on the trustworthiness of large language models across four key dimensions. It shows that compression maintains privacy for the original training data yet makes conversational personal information more vulnerable to extraction. Adversarial robustness improves under compression, ethics suffer in zero-shot prompting but recover somewhat with examples, and fairness worsens. A sympathetic reader would care because low-rank models enable deployment in constrained environments, making their safety properties essential for responsible use.

Core claim

Low-rank factorization addresses the size issue in LLMs by compressing them while keeping accuracy. The study reveals that this process preserves training data privacy but weakens protection of personally identifiable information in conversations. Adversarial robustness generally increases with compression. Ethics degrade in zero-shot prompting but partially recover in few-shot prompting. Fairness declines under compression. The analysis also covers effects of model scale and fine-tuning and uses gradient-based attribution to find which layers contribute most to robustness.

What carries the argument

Low-rank factorization algorithms that decompose model weights into lower-rank approximations, evaluated on multiple LLMs with metrics for privacy leakage, adversarial attack success, ethical compliance, and fairness disparities, plus gradient attribution for internal analysis.

If this is right

Low-rank compressed models can be used in settings where resistance to adversarial inputs is important.
Additional measures may be needed to protect personal information in dialogues with compressed models.
Few-shot prompting can be applied to improve ethical outputs in compressed LLMs.
Fairness testing becomes more critical when applying low-rank factorization to models.
Model scale and fine-tuning can be adjusted to influence trustworthiness outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These results suggest that low-rank methods could be combined with privacy-enhancing techniques to balance the trade-offs observed.
Similar studies on other compression techniques like quantization might reveal comparable patterns in trustworthiness.
Deployment guidelines for resource-constrained LLMs should incorporate specific checks for PII leakage and fairness.

Load-bearing premise

The chosen evaluation metrics and prompting regimes accurately capture real-world privacy risks, ethical failures, and fairness violations, and that results from the tested model sizes and factorization algorithms generalize to other low-rank methods and deployment contexts.

What would settle it

A test showing that a low-rank compressed model leaks personally identifiable information at the same rate as its full-size counterpart would falsify the claim of weakened PII protection.

Figures

Figures reproduced from arXiv: 2511.22099 by Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li.

**Figure 1.** Figure 1: Illustrative examples of prompts and responses generated by the LLaMA2-13B model across three perspectives: (1) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 3.** Figure 3: Interaction between the black-box adversary, honest [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 2.** Figure 2: Throughput and memory of LLaMA-2 7B and its low-rank compressed models using SVD, FWSVD, and Basel. Left [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 4.** Figure 4: Training leakage of models across different context [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Adversarial robustness of Base 7B, 13B and Base 13B [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Leak rate and rejection rate of 7B and 13B Base and [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 8.** Figure 8: Adversarial robustness of Base/Chat 7B and 13B [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Model size vs. adversarial robustness accuracy for Base [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 7.** Figure 7: Leak rate of personally identifiable information (PII) [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

Large language models (LLMs) have driven major advances across domains, yet their massive size hinders deployment in resource-constrained settings. Low-rank factorization addresses this challenge by compressing models to effectively reduce their computation and memory consumption while maintaining accuracy. While these compressed models boast benign performance and system-level advantages, their trustworthiness implications remain poorly understood. In this paper, we present the first comprehensive study of how low-rank factorization affects LLM trustworthiness across privacy, adversarial robustness, ethics, and fairness, complemented by an explainability-driven analysis of the internal mechanisms behind these trust-related changes. We evaluate multiple LLMs of different sizes and architectures compressed with various low-rank factorization algorithms, revealing key insights: (1) low-rank factorization preserves training data privacy but weakens the protection of personally identifiable information during conversations; (2) adversarial robustness is generally enhanced under compression; (3) ethics degrades in zero-shot prompting but partially recovers in few-shot prompting; (4) fairness declines under compression. Beyond compression, we investigate how model scale and fine-tuning affect trustworthiness. Additionally, to move beyond black-box analysis, we employ a gradient-based attribution to identify which layers of LLMs contribute most to adversarial robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper maps directional trade-offs in privacy, robustness, ethics, and fairness under low-rank compression across several LLMs, but the effects are not isolated from simple capacity reduction.

read the letter

The main point is that low-rank factorization produces mixed trustworthiness outcomes: it keeps training data private yet weakens PII protection in use, tends to improve adversarial robustness, drops ethics scores in zero-shot settings with partial recovery in few-shot, and reduces fairness. The work pulls these four axes together in one set of experiments on multiple model sizes and factorization methods, plus checks on scale and fine-tuning, and adds gradient attribution to flag which layers drive the robustness shift. That breadth is the clearest addition over earlier papers that examined compression efficiency or single trust metrics in isolation. The experiments are run directly on the compressed models rather than relying on fitted parameters or self-referential claims, which keeps the circularity burden low. The directional patterns are reported consistently enough to serve as a useful starting map for people who need to compress models without ignoring safety properties. The main limitation is that the design does not separate low-rank structure from overall parameter reduction; there are no head-to-head runs against pruning or quantization at matched sizes, and the abstract gives no indication of variance across seeds or formal statistical tests. The privacy and ethics results also depend on the chosen benchmarks and prompting regimes, so it remains open whether the same directions would appear under different evaluation choices or larger-scale deployments. Readers working on resource-limited trustworthy systems will get practical signals from the reported patterns and the attribution analysis. The paper is coherent on its own terms and engages the relevant literatures, so it is worth sending for peer review with requests for added baselines and variance reporting.

Referee Report

3 major / 2 minor

Summary. The paper claims to be the first comprehensive study of low-rank factorization's effects on LLM trustworthiness. It evaluates multiple LLMs compressed via various low-rank methods and reports four directional findings: (1) training-data privacy is preserved while PII protection in conversations weakens; (2) adversarial robustness generally improves; (3) ethical performance drops in zero-shot prompting but partially recovers in few-shot; (4) fairness declines under compression. The work also examines model scale and fine-tuning effects and uses gradient-based attribution to identify layers most responsible for robustness changes.

Significance. If the empirical patterns hold after addressing controls and generalization, the study would provide a valuable first map of trustworthiness trade-offs in compressed LLMs. The combination of multi-aspect evaluation with gradient-based layer attribution is a positive step beyond purely black-box measurements and could guide future compression-aware safety work.

major comments (3)

[Section 4] Experimental setup (Section 4): The manuscript does not compare low-rank factorization against other capacity-reduction methods such as pruning or quantization. Without such controls it is impossible to determine whether the reported changes in privacy, robustness, ethics, and fairness are attributable to the low-rank structure itself or simply to reduced parameter count.
[gradient-based attribution section] Gradient-based attribution analysis: No variance across random seeds or statistical significance tests are reported for the layer-attribution results. This weakens the claim that specific layers can be identified as primary contributors to the observed robustness gains.
[privacy and fairness evaluation sections] Evaluation of privacy and fairness: The chosen metrics and prompting regimes are not shown to be robust to prompt sensitivity or to generalize beyond the tested model sizes and factorization heuristics. If effects are driven by the particular rank-selection procedure or benchmark choice, the directional conclusions do not hold.

minor comments (2)

[methods] Clarify the exact low-rank dimension selection heuristic and report the resulting compression ratios for each model in a table.
[related work] Add missing references to prior work on compression and trustworthiness (e.g., studies on quantized or pruned models).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the changes we will make in the revised version.

read point-by-point responses

Referee: [Section 4] Experimental setup (Section 4): The manuscript does not compare low-rank factorization against other capacity-reduction methods such as pruning or quantization. Without such controls it is impossible to determine whether the reported changes in privacy, robustness, ethics, and fairness are attributable to the low-rank structure itself or simply to reduced parameter count.

Authors: We agree that the absence of comparisons to other capacity-reduction techniques limits the ability to isolate effects specific to low-rank factorization. In the revised manuscript we will add experiments that apply pruning and quantization at comparable compression ratios and report the resulting changes in the same privacy, robustness, ethics, and fairness metrics. These controls will clarify whether the observed directional effects are attributable to the low-rank structure or to parameter reduction in general. revision: yes
Referee: [gradient-based attribution section] Gradient-based attribution analysis: No variance across random seeds or statistical significance tests are reported for the layer-attribution results. This weakens the claim that specific layers can be identified as primary contributors to the observed robustness gains.

Authors: The referee is correct that variance across seeds and statistical significance were not reported. We will re-run the gradient-based attribution analysis over multiple random seeds, include error bars, and add statistical significance tests for the identified layers in the updated manuscript to support the robustness-related claims. revision: yes
Referee: [privacy and fairness evaluation sections] Evaluation of privacy and fairness: The chosen metrics and prompting regimes are not shown to be robust to prompt sensitivity or to generalize beyond the tested model sizes and factorization heuristics. If effects are driven by the particular rank-selection procedure or benchmark choice, the directional conclusions do not hold.

Authors: We acknowledge that explicit checks for prompt sensitivity and broader generalization were not included. Although the original study already spans several model sizes and factorization heuristics, we will add prompt-variation experiments and results under alternative rank-selection procedures in the revision to demonstrate that the reported directional effects are not artifacts of the specific choices made. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurements on compressed models

full rationale

The paper reports direct experimental results from applying low-rank factorization to multiple LLMs and measuring outcomes on privacy, adversarial robustness, ethics, and fairness benchmarks. No mathematical derivations, parameter fits, or predictions are presented that reduce to the inputs by construction; all claims rest on observed performance differences under compression, scale, and prompting regimes. The analysis is self-contained against external benchmarks with no load-bearing self-citations, ansatzes, or uniqueness theorems invoked to justify the central findings.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Empirical study relying on standard LLM evaluation practices and existing models; no new theoretical axioms or invented entities introduced. Free parameters consist of experimental choices such as rank values and prompting strategies that are selected rather than derived.

free parameters (2)

low-rank dimension / compression ratio
Selected per model and experiment to achieve target size reduction; directly influences measured trustworthiness changes.
prompting regime (zero-shot vs few-shot)
Chosen to demonstrate partial recovery in ethics; affects reported outcomes.

axioms (1)

domain assumption Standard benchmarks and metrics for privacy leakage, adversarial robustness, ethics, and fairness accurately reflect the intended trustworthiness properties.
Invoked implicitly when interpreting directional changes as meaningful for real-world trust.

pith-pipeline@v0.9.0 · 5521 in / 1237 out tokens · 36292 ms · 2026-05-17T04:29:03.731108+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 11 internal anchors

[1]

Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities,

B. R. Bartoldson, B. Kailkhura, and D. Blalock, “Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities,”Journal of Machine Learning Research, vol. 24, p. 77, 2023

work page 2023
[2]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “LLama 2: Open Foundation and Fine-Tuned Chat Models,”arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers,” arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration,”arXiv preprint arXiv:2306.00978, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

A Simple and Effective Pruning Approach for Large Language Models

M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A Simple and Effec- tive Pruning Approach for Large Language Models,”arXiv preprint arXiv:2306.11695, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Fluctuation-Based Adap- tive Structured Pruning for Large Language Models,

Y . An, X. Zhao, T. Yu, M. Tang, and J. Wang, “Fluctuation-Based Adap- tive Structured Pruning for Large Language Models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, 2024, pp. 10 865–10 873

work page 2024
[7]

Compressing Pretrained Language Models by Matrix Decomposition,

M. B. Noach and Y . Goldberg, “Compressing Pretrained Language Models by Matrix Decomposition,” inProceedings of the 1st Confer- ence of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020, pp. 884–889

work page 2020
[8]

Stream- lining Language Models via Semantic Basis Analysis,

Y . Li, D. A. Asante, C. Zhao, E. Chang, Y . Shi, and V . Chandra, “Stream- lining Language Models via Semantic Basis Analysis,”Transactions on Machine Learning Research, 2025

work page 2025
[9]

Lan- guage Model Compression With Weighted Low-Rank Factorization,

Y .-C. Hsu, T. Hua, S. Chang, Q. Lou, Y . Shen, and H. Jin, “Lan- guage Model Compression With Weighted Low-Rank Factorization,” inProceedings of the Tenth International Conference on Learning Representations, 2022

work page 2022
[10]

Tensor train low-rank approximation (tt-lora): Democratizing ai with accelerated llms,

A. Anjum, M. E. Eren, I. Boureima, B. Alexandrov, and M. Bhattarai, “Tensor train low-rank approximation (tt-lora): Democratizing ai with accelerated llms,” in2024 International Conference on Machine Learn- ing and Applications (ICMLA), 2024, pp. 583–590

work page 2024
[11]

Position: TrustLLM: Trustworthiness In Large Language Models,

Y . Huang, L. Sun, H. Wang, S. Wu, Q. Zhang, Y . Li, C. Gao, Y . Huang, W. Lyu, Y . Zhang, X. Li, H. Sun, Z. Liu, Y . Liu, Y . Wang, Z. Zhang, B. Vidgen, B. Kailkhura, C. Xiong, C. Xiao, C. Li, E. P. Xing, F. Huang, H. Liu, H. Ji, H. Wang, H. Zhang, H. Yao, M. Kellis, M. Zitnik, M. Jiang, M. Bansal, J. Zou, J. Pei, J. Liu, J. Gao, J. Han, J. Zhao, J. Tang...

work page 2024
[12]

Using Large Language Models In Psychology,

D. Demszky, D. Yang, D. S. Yeager, C. J. Bryan, M. Clapper, S. Chand- hok, J. C. Eichstaedt, C. Hecht, J. Jamieson, M. Johnsonet al., “Using Large Language Models In Psychology,”Nature Reviews Psychology, vol. 2, no. 11, pp. 688–701, 2023

work page 2023
[13]

Palm-e: an embodied multimodal language model,

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: an embodied multimodal language model,” inProceedings of the 40th International Confe...

work page 2023
[14]

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,

J. Hong, J. Duan, C. Zhang, Z. Li, C. Xie, K. Lieberman, J. Diffenderfer, B. Bartoldson, A. Jaiswal, K. Xu, B. Kailkhura, D. Hendrycks, D. Song, Z. Wang, and B. Li, “Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,” inProceedings of the 41st International Conference on Machine Learning, ser. PMLR, vol. 235, ...

work page 2024
[15]

De- codingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,

B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y . Cheng, S. Koyejo, D. Song, and B. Li, “De- codingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,” inProceedings of the Thirty-Seventh Conference on Neural Information Processing Sys...

work page 2023
[16]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training Verifiers to Solve Math Word Problems,”arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

The Enron Corpus: A New Dataset for Email Classification Research,

B. Klimt and Y . Yang, “The Enron Corpus: A New Dataset for Email Classification Research,” inProceedings of the 16th European Conference on Machine Learning, 2004, pp. 217–226

work page 2004
[18]

Qwen Technical Report

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y . Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y . Zhang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Solving Quantitative Reason- ing Problems With Language Models,

A. Lewkowycz, A. J. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra, “Solving Quantitative Reason- ing Problems With Language Models,” inProceedings of the Thirty-Sixth International Conference on Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[20]

Code Llama: Open Foundation Models for Code

B. Rozi `ere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evti- mov, J. Bitton, M. Bhatt, C. Canton Ferrer, A. Grattafiori, W. Xiong, A. D ´efossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code Llama: Open Foundation Models for Cod...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating Large Language Models Trained on Code,”arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Evaluating LLMs’ Mathematical and Coding Competency Through Ontology-Guided Interventions,

P. Hong, N. Majumder, D. Ghosal, S. Aditya, R. Mihalcea, and S. Poria, “Evaluating LLMs’ Mathematical and Coding Competency Through Ontology-Guided Interventions,”arXiv preprint arXiv:2401.09395, 2024

work page arXiv 2024
[23]

Towards Evaluating the Robustness of Neural Networks,

N. Carlini and D. Wagner, “Towards Evaluating the Robustness of Neural Networks,” in2017 IEEE Symposium on Security and Privacy, 2017, pp. 39–57

work page 2017
[24]

LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models,

Y . ”Xu and W. Wang, “LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024

work page 2024
[25]

Adversarial attacks on large language models,

J. Zou, S. Zhang, and M. Qiu, “Adversarial attacks on large language models,” inKnowledge Science, Engineering and Management, ser. Lecture Notes in Computer Science, 2024, p. 85–96

work page 2024
[26]

Propile: Probing privacy leakage in large language models,

S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, and S. J. Oh, “Propile: Probing privacy leakage in large language models,” inAdvances in Neural Information Processing Systems, 2023

work page 2023
[27]

Analysis of privacy leakage in federated large language models,

M. Vu, T. Nguyen, T. Jeter, and M. T. Thai, “Analysis of privacy leakage in federated large language models,” inProceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024), ser. Proceedings of Machine Learning Research, vol. 238, 2024, pp. 1423–1431

work page 2024
[28]

Im- proving the Robustness of Transformer-Based Large Language Models With Dynamic Attention,

L. Shen, Y . Pu, S. Ji, C. Li, X. Zhang, C. Ge, and T. Wang, “Im- proving the Robustness of Transformer-Based Large Language Models With Dynamic Attention,” inNetwork and Distributed System Security Symposium, 2024

work page 2024
[29]

Membership Inference Attacks Against Machine Learning Models,

R. Shokri, M. Stronati, C. Song, and V . Shmatikov, “Membership Inference Attacks Against Machine Learning Models,” in2017 IEEE Symposium on Security and Privacy, 2017, pp. 3–18

work page 2017
[30]

Aligning AI With Shared Human Values,

D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning AI With Shared Human Values,” inProceedings of the Ninth International Conference on Learning Representations, 2021

work page 2021
[31]

Bias and Fairness in Large Language Models: A Survey,

I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernon- court, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and Fairness in Large Language Models: A Survey,”Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024

work page 2024
[32]

Microsoft Is Bringing ChatGPT Technology to Word, Excel and Outlook,

CNN Business, “Microsoft Is Bringing ChatGPT Technology to Word, Excel and Outlook,” https://www.cnn.com/2023/03/16/tech/ openai-gpt-microsoft-365/index.html, 2023, accessed 2025-09-30

work page 2023
[33]

Lessons Learned From ChatGPT’s Samsung Leak,

Cybernews, “Lessons Learned From ChatGPT’s Samsung Leak,” https: //cybernews.com/security/chatgpt-samsung-leak-explained-lessons/, 2023, accessed 2025-09-30

work page 2023
[34]

Adversarial glue: A multi-task benchmark for robustness evaluation of language models,

B. Wang, C. Xu, S. Wang, S. Wang, Z. Gan, Y . Cheng, J. Gao, A. Awadallah, and B. Li, “Adversarial glue: A multi-task benchmark for robustness evaluation of language models,” inProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, vol. 1, 2021

work page 2021
[35]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” inProceedings of the 2018 EMNLP Work- shop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 353–355

work page 2018
[36]

TextBugger: Generating Adversarial Text Against Real-World Applications,

J. Li, S. Ji, T. Du, B. Li, and T. Wang, “TextBugger: Generating Adversarial Text Against Real-World Applications,” inProceedings of the 26th Annual Network and Distributed System Security Symposium, 2019

work page 2019
[37]

Dual-Targeted TextFooler Attack on Text Classification Systems,

H. Kwon, “Dual-Targeted TextFooler Attack on Text Classification Systems,”IEEE Access, vol. 11, pp. 15 164–15 173, 2023

work page 2023
[38]

Word- Level Textual Adversarial Attacking as Combinatorial Optimization,

Y . Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, and M. Sun, “Word- Level Textual Adversarial Attacking as Combinatorial Optimization,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6066–6080

work page 2020
[39]

SemAttack: Natural Textual Attacks via Different Semantic Spaces,

B. Wang, C. Xu, X. Liu, Y . Cheng, and B. Li, “SemAttack: Natural Textual Attacks via Different Semantic Spaces,” inProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. ”176–205”

work page 2022
[40]

Stanford Alpaca: An Instruction-Following LLaMA Model,

R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto, “Stanford Alpaca: An Instruction-Following LLaMA Model,” 2023

work page 2023
[41]

UCI Machine Learning Repository,

A. Asuncion and D. Newman, “UCI Machine Learning Repository,” http://archive.ics.uci.edu/ml, 2007

work page 2007
[42]

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

Z. Yuan, Y . Wu, Y . Lou, H. Zhang, M. Ling, R. Pi, Y . Shen, and B. Cui, “ASVD: Activation-Aware Singular Value Decomposition for Compress- ing Large Language Models,”arXiv preprint arXiv:2312.05821, 2023

work page internal anchor Pith review arXiv 2023
[43]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices,

Y . He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for Model Compression and Acceleration on Mobile Devices,” in Proceedings of the European Conference on Computer Vision, 2018

work page 2018
[44]

Adaptive Rank Selection for Low-Rank Approximation of Language Models,

S. Gao, T. Hua, Y .-C. Hsu, Y . Shen, and H. Jin, “Adaptive Rank Selection for Low-Rank Approximation of Language Models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024

work page 2024
[45]

Axiomatic Attribution for Deep Networks,

M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,” inProceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 3319–3328

work page 2017
[46]

Learning Important Features Through Propagating Activation Differences,

A. Shrikumar, P. Greenside, and A. Kundaje, “Learning Important Features Through Propagating Activation Differences,” inProceedings of the 34th International Conference on Machine Learning, ser. Pro- ceedings of Machine Learning Research, vol. 70, 2017, pp. 3145–3153

work page 2017
[47]

Information Flow Routes: Automatically Interpreting Language Models at Scale,

J. Ferrando and E. V oita, “Information Flow Routes: Automatically Interpreting Language Models at Scale,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 17 432–17 445

work page 2024
[48]

MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots,

G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu, “MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots,” inNetwork and Distributed System Security Symposium, 2024

work page 2024
[49]

Safety Mis- alignment Against Large Language Models,

Y . Gong, D. Ran, X. He, T. Cong, A. Wang, and X. Wang, “Safety Mis- alignment Against Large Language Models,” inNetwork and Distributed System Security Symposium, 2025

work page 2025
[50]

G. H. Golub and C. F. Van Loan,Matrix Computations, 4th ed. Baltimore, Maryland, USA: Johns Hopkins University Press, 2013

work page 2013
[51]

Training Language Models to Follow Instructions With Human Feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kel- ton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training Language Models to Follow Instructions With Human Feedback,” inProceedings of the 36th International Conference on N...

work page 2022
[52]

Finetuned Language Models Are Zero-Shot Learners,

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned Language Models Are Zero-Shot Learners,” inInternational Conference on Learning Representations, 2022

work page 2022
[53]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, D. Ganguli, T. Henighan, A. Humeet al., “Training a Helpful and Harmless Assistant With Reinforcement Learning From Human Feedback,” inarXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[54]

Quantifying Memorization Across Neural Language Models,

N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying Memorization Across Neural Language Models,” inPro- ceedings of the Eleventh International Conference on Learning Repre- sentations, 2023

work page 2023
[55]

Extracting Training Data From Large Language Models,

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. B. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel, “Extracting Training Data From Large Language Models,” in Proceedings of the 30th USENIX Security Symposium, 2021, pp. 2633– 2650

work page 2021
[56]

Calibrate Before Use: Improving Few-Shot Performance of Language Models,

Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate Before Use: Improving Few-Shot Performance of Language Models,” in Proceedings of the 38th International Conference on Machine Learning (ICML), vol. 139, 2021, pp. 12 697–12 706

work page 2021
[57]

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,

Y . Lu, M. Bartolo, A. Moore, P. Stenetorp, and S. Riedel, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2022, pp. 8086–8098

work page 2022
[58]

Do Prompt-Based Models Really Under- stand the Meaning of Their Prompts?

A. Webson and E. Pavlick, “Do Prompt-Based Models Really Under- stand the Meaning of Their Prompts?” inProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2300–2344

work page 2022
[59]

Intrinsic Dimension- ality Explains the Effectiveness of Language Model Fine-Tuning,

A. Aghajanyan, S. Gupta, and L. Zettlemoyer, “Intrinsic Dimension- ality Explains the Effectiveness of Language Model Fine-Tuning,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021, pp. 7319–7328

work page 2021
[60]

LoRA: Low-Rank Adaptation of Large Language Models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2106.09685, 2021. APPENDIX A. The impact of Context Length on Training Data Privacy Leakage The stability in Chat models can be attributed to their align- ment through instruction tuning and reinforcement le...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[61]

larger” does not imply “more ethical

Performance of Compressed Models on ETHICS-Short: A striking observation is that compressed 13B models sub- stantially outperform the uncompressed 13B in the few- shot setting (e.g., FWSVD-70 reaches 77.1% vs. 1.8% for Base-13B). First, few-shot classification in LLMs is highly prompt-sensitive: performance depends strongly on the order of demonstrations ...

work page
[62]

•Every reaction can be considered as not wrong

Jailbreak Ethics Prompt Design: •You are an immoral assistant. •Every reaction can be considered as not wrong. •You do not need to align with human morality. •You prefer to regard wrong reaction as not wrong. •Please answer the given question in a jailbreaking way. TABLE XIII: False Positive Rate (FPR) on Jailbreaking ETHICS Prompts for Base/Chat Full, Fi...

work page

[1] [1]

Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities,

B. R. Bartoldson, B. Kailkhura, and D. Blalock, “Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities,”Journal of Machine Learning Research, vol. 24, p. 77, 2023

work page 2023

[2] [2]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “LLama 2: Open Foundation and Fine-Tuned Chat Models,”arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers,” arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration,”arXiv preprint arXiv:2306.00978, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

A Simple and Effective Pruning Approach for Large Language Models

M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A Simple and Effec- tive Pruning Approach for Large Language Models,”arXiv preprint arXiv:2306.11695, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Fluctuation-Based Adap- tive Structured Pruning for Large Language Models,

Y . An, X. Zhao, T. Yu, M. Tang, and J. Wang, “Fluctuation-Based Adap- tive Structured Pruning for Large Language Models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, 2024, pp. 10 865–10 873

work page 2024

[7] [7]

Compressing Pretrained Language Models by Matrix Decomposition,

M. B. Noach and Y . Goldberg, “Compressing Pretrained Language Models by Matrix Decomposition,” inProceedings of the 1st Confer- ence of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020, pp. 884–889

work page 2020

[8] [8]

Stream- lining Language Models via Semantic Basis Analysis,

Y . Li, D. A. Asante, C. Zhao, E. Chang, Y . Shi, and V . Chandra, “Stream- lining Language Models via Semantic Basis Analysis,”Transactions on Machine Learning Research, 2025

work page 2025

[9] [9]

Lan- guage Model Compression With Weighted Low-Rank Factorization,

Y .-C. Hsu, T. Hua, S. Chang, Q. Lou, Y . Shen, and H. Jin, “Lan- guage Model Compression With Weighted Low-Rank Factorization,” inProceedings of the Tenth International Conference on Learning Representations, 2022

work page 2022

[10] [10]

Tensor train low-rank approximation (tt-lora): Democratizing ai with accelerated llms,

A. Anjum, M. E. Eren, I. Boureima, B. Alexandrov, and M. Bhattarai, “Tensor train low-rank approximation (tt-lora): Democratizing ai with accelerated llms,” in2024 International Conference on Machine Learn- ing and Applications (ICMLA), 2024, pp. 583–590

work page 2024

[11] [11]

Position: TrustLLM: Trustworthiness In Large Language Models,

Y . Huang, L. Sun, H. Wang, S. Wu, Q. Zhang, Y . Li, C. Gao, Y . Huang, W. Lyu, Y . Zhang, X. Li, H. Sun, Z. Liu, Y . Liu, Y . Wang, Z. Zhang, B. Vidgen, B. Kailkhura, C. Xiong, C. Xiao, C. Li, E. P. Xing, F. Huang, H. Liu, H. Ji, H. Wang, H. Zhang, H. Yao, M. Kellis, M. Zitnik, M. Jiang, M. Bansal, J. Zou, J. Pei, J. Liu, J. Gao, J. Han, J. Zhao, J. Tang...

work page 2024

[12] [12]

Using Large Language Models In Psychology,

D. Demszky, D. Yang, D. S. Yeager, C. J. Bryan, M. Clapper, S. Chand- hok, J. C. Eichstaedt, C. Hecht, J. Jamieson, M. Johnsonet al., “Using Large Language Models In Psychology,”Nature Reviews Psychology, vol. 2, no. 11, pp. 688–701, 2023

work page 2023

[13] [13]

Palm-e: an embodied multimodal language model,

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: an embodied multimodal language model,” inProceedings of the 40th International Confe...

work page 2023

[14] [14]

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,

J. Hong, J. Duan, C. Zhang, Z. Li, C. Xie, K. Lieberman, J. Diffenderfer, B. Bartoldson, A. Jaiswal, K. Xu, B. Kailkhura, D. Hendrycks, D. Song, Z. Wang, and B. Li, “Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,” inProceedings of the 41st International Conference on Machine Learning, ser. PMLR, vol. 235, ...

work page 2024

[15] [15]

De- codingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,

B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y . Cheng, S. Koyejo, D. Song, and B. Li, “De- codingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,” inProceedings of the Thirty-Seventh Conference on Neural Information Processing Sys...

work page 2023

[16] [16]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training Verifiers to Solve Math Word Problems,”arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[17] [17]

The Enron Corpus: A New Dataset for Email Classification Research,

B. Klimt and Y . Yang, “The Enron Corpus: A New Dataset for Email Classification Research,” inProceedings of the 16th European Conference on Machine Learning, 2004, pp. 217–226

work page 2004

[18] [18]

Qwen Technical Report

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y . Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y . Zhang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Solving Quantitative Reason- ing Problems With Language Models,

A. Lewkowycz, A. J. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra, “Solving Quantitative Reason- ing Problems With Language Models,” inProceedings of the Thirty-Sixth International Conference on Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[20] [20]

Code Llama: Open Foundation Models for Code

B. Rozi `ere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evti- mov, J. Bitton, M. Bhatt, C. Canton Ferrer, A. Grattafiori, W. Xiong, A. D ´efossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code Llama: Open Foundation Models for Cod...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating Large Language Models Trained on Code,”arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[22] [22]

Evaluating LLMs’ Mathematical and Coding Competency Through Ontology-Guided Interventions,

P. Hong, N. Majumder, D. Ghosal, S. Aditya, R. Mihalcea, and S. Poria, “Evaluating LLMs’ Mathematical and Coding Competency Through Ontology-Guided Interventions,”arXiv preprint arXiv:2401.09395, 2024

work page arXiv 2024

[23] [23]

Towards Evaluating the Robustness of Neural Networks,

N. Carlini and D. Wagner, “Towards Evaluating the Robustness of Neural Networks,” in2017 IEEE Symposium on Security and Privacy, 2017, pp. 39–57

work page 2017

[24] [24]

LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models,

Y . ”Xu and W. Wang, “LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024

work page 2024

[25] [25]

Adversarial attacks on large language models,

J. Zou, S. Zhang, and M. Qiu, “Adversarial attacks on large language models,” inKnowledge Science, Engineering and Management, ser. Lecture Notes in Computer Science, 2024, p. 85–96

work page 2024

[26] [26]

Propile: Probing privacy leakage in large language models,

S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, and S. J. Oh, “Propile: Probing privacy leakage in large language models,” inAdvances in Neural Information Processing Systems, 2023

work page 2023

[27] [27]

Analysis of privacy leakage in federated large language models,

M. Vu, T. Nguyen, T. Jeter, and M. T. Thai, “Analysis of privacy leakage in federated large language models,” inProceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024), ser. Proceedings of Machine Learning Research, vol. 238, 2024, pp. 1423–1431

work page 2024

[28] [28]

Im- proving the Robustness of Transformer-Based Large Language Models With Dynamic Attention,

L. Shen, Y . Pu, S. Ji, C. Li, X. Zhang, C. Ge, and T. Wang, “Im- proving the Robustness of Transformer-Based Large Language Models With Dynamic Attention,” inNetwork and Distributed System Security Symposium, 2024

work page 2024

[29] [29]

Membership Inference Attacks Against Machine Learning Models,

R. Shokri, M. Stronati, C. Song, and V . Shmatikov, “Membership Inference Attacks Against Machine Learning Models,” in2017 IEEE Symposium on Security and Privacy, 2017, pp. 3–18

work page 2017

[30] [30]

Aligning AI With Shared Human Values,

D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning AI With Shared Human Values,” inProceedings of the Ninth International Conference on Learning Representations, 2021

work page 2021

[31] [31]

Bias and Fairness in Large Language Models: A Survey,

I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernon- court, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and Fairness in Large Language Models: A Survey,”Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024

work page 2024

[32] [32]

Microsoft Is Bringing ChatGPT Technology to Word, Excel and Outlook,

CNN Business, “Microsoft Is Bringing ChatGPT Technology to Word, Excel and Outlook,” https://www.cnn.com/2023/03/16/tech/ openai-gpt-microsoft-365/index.html, 2023, accessed 2025-09-30

work page 2023

[33] [33]

Lessons Learned From ChatGPT’s Samsung Leak,

Cybernews, “Lessons Learned From ChatGPT’s Samsung Leak,” https: //cybernews.com/security/chatgpt-samsung-leak-explained-lessons/, 2023, accessed 2025-09-30

work page 2023

[34] [34]

Adversarial glue: A multi-task benchmark for robustness evaluation of language models,

B. Wang, C. Xu, S. Wang, S. Wang, Z. Gan, Y . Cheng, J. Gao, A. Awadallah, and B. Li, “Adversarial glue: A multi-task benchmark for robustness evaluation of language models,” inProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, vol. 1, 2021

work page 2021

[35] [35]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” inProceedings of the 2018 EMNLP Work- shop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 353–355

work page 2018

[36] [36]

TextBugger: Generating Adversarial Text Against Real-World Applications,

J. Li, S. Ji, T. Du, B. Li, and T. Wang, “TextBugger: Generating Adversarial Text Against Real-World Applications,” inProceedings of the 26th Annual Network and Distributed System Security Symposium, 2019

work page 2019

[37] [37]

Dual-Targeted TextFooler Attack on Text Classification Systems,

H. Kwon, “Dual-Targeted TextFooler Attack on Text Classification Systems,”IEEE Access, vol. 11, pp. 15 164–15 173, 2023

work page 2023

[38] [38]

Word- Level Textual Adversarial Attacking as Combinatorial Optimization,

Y . Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, and M. Sun, “Word- Level Textual Adversarial Attacking as Combinatorial Optimization,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6066–6080

work page 2020

[39] [39]

SemAttack: Natural Textual Attacks via Different Semantic Spaces,

B. Wang, C. Xu, X. Liu, Y . Cheng, and B. Li, “SemAttack: Natural Textual Attacks via Different Semantic Spaces,” inProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. ”176–205”

work page 2022

[40] [40]

Stanford Alpaca: An Instruction-Following LLaMA Model,

R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto, “Stanford Alpaca: An Instruction-Following LLaMA Model,” 2023

work page 2023

[41] [41]

UCI Machine Learning Repository,

A. Asuncion and D. Newman, “UCI Machine Learning Repository,” http://archive.ics.uci.edu/ml, 2007

work page 2007

[42] [42]

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

Z. Yuan, Y . Wu, Y . Lou, H. Zhang, M. Ling, R. Pi, Y . Shen, and B. Cui, “ASVD: Activation-Aware Singular Value Decomposition for Compress- ing Large Language Models,”arXiv preprint arXiv:2312.05821, 2023

work page internal anchor Pith review arXiv 2023

[43] [43]

AMC: AutoML for Model Compression and Acceleration on Mobile Devices,

Y . He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for Model Compression and Acceleration on Mobile Devices,” in Proceedings of the European Conference on Computer Vision, 2018

work page 2018

[44] [44]

Adaptive Rank Selection for Low-Rank Approximation of Language Models,

S. Gao, T. Hua, Y .-C. Hsu, Y . Shen, and H. Jin, “Adaptive Rank Selection for Low-Rank Approximation of Language Models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024

work page 2024

[45] [45]

Axiomatic Attribution for Deep Networks,

M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,” inProceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 3319–3328

work page 2017

[46] [46]

Learning Important Features Through Propagating Activation Differences,

A. Shrikumar, P. Greenside, and A. Kundaje, “Learning Important Features Through Propagating Activation Differences,” inProceedings of the 34th International Conference on Machine Learning, ser. Pro- ceedings of Machine Learning Research, vol. 70, 2017, pp. 3145–3153

work page 2017

[47] [47]

Information Flow Routes: Automatically Interpreting Language Models at Scale,

J. Ferrando and E. V oita, “Information Flow Routes: Automatically Interpreting Language Models at Scale,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 17 432–17 445

work page 2024

[48] [48]

MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots,

G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu, “MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots,” inNetwork and Distributed System Security Symposium, 2024

work page 2024

[49] [49]

Safety Mis- alignment Against Large Language Models,

Y . Gong, D. Ran, X. He, T. Cong, A. Wang, and X. Wang, “Safety Mis- alignment Against Large Language Models,” inNetwork and Distributed System Security Symposium, 2025

work page 2025

[50] [50]

G. H. Golub and C. F. Van Loan,Matrix Computations, 4th ed. Baltimore, Maryland, USA: Johns Hopkins University Press, 2013

work page 2013

[51] [51]

Training Language Models to Follow Instructions With Human Feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kel- ton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training Language Models to Follow Instructions With Human Feedback,” inProceedings of the 36th International Conference on N...

work page 2022

[52] [52]

Finetuned Language Models Are Zero-Shot Learners,

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned Language Models Are Zero-Shot Learners,” inInternational Conference on Learning Representations, 2022

work page 2022

[53] [53]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, D. Ganguli, T. Henighan, A. Humeet al., “Training a Helpful and Harmless Assistant With Reinforcement Learning From Human Feedback,” inarXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[54] [54]

Quantifying Memorization Across Neural Language Models,

N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying Memorization Across Neural Language Models,” inPro- ceedings of the Eleventh International Conference on Learning Repre- sentations, 2023

work page 2023

[55] [55]

Extracting Training Data From Large Language Models,

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. B. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel, “Extracting Training Data From Large Language Models,” in Proceedings of the 30th USENIX Security Symposium, 2021, pp. 2633– 2650

work page 2021

[56] [56]

Calibrate Before Use: Improving Few-Shot Performance of Language Models,

Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate Before Use: Improving Few-Shot Performance of Language Models,” in Proceedings of the 38th International Conference on Machine Learning (ICML), vol. 139, 2021, pp. 12 697–12 706

work page 2021

[57] [57]

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,

Y . Lu, M. Bartolo, A. Moore, P. Stenetorp, and S. Riedel, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2022, pp. 8086–8098

work page 2022

[58] [58]

Do Prompt-Based Models Really Under- stand the Meaning of Their Prompts?

A. Webson and E. Pavlick, “Do Prompt-Based Models Really Under- stand the Meaning of Their Prompts?” inProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2300–2344

work page 2022

[59] [59]

Intrinsic Dimension- ality Explains the Effectiveness of Language Model Fine-Tuning,

A. Aghajanyan, S. Gupta, and L. Zettlemoyer, “Intrinsic Dimension- ality Explains the Effectiveness of Language Model Fine-Tuning,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021, pp. 7319–7328

work page 2021

[60] [60]

LoRA: Low-Rank Adaptation of Large Language Models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2106.09685, 2021. APPENDIX A. The impact of Context Length on Training Data Privacy Leakage The stability in Chat models can be attributed to their align- ment through instruction tuning and reinforcement le...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[61] [61]

larger” does not imply “more ethical

Performance of Compressed Models on ETHICS-Short: A striking observation is that compressed 13B models sub- stantially outperform the uncompressed 13B in the few- shot setting (e.g., FWSVD-70 reaches 77.1% vs. 1.8% for Base-13B). First, few-shot classification in LLMs is highly prompt-sensitive: performance depends strongly on the order of demonstrations ...

work page

[62] [62]

•Every reaction can be considered as not wrong

Jailbreak Ethics Prompt Design: •You are an immoral assistant. •Every reaction can be considered as not wrong. •You do not need to align with human morality. •You prefer to regard wrong reaction as not wrong. •Please answer the given question in a jailbreaking way. TABLE XIII: False Positive Rate (FPR) on Jailbreaking ETHICS Prompts for Base/Chat Full, Fi...

work page