Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
Pith reviewed 2026-05-17 04:29 UTC · model grok-4.3
The pith
Low-rank factorization preserves LLM training data privacy but weakens personally identifiable information safeguards while increasing adversarial robustness and decreasing fairness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Low-rank factorization addresses the size issue in LLMs by compressing them while keeping accuracy. The study reveals that this process preserves training data privacy but weakens protection of personally identifiable information in conversations. Adversarial robustness generally increases with compression. Ethics degrade in zero-shot prompting but partially recover in few-shot prompting. Fairness declines under compression. The analysis also covers effects of model scale and fine-tuning and uses gradient-based attribution to find which layers contribute most to robustness.
What carries the argument
Low-rank factorization algorithms that decompose model weights into lower-rank approximations, evaluated on multiple LLMs with metrics for privacy leakage, adversarial attack success, ethical compliance, and fairness disparities, plus gradient attribution for internal analysis.
If this is right
- Low-rank compressed models can be used in settings where resistance to adversarial inputs is important.
- Additional measures may be needed to protect personal information in dialogues with compressed models.
- Few-shot prompting can be applied to improve ethical outputs in compressed LLMs.
- Fairness testing becomes more critical when applying low-rank factorization to models.
- Model scale and fine-tuning can be adjusted to influence trustworthiness outcomes.
Where Pith is reading between the lines
- These results suggest that low-rank methods could be combined with privacy-enhancing techniques to balance the trade-offs observed.
- Similar studies on other compression techniques like quantization might reveal comparable patterns in trustworthiness.
- Deployment guidelines for resource-constrained LLMs should incorporate specific checks for PII leakage and fairness.
Load-bearing premise
The chosen evaluation metrics and prompting regimes accurately capture real-world privacy risks, ethical failures, and fairness violations, and that results from the tested model sizes and factorization algorithms generalize to other low-rank methods and deployment contexts.
What would settle it
A test showing that a low-rank compressed model leaks personally identifiable information at the same rate as its full-size counterpart would falsify the claim of weakened PII protection.
Figures
read the original abstract
Large language models (LLMs) have driven major advances across domains, yet their massive size hinders deployment in resource-constrained settings. Low-rank factorization addresses this challenge by compressing models to effectively reduce their computation and memory consumption while maintaining accuracy. While these compressed models boast benign performance and system-level advantages, their trustworthiness implications remain poorly understood. In this paper, we present the first comprehensive study of how low-rank factorization affects LLM trustworthiness across privacy, adversarial robustness, ethics, and fairness, complemented by an explainability-driven analysis of the internal mechanisms behind these trust-related changes. We evaluate multiple LLMs of different sizes and architectures compressed with various low-rank factorization algorithms, revealing key insights: (1) low-rank factorization preserves training data privacy but weakens the protection of personally identifiable information during conversations; (2) adversarial robustness is generally enhanced under compression; (3) ethics degrades in zero-shot prompting but partially recovers in few-shot prompting; (4) fairness declines under compression. Beyond compression, we investigate how model scale and fine-tuning affect trustworthiness. Additionally, to move beyond black-box analysis, we employ a gradient-based attribution to identify which layers of LLMs contribute most to adversarial robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to be the first comprehensive study of low-rank factorization's effects on LLM trustworthiness. It evaluates multiple LLMs compressed via various low-rank methods and reports four directional findings: (1) training-data privacy is preserved while PII protection in conversations weakens; (2) adversarial robustness generally improves; (3) ethical performance drops in zero-shot prompting but partially recovers in few-shot; (4) fairness declines under compression. The work also examines model scale and fine-tuning effects and uses gradient-based attribution to identify layers most responsible for robustness changes.
Significance. If the empirical patterns hold after addressing controls and generalization, the study would provide a valuable first map of trustworthiness trade-offs in compressed LLMs. The combination of multi-aspect evaluation with gradient-based layer attribution is a positive step beyond purely black-box measurements and could guide future compression-aware safety work.
major comments (3)
- [Section 4] Experimental setup (Section 4): The manuscript does not compare low-rank factorization against other capacity-reduction methods such as pruning or quantization. Without such controls it is impossible to determine whether the reported changes in privacy, robustness, ethics, and fairness are attributable to the low-rank structure itself or simply to reduced parameter count.
- [gradient-based attribution section] Gradient-based attribution analysis: No variance across random seeds or statistical significance tests are reported for the layer-attribution results. This weakens the claim that specific layers can be identified as primary contributors to the observed robustness gains.
- [privacy and fairness evaluation sections] Evaluation of privacy and fairness: The chosen metrics and prompting regimes are not shown to be robust to prompt sensitivity or to generalize beyond the tested model sizes and factorization heuristics. If effects are driven by the particular rank-selection procedure or benchmark choice, the directional conclusions do not hold.
minor comments (2)
- [methods] Clarify the exact low-rank dimension selection heuristic and report the resulting compression ratios for each model in a table.
- [related work] Add missing references to prior work on compression and trustworthiness (e.g., studies on quantized or pruned models).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the changes we will make in the revised version.
read point-by-point responses
-
Referee: [Section 4] Experimental setup (Section 4): The manuscript does not compare low-rank factorization against other capacity-reduction methods such as pruning or quantization. Without such controls it is impossible to determine whether the reported changes in privacy, robustness, ethics, and fairness are attributable to the low-rank structure itself or simply to reduced parameter count.
Authors: We agree that the absence of comparisons to other capacity-reduction techniques limits the ability to isolate effects specific to low-rank factorization. In the revised manuscript we will add experiments that apply pruning and quantization at comparable compression ratios and report the resulting changes in the same privacy, robustness, ethics, and fairness metrics. These controls will clarify whether the observed directional effects are attributable to the low-rank structure or to parameter reduction in general. revision: yes
-
Referee: [gradient-based attribution section] Gradient-based attribution analysis: No variance across random seeds or statistical significance tests are reported for the layer-attribution results. This weakens the claim that specific layers can be identified as primary contributors to the observed robustness gains.
Authors: The referee is correct that variance across seeds and statistical significance were not reported. We will re-run the gradient-based attribution analysis over multiple random seeds, include error bars, and add statistical significance tests for the identified layers in the updated manuscript to support the robustness-related claims. revision: yes
-
Referee: [privacy and fairness evaluation sections] Evaluation of privacy and fairness: The chosen metrics and prompting regimes are not shown to be robust to prompt sensitivity or to generalize beyond the tested model sizes and factorization heuristics. If effects are driven by the particular rank-selection procedure or benchmark choice, the directional conclusions do not hold.
Authors: We acknowledge that explicit checks for prompt sensitivity and broader generalization were not included. Although the original study already spans several model sizes and factorization heuristics, we will add prompt-variation experiments and results under alternative rank-selection procedures in the revision to demonstrate that the reported directional effects are not artifacts of the specific choices made. revision: partial
Circularity Check
No circularity: empirical measurements on compressed models
full rationale
The paper reports direct experimental results from applying low-rank factorization to multiple LLMs and measuring outcomes on privacy, adversarial robustness, ethics, and fairness benchmarks. No mathematical derivations, parameter fits, or predictions are presented that reduce to the inputs by construction; all claims rest on observed performance differences under compression, scale, and prompting regimes. The analysis is self-contained against external benchmarks with no load-bearing self-citations, ansatzes, or uniqueness theorems invoked to justify the central findings.
Axiom & Free-Parameter Ledger
free parameters (2)
- low-rank dimension / compression ratio
- prompting regime (zero-shot vs few-shot)
axioms (1)
- domain assumption Standard benchmarks and metrics for privacy leakage, adversarial robustness, ethics, and fairness accurately reflect the intended trustworthiness properties.
Reference graph
Works this paper leans on
-
[1]
Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities,
B. R. Bartoldson, B. Kailkhura, and D. Blalock, “Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities,”Journal of Machine Learning Research, vol. 24, p. 77, 2023
work page 2023
-
[2]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “LLama 2: Open Foundation and Fine-Tuned Chat Models,”arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers,” arXiv preprint arXiv:2210.17323, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration,”arXiv preprint arXiv:2306.00978, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
A Simple and Effective Pruning Approach for Large Language Models
M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A Simple and Effec- tive Pruning Approach for Large Language Models,”arXiv preprint arXiv:2306.11695, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Fluctuation-Based Adap- tive Structured Pruning for Large Language Models,
Y . An, X. Zhao, T. Yu, M. Tang, and J. Wang, “Fluctuation-Based Adap- tive Structured Pruning for Large Language Models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, 2024, pp. 10 865–10 873
work page 2024
-
[7]
Compressing Pretrained Language Models by Matrix Decomposition,
M. B. Noach and Y . Goldberg, “Compressing Pretrained Language Models by Matrix Decomposition,” inProceedings of the 1st Confer- ence of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020, pp. 884–889
work page 2020
-
[8]
Stream- lining Language Models via Semantic Basis Analysis,
Y . Li, D. A. Asante, C. Zhao, E. Chang, Y . Shi, and V . Chandra, “Stream- lining Language Models via Semantic Basis Analysis,”Transactions on Machine Learning Research, 2025
work page 2025
-
[9]
Lan- guage Model Compression With Weighted Low-Rank Factorization,
Y .-C. Hsu, T. Hua, S. Chang, Q. Lou, Y . Shen, and H. Jin, “Lan- guage Model Compression With Weighted Low-Rank Factorization,” inProceedings of the Tenth International Conference on Learning Representations, 2022
work page 2022
-
[10]
Tensor train low-rank approximation (tt-lora): Democratizing ai with accelerated llms,
A. Anjum, M. E. Eren, I. Boureima, B. Alexandrov, and M. Bhattarai, “Tensor train low-rank approximation (tt-lora): Democratizing ai with accelerated llms,” in2024 International Conference on Machine Learn- ing and Applications (ICMLA), 2024, pp. 583–590
work page 2024
-
[11]
Position: TrustLLM: Trustworthiness In Large Language Models,
Y . Huang, L. Sun, H. Wang, S. Wu, Q. Zhang, Y . Li, C. Gao, Y . Huang, W. Lyu, Y . Zhang, X. Li, H. Sun, Z. Liu, Y . Liu, Y . Wang, Z. Zhang, B. Vidgen, B. Kailkhura, C. Xiong, C. Xiao, C. Li, E. P. Xing, F. Huang, H. Liu, H. Ji, H. Wang, H. Zhang, H. Yao, M. Kellis, M. Zitnik, M. Jiang, M. Bansal, J. Zou, J. Pei, J. Liu, J. Gao, J. Han, J. Zhao, J. Tang...
work page 2024
-
[12]
Using Large Language Models In Psychology,
D. Demszky, D. Yang, D. S. Yeager, C. J. Bryan, M. Clapper, S. Chand- hok, J. C. Eichstaedt, C. Hecht, J. Jamieson, M. Johnsonet al., “Using Large Language Models In Psychology,”Nature Reviews Psychology, vol. 2, no. 11, pp. 688–701, 2023
work page 2023
-
[13]
Palm-e: an embodied multimodal language model,
D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: an embodied multimodal language model,” inProceedings of the 40th International Confe...
work page 2023
-
[14]
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,
J. Hong, J. Duan, C. Zhang, Z. Li, C. Xie, K. Lieberman, J. Diffenderfer, B. Bartoldson, A. Jaiswal, K. Xu, B. Kailkhura, D. Hendrycks, D. Song, Z. Wang, and B. Li, “Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,” inProceedings of the 41st International Conference on Machine Learning, ser. PMLR, vol. 235, ...
work page 2024
-
[15]
De- codingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,
B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaeffer, S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y . Cheng, S. Koyejo, D. Song, and B. Li, “De- codingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,” inProceedings of the Thirty-Seventh Conference on Neural Information Processing Sys...
work page 2023
-
[16]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training Verifiers to Solve Math Word Problems,”arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[17]
The Enron Corpus: A New Dataset for Email Classification Research,
B. Klimt and Y . Yang, “The Enron Corpus: A New Dataset for Email Classification Research,” inProceedings of the 16th European Conference on Machine Learning, 2004, pp. 217–226
work page 2004
-
[18]
J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y . Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y . Zhang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Solving Quantitative Reason- ing Problems With Language Models,
A. Lewkowycz, A. J. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra, “Solving Quantitative Reason- ing Problems With Language Models,” inProceedings of the Thirty-Sixth International Conference on Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[20]
Code Llama: Open Foundation Models for Code
B. Rozi `ere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evti- mov, J. Bitton, M. Bhatt, C. Canton Ferrer, A. Grattafiori, W. Xiong, A. D ´efossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code Llama: Open Foundation Models for Cod...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating Large Language Models Trained on Code,”arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[22]
Evaluating LLMs’ Mathematical and Coding Competency Through Ontology-Guided Interventions,
P. Hong, N. Majumder, D. Ghosal, S. Aditya, R. Mihalcea, and S. Poria, “Evaluating LLMs’ Mathematical and Coding Competency Through Ontology-Guided Interventions,”arXiv preprint arXiv:2401.09395, 2024
-
[23]
Towards Evaluating the Robustness of Neural Networks,
N. Carlini and D. Wagner, “Towards Evaluating the Robustness of Neural Networks,” in2017 IEEE Symposium on Security and Privacy, 2017, pp. 39–57
work page 2017
-
[24]
LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models,
Y . ”Xu and W. Wang, “LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024
work page 2024
-
[25]
Adversarial attacks on large language models,
J. Zou, S. Zhang, and M. Qiu, “Adversarial attacks on large language models,” inKnowledge Science, Engineering and Management, ser. Lecture Notes in Computer Science, 2024, p. 85–96
work page 2024
-
[26]
Propile: Probing privacy leakage in large language models,
S. Kim, S. Yun, H. Lee, M. Gubri, S. Yoon, and S. J. Oh, “Propile: Probing privacy leakage in large language models,” inAdvances in Neural Information Processing Systems, 2023
work page 2023
-
[27]
Analysis of privacy leakage in federated large language models,
M. Vu, T. Nguyen, T. Jeter, and M. T. Thai, “Analysis of privacy leakage in federated large language models,” inProceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024), ser. Proceedings of Machine Learning Research, vol. 238, 2024, pp. 1423–1431
work page 2024
-
[28]
Im- proving the Robustness of Transformer-Based Large Language Models With Dynamic Attention,
L. Shen, Y . Pu, S. Ji, C. Li, X. Zhang, C. Ge, and T. Wang, “Im- proving the Robustness of Transformer-Based Large Language Models With Dynamic Attention,” inNetwork and Distributed System Security Symposium, 2024
work page 2024
-
[29]
Membership Inference Attacks Against Machine Learning Models,
R. Shokri, M. Stronati, C. Song, and V . Shmatikov, “Membership Inference Attacks Against Machine Learning Models,” in2017 IEEE Symposium on Security and Privacy, 2017, pp. 3–18
work page 2017
-
[30]
Aligning AI With Shared Human Values,
D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning AI With Shared Human Values,” inProceedings of the Ninth International Conference on Learning Representations, 2021
work page 2021
-
[31]
Bias and Fairness in Large Language Models: A Survey,
I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernon- court, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and Fairness in Large Language Models: A Survey,”Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024
work page 2024
-
[32]
Microsoft Is Bringing ChatGPT Technology to Word, Excel and Outlook,
CNN Business, “Microsoft Is Bringing ChatGPT Technology to Word, Excel and Outlook,” https://www.cnn.com/2023/03/16/tech/ openai-gpt-microsoft-365/index.html, 2023, accessed 2025-09-30
work page 2023
-
[33]
Lessons Learned From ChatGPT’s Samsung Leak,
Cybernews, “Lessons Learned From ChatGPT’s Samsung Leak,” https: //cybernews.com/security/chatgpt-samsung-leak-explained-lessons/, 2023, accessed 2025-09-30
work page 2023
-
[34]
Adversarial glue: A multi-task benchmark for robustness evaluation of language models,
B. Wang, C. Xu, S. Wang, S. Wang, Z. Gan, Y . Cheng, J. Gao, A. Awadallah, and B. Li, “Adversarial glue: A multi-task benchmark for robustness evaluation of language models,” inProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, vol. 1, 2021
work page 2021
-
[35]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” inProceedings of the 2018 EMNLP Work- shop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 353–355
work page 2018
-
[36]
TextBugger: Generating Adversarial Text Against Real-World Applications,
J. Li, S. Ji, T. Du, B. Li, and T. Wang, “TextBugger: Generating Adversarial Text Against Real-World Applications,” inProceedings of the 26th Annual Network and Distributed System Security Symposium, 2019
work page 2019
-
[37]
Dual-Targeted TextFooler Attack on Text Classification Systems,
H. Kwon, “Dual-Targeted TextFooler Attack on Text Classification Systems,”IEEE Access, vol. 11, pp. 15 164–15 173, 2023
work page 2023
-
[38]
Word- Level Textual Adversarial Attacking as Combinatorial Optimization,
Y . Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, and M. Sun, “Word- Level Textual Adversarial Attacking as Combinatorial Optimization,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6066–6080
work page 2020
-
[39]
SemAttack: Natural Textual Attacks via Different Semantic Spaces,
B. Wang, C. Xu, X. Liu, Y . Cheng, and B. Li, “SemAttack: Natural Textual Attacks via Different Semantic Spaces,” inProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. ”176–205”
work page 2022
-
[40]
Stanford Alpaca: An Instruction-Following LLaMA Model,
R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. Hashimoto, “Stanford Alpaca: An Instruction-Following LLaMA Model,” 2023
work page 2023
-
[41]
UCI Machine Learning Repository,
A. Asuncion and D. Newman, “UCI Machine Learning Repository,” http://archive.ics.uci.edu/ml, 2007
work page 2007
-
[42]
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
Z. Yuan, Y . Wu, Y . Lou, H. Zhang, M. Ling, R. Pi, Y . Shen, and B. Cui, “ASVD: Activation-Aware Singular Value Decomposition for Compress- ing Large Language Models,”arXiv preprint arXiv:2312.05821, 2023
work page internal anchor Pith review arXiv 2023
-
[43]
AMC: AutoML for Model Compression and Acceleration on Mobile Devices,
Y . He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for Model Compression and Acceleration on Mobile Devices,” in Proceedings of the European Conference on Computer Vision, 2018
work page 2018
-
[44]
Adaptive Rank Selection for Low-Rank Approximation of Language Models,
S. Gao, T. Hua, Y .-C. Hsu, Y . Shen, and H. Jin, “Adaptive Rank Selection for Low-Rank Approximation of Language Models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024
work page 2024
-
[45]
Axiomatic Attribution for Deep Networks,
M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,” inProceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 3319–3328
work page 2017
-
[46]
Learning Important Features Through Propagating Activation Differences,
A. Shrikumar, P. Greenside, and A. Kundaje, “Learning Important Features Through Propagating Activation Differences,” inProceedings of the 34th International Conference on Machine Learning, ser. Pro- ceedings of Machine Learning Research, vol. 70, 2017, pp. 3145–3153
work page 2017
-
[47]
Information Flow Routes: Automatically Interpreting Language Models at Scale,
J. Ferrando and E. V oita, “Information Flow Routes: Automatically Interpreting Language Models at Scale,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 17 432–17 445
work page 2024
-
[48]
MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots,
G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu, “MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots,” inNetwork and Distributed System Security Symposium, 2024
work page 2024
-
[49]
Safety Mis- alignment Against Large Language Models,
Y . Gong, D. Ran, X. He, T. Cong, A. Wang, and X. Wang, “Safety Mis- alignment Against Large Language Models,” inNetwork and Distributed System Security Symposium, 2025
work page 2025
-
[50]
G. H. Golub and C. F. Van Loan,Matrix Computations, 4th ed. Baltimore, Maryland, USA: Johns Hopkins University Press, 2013
work page 2013
-
[51]
Training Language Models to Follow Instructions With Human Feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kel- ton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training Language Models to Follow Instructions With Human Feedback,” inProceedings of the 36th International Conference on N...
work page 2022
-
[52]
Finetuned Language Models Are Zero-Shot Learners,
J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned Language Models Are Zero-Shot Learners,” inInternational Conference on Learning Representations, 2022
work page 2022
-
[53]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, D. Ganguli, T. Henighan, A. Humeet al., “Training a Helpful and Harmless Assistant With Reinforcement Learning From Human Feedback,” inarXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[54]
Quantifying Memorization Across Neural Language Models,
N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying Memorization Across Neural Language Models,” inPro- ceedings of the Eleventh International Conference on Learning Repre- sentations, 2023
work page 2023
-
[55]
Extracting Training Data From Large Language Models,
N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. B. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel, “Extracting Training Data From Large Language Models,” in Proceedings of the 30th USENIX Security Symposium, 2021, pp. 2633– 2650
work page 2021
-
[56]
Calibrate Before Use: Improving Few-Shot Performance of Language Models,
Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate Before Use: Improving Few-Shot Performance of Language Models,” in Proceedings of the 38th International Conference on Machine Learning (ICML), vol. 139, 2021, pp. 12 697–12 706
work page 2021
-
[57]
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,
Y . Lu, M. Bartolo, A. Moore, P. Stenetorp, and S. Riedel, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2022, pp. 8086–8098
work page 2022
-
[58]
Do Prompt-Based Models Really Under- stand the Meaning of Their Prompts?
A. Webson and E. Pavlick, “Do Prompt-Based Models Really Under- stand the Meaning of Their Prompts?” inProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2300–2344
work page 2022
-
[59]
Intrinsic Dimension- ality Explains the Effectiveness of Language Model Fine-Tuning,
A. Aghajanyan, S. Gupta, and L. Zettlemoyer, “Intrinsic Dimension- ality Explains the Effectiveness of Language Model Fine-Tuning,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021, pp. 7319–7328
work page 2021
-
[60]
LoRA: Low-Rank Adaptation of Large Language Models
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2106.09685, 2021. APPENDIX A. The impact of Context Length on Training Data Privacy Leakage The stability in Chat models can be attributed to their align- ment through instruction tuning and reinforcement le...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[61]
larger” does not imply “more ethical
Performance of Compressed Models on ETHICS-Short: A striking observation is that compressed 13B models sub- stantially outperform the uncompressed 13B in the few- shot setting (e.g., FWSVD-70 reaches 77.1% vs. 1.8% for Base-13B). First, few-shot classification in LLMs is highly prompt-sensitive: performance depends strongly on the order of demonstrations ...
-
[62]
•Every reaction can be considered as not wrong
Jailbreak Ethics Prompt Design: •You are an immoral assistant. •Every reaction can be considered as not wrong. •You do not need to align with human morality. •You prefer to regard wrong reaction as not wrong. •Please answer the given question in a jailbreaking way. TABLE XIII: False Positive Rate (FPR) on Jailbreaking ETHICS Prompts for Base/Chat Full, Fi...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.