pith. machine review for the scientific record.

arxiv: 2604.15789 · v1 · submitted 2026-04-17 · 💻 cs.CL

Recognition: unknown

A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords training-free methods · trustworthy LLMs · large language models · inference-time interventions · taxonomy · trade-offs · robustness · utility

The pith

Training-free methods to make large language models trustworthy show clear trade-offs in utility, robustness, and cost depending on where they intervene.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper groups existing training-free techniques into three categories based on whether they change the input prompt, alter internal model computations, or adjust the generated output. It then tests representative methods from each category on multiple large language model families and sizes using a range of trustworthiness tasks such as reducing harmful content, bias, and unsupported claims. The evaluation tracks not only safety gains but also losses in task performance, vulnerability to attacks, and extra computation time. Results indicate that each intervention level improves some trustworthiness aspects while harming others, and that no current approach fully resolves all issues without side effects. The authors close with practical suggestions for choosing methods when training is not an option.

Core claim

Existing training-free methods can be organized into input-level, internal-level, and output-level interventions in the inference process. Comprehensive tests across model families and sizes reveal that these methods improve selected trustworthiness properties but frequently reduce utility, increase brittleness to adversarial inputs, and add computational overhead, with different levels producing distinct patterns of gains and losses.

What carries the argument

The three-level taxonomy of intervention points during inference: input modifications, internal state changes, and output post-processing.
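
To make the taxonomy concrete, the sketch below shows the three intervention points as optional hooks around a single generation call. This is an illustrative reading of the taxonomy, not code from the paper: `generate_with_hooks`, the hook signatures, and the assumed model interface (a `.generate(prompt) -> str` method plus a PyTorch-style `register_forward_hook`) are all hypothetical.

```python
# Illustrative sketch of the three intervention levels as hooks around one
# generation call. All names are hypothetical, not an API from the paper.
from typing import Callable, Optional

def generate_with_hooks(
    model,                                               # assumed: .generate(str) -> str, PyTorch-style hooks
    prompt: str,
    input_hook: Optional[Callable[[str], str]] = None,   # input level: rewrite the prompt
    internal_hook: Optional[Callable] = None,            # internal level: edit hidden states
    output_hook: Optional[Callable[[str], str]] = None,  # output level: post-process the text
) -> str:
    # Input-level intervention: safety reminders, goal prioritization, in-context demos.
    if input_hook is not None:
        prompt = input_hook(prompt)

    # Internal-level intervention: e.g. add a steering vector to hidden activations.
    handle = model.register_forward_hook(internal_hook) if internal_hook else None
    try:
        text = model.generate(prompt)
    finally:
        if handle is not None:
            handle.remove()  # leave the model unmodified for the next call

    # Output-level intervention: filtering, re-ranking, self-examination of the text.
    if output_hook is not None:
        text = output_hook(text)
    return text
```

Under this framing, a safety self-reminder would be one choice of input_hook, activation steering one choice of internal_hook, and output filtering or re-ranking one choice of output_hook; the paper's point is that each slot buys different trustworthiness gains at different costs.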

If this is right

  • Input-level methods tend to be low-cost but offer shallower safety improvements than internal or output methods.
  • Internal interventions often carry higher computational cost and can introduce new brittleness.
  • Output-level fixes are easy to apply yet leave the model vulnerable to attacks on earlier stages.
  • No single level covers every trustworthiness dimension, so deployment choices must weigh specific risks against performance drops.
  • Balancing trustworthiness with utility and robustness requires explicit testing rather than assuming safety gains come for free.
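
One way to act on the last point is to score every intervention on all four axes at once rather than reporting safety alone. The following is a hedged sketch of such a joint measurement; the scorer callables are placeholders for whichever trustworthiness, utility, and attack benchmarks a deployment actually uses.

```python
import time

def evaluate_intervention(generate, prompts, safety_eval, utility_eval, attack_eval):
    """Score one (possibly intervened) generate function on four axes at once.
    The three *_eval callables are stand-ins for concrete benchmark suites."""
    start = time.perf_counter()
    outputs = [generate(p) for p in prompts]
    seconds_per_prompt = (time.perf_counter() - start) / max(len(prompts), 1)
    return {
        "safety": safety_eval(prompts, outputs),      # e.g. refusal rate on harmful prompts
        "utility": utility_eval(prompts, outputs),    # e.g. task accuracy; watch for drops
        "attack_success": attack_eval(generate),      # e.g. jailbreak success rate
        "seconds_per_prompt": seconds_per_prompt,     # computational overhead
    }

# Compare against a no-intervention baseline before deciding to deploy:
#   baseline = evaluate_intervention(model.generate, prompts, ...)
#   steered  = evaluate_intervention(
#       lambda p: generate_with_hooks(model, p, internal_hook=steer), prompts, ...)
```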

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid approaches that combine two levels could offset the weaknesses each level shows in isolation.
  • The observed patterns suggest trustworthiness fixes may need to be tuned per model family even without retraining.
  • Extending similar tests to multimodal or agentic systems would clarify whether the same level-based trade-offs persist.
  • Practitioners should include adversarial robustness checks as a standard part of adopting any training-free method.

Load-bearing premise

The chosen representative methods, trustworthiness tasks, and model families are broad enough to reveal the general trade-offs that would appear in the full range of training-free techniques and deployment conditions.

What would settle it

A training-free method that raises all measured trustworthiness scores while leaving utility, robustness to attacks, and runtime unchanged across several model sizes and families would falsify the reported trade-offs.
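
Stated as a check, that falsification condition compares a candidate method's scores against a no-intervention baseline; a minimal sketch, with illustrative metric names and an assumed tolerance for "unchanged":

```python
def would_falsify_tradeoffs(baseline, candidate, tol=0.01):
    """True if the candidate improves every trustworthiness metric while utility,
    attack robustness, and runtime stay within `tol` of the baseline.
    The 'trust_*' naming convention and the tolerance are illustrative."""
    trust_metrics = [k for k in baseline if k.startswith("trust_")]
    improves_all_trust = all(candidate[k] > baseline[k] for k in trust_metrics)
    side_effects_unchanged = all(
        abs(candidate[k] - baseline[k]) <= tol
        for k in ("utility", "robustness", "seconds_per_prompt")
    )
    return improves_all_trust and side_effects_unchanged
```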

Figures

Figures reproduced from arXiv: 2604.15789 by Michael Backes, Mingjie Li, Wai Man Si, Yang Zhang.

Figure 1. An overview of the pipeline used in the taxonomy.
original abstract

As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model's information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various representative and effective methods from each level across different LLM families and sizes. Our analysis highlights several trade-offs and unresolved challenges in current approaches. We summarize key findings and limitations in the existing literature, and propose practical recommendations for balancing trustworthiness, utility, and robustness in LLMs without the need for additional training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a three-level taxonomy (input, internal, output) for training-free methods aimed at enhancing the trustworthiness of large language models. Through this taxonomy, it conducts a systematic re-evaluation of representative methods across various LLM families and sizes, examining their effects on trustworthiness metrics, utility, robustness, and computational costs, while identifying key trade-offs and unresolved challenges, and offering practical recommendations for balancing these aspects without additional training.

Significance. Should the analysis prove robust and comprehensive, the paper's taxonomy and highlighted trade-offs could serve as a valuable reference for researchers and practitioners working on trustworthy AI, emphasizing the need to consider side effects like utility degradation and brittleness in training-free interventions.

major comments (1)
  1. [Abstract] The abstract outlines the taxonomy and the comprehensive analysis but gives no specifics on the benchmarks employed, the statistical methods, data-exclusion rules, or controls for side effects. This omission makes it hard to verify how robust the reported trade-offs among trustworthiness, utility, and robustness really are.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and describe the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] The abstract outlines the taxonomy and the comprehensive analysis but gives no specifics on the benchmarks employed, the statistical methods, data-exclusion rules, or controls for side effects. This omission makes it hard to verify how robust the reported trade-offs among trustworthiness, utility, and robustness really are.

    Authors: We agree that the abstract would benefit from additional high-level details to improve verifiability. In the revised version we will expand the abstract to concisely note the evaluation benchmarks (trustworthiness, utility, and robustness suites), the statistical procedures (multi-seed averaging with variance reporting), and the controls for side effects (joint measurement of utility degradation, attack robustness, and overhead). Full specifications of data exclusion criteria and experimental protocols remain in Sections 3–5 of the main text; the abstract revision will summarize these without exceeding length limits. revision: yes
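
    For reference, "multi-seed averaging with variance reporting" is standard practice; a minimal sketch (the seed count and return format are illustrative, not taken from the paper):

```python
import statistics

def aggregate_over_seeds(run_once, seeds=(0, 1, 2)):
    """Run the same evaluation under several seeds; report mean and sample
    standard deviation per metric. `run_once(seed)` returns a dict of floats."""
    runs = [run_once(seed) for seed in seeds]
    return {
        metric: (statistics.mean([r[metric] for r in runs]),
                 statistics.stdev([r[metric] for r in runs]) if len(runs) > 1 else 0.0)
        for metric in runs[0]
    }

# Example result: {"safety": (0.82, 0.03), "utility": (0.61, 0.05)}
```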

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical survey paper that introduces a taxonomy of training-free methods (input/internal/output levels) and re-evaluates representative methods across LLMs and trustworthiness settings. No derivations, equations, fitted parameters, or predictions appear in the provided text; all claims rest on external benchmarks, cited prior methods, and observed trade-offs rather than any self-referential reduction or self-citation chain that bears the central load. The analysis is self-contained against external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard domain assumptions about benchmark validity for trustworthiness dimensions, with no free parameters, invented entities, or ad-hoc axioms introduced by the paper itself.

axioms (1)
  • domain assumption: Existing benchmarks and settings adequately capture the key dimensions of trustworthiness, utility, robustness, and computational overhead for LLMs.
    Invoked when conducting the comprehensive analysis and highlighting trade-offs across methods and models.

pith-pipeline@v0.9.0 · 5526 in / 1310 out tokens · 61140 ms · 2026-05-10T08:16:22.320601+00:00 · methodology

discussion (0)

