Test-Time Safety Alignment
Pith reviewed 2026-05-07 16:02 UTC · model grok-4.3
The pith
Optimizing input word embeddings at test time can neutralize every safety-flagged response from aligned language models on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses: the gradient of a black-box text-moderation API's harm score is estimated with zeroth-order methods and used for gradient descent on the embeddings, and this process neutralizes every safety-flagged response on standard safety benchmarks.
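As a worked sketch of the kind of update this implies (the abstract does not pin down the estimator, so the standard two-point random gradient-free form is used here): let f(e) be the moderation API's harm score for the response generated from input embeddings e, u a random Gaussian direction, μ > 0 a smoothing radius, and η a step size. Then

$$
\hat{g} \;=\; \frac{f(e + \mu u) - f(e - \mu u)}{2\mu}\, u,
\qquad
e \;\leftarrow\; e - \eta\, \hat{g},
$$

and only evaluations of f are required, which is what makes a black-box moderation API usable as the steering signal.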
What carries the argument
Zeroth-order estimation of the gradient of a black-box moderation API's harm score with respect to the input word embeddings, followed by gradient descent on those embeddings to reduce the score.
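A minimal sketch of that loop, assuming hypothetical callables generate_fn (decodes a response from the aligned model given an embedding array) and score_fn (queries the black-box moderation API and returns a scalar harm score); the number of perturbation directions, step sizes, and stopping threshold below are illustrative, not taken from the paper:

```python
import numpy as np

def zeroth_order_harm_descent(embeddings, generate_fn, score_fn,
                              steps=50, mu=1e-2, lr=1e-2, n_dirs=8,
                              flag_threshold=0.5):
    """Lower a black-box harm score by gradient descent on the input
    word embeddings, using two-point zeroth-order gradient estimates.

    generate_fn: (seq_len, dim) embedding array -> response text
    score_fn:    response text -> scalar harm score from the moderation API
    """
    e = embeddings.copy()
    for _ in range(steps):
        grad_est = np.zeros_like(e)
        for _ in range(n_dirs):
            u = np.random.randn(*e.shape)              # random search direction
            f_plus = score_fn(generate_fn(e + mu * u))
            f_minus = score_fn(generate_fn(e - mu * u))
            grad_est += (f_plus - f_minus) / (2.0 * mu) * u
        grad_est /= n_dirs
        e = e - lr * grad_est                          # descend on the estimated gradient
        if score_fn(generate_fn(e)) < flag_threshold:  # stop once no longer flagged
            break
    return e
```

Every estimate costs one full generation plus one API call per perturbation, so the direction count and step budget are the main cost knobs; nothing requires gradients from the model or the API.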
Load-bearing premise
That sub-lexical changes to input embeddings will reliably lower the moderation API's harm signal without degrading model utility or creating new unsafe outputs.
What would settle it
Finding even one safety-flagged response after optimization on a standard benchmark, or measuring a clear drop in performance on non-safety tasks with the optimized embeddings.
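A sketch of how that check could be run, with flagged (a predicate backed by the moderation API or an independent judge) and the utility numbers supplied externally; the names and the two-point tolerance are illustrative assumptions, not details from the paper:

```python
def claim_survives(prompts, optimize_fn, generate_fn, flagged,
                   utility_before, utility_after, tol=0.02):
    """Falsification check for the core claim: a single flagged response
    after optimization, or a clear drop on non-safety tasks, settles it
    in the negative."""
    remaining_flags = sum(flagged(generate_fn(optimize_fn(p))) for p in prompts)
    utility_drop = utility_before - utility_after
    return remaining_flags == 0 and utility_drop <= tol
```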
Original abstract
Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes optimizing the input word embeddings of aligned LLMs at test time via zeroth-order gradient estimation with respect to a black-box text-moderation API score. The goal is to steer the model's bimodal refuse/comply output distribution toward lower semantic harmfulness. The experiments are reported to neutralize all safety-flagged responses on standard safety benchmarks.
Significance. If the optimization reliably reduces actual semantic harm (rather than merely evading the specific API) while preserving utility, the approach would offer a practical test-time control mechanism for safety without retraining. The extension from prior sub-lexical profanity reduction to aligned models' refuse/comply regime is a natural next step, but the significance hinges on validation that the API proxy is faithful and that no new risks or capability losses are introduced.
major comments (3)
- [Abstract] Abstract: The central claim that the method 'can neutralize every safety-flagged response on standard safety benchmarks' is asserted without details on benchmark composition, number of test cases, statistical significance, or any measurement of side effects such as utility degradation or the introduction of new unsafe modes.
- [Method] Method section: The optimization targets the black-box API score directly; no experiments are described that validate whether reductions in the API score correspond to reductions in human-judged semantic harmfulness (as opposed to exploiting the API's particular decision boundary or training artifacts).
- [Experiments] Experiments section: No evaluation is reported of post-optimization performance on standard utility benchmarks (e.g., MMLU, GSM8K, or instruction-following tasks), which is required to establish that safety gains do not come at the expense of model capability.
minor comments (1)
- [Abstract] The phrase 'imbalanced bimodal refuse-or-comply output distribution' in the abstract would benefit from a brief definition or citation to prior work characterizing this behavior in aligned models.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important areas for strengthening the presentation of our test-time embedding optimization approach for safety alignment. We address each major comment below and commit to a major revision that incorporates additional details, analyses, and evaluations.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the method 'can neutralize every safety-flagged response on standard safety benchmarks' is asserted without details on benchmark composition, number of test cases, statistical significance, or any measurement of side effects such as utility degradation or the introduction of new unsafe modes.
Authors: We agree that the abstract is overly concise and should better contextualize the claims. In the revised manuscript we will expand the abstract to specify the benchmark composition (standard safety datasets such as those commonly used for refusal/compliance evaluation), the total number of test cases evaluated, and a brief statement on statistical reliability across runs. We will also note that side-effect measurements, including utility and new unsafe modes, are addressed in the experiments section of the revision. revision: yes
-
Referee: [Method] Method section: The optimization targets the black-box API score directly; no experiments are described that validate whether reductions in the API score correspond to reductions in human-judged semantic harmfulness (as opposed to exploiting the API's particular decision boundary or training artifacts).
Authors: This is a fair and substantive point. While the black-box moderation API serves as a standard, reproducible proxy in the safety literature, we acknowledge that explicit validation against human semantic harm judgments would increase confidence. In the revision we will add a targeted analysis (new subsection or appendix) that includes manual inspection of a representative sample of pre- and post-optimization responses, along with a discussion of the API's known alignment with human harm categories. If feasible within the revision timeline, we will also report a small human annotation study on a subset of cases. revision: partial
-
Referee: [Experiments] Experiments section: No evaluation is reported of post-optimization performance on standard utility benchmarks (e.g., MMLU, GSM8K, or instruction-following tasks), which is required to establish that safety gains do not come at the expense of model capability.
Authors: We concur that demonstrating preservation of general capabilities is essential for a complete evaluation of test-time interventions. The current manuscript focuses on safety neutralization; we will revise the experiments section to include before-and-after results on standard utility benchmarks (MMLU, GSM8K, and instruction-following tasks) using the same models and optimization procedure. This will quantify any capability impact and support the claim that the sub-lexical embedding adjustments are localized. revision: yes
Circularity Check
No circularity: empirical black-box optimization with external API
full rationale
The paper presents a test-time method that optimizes input embeddings via zeroth-order gradient estimation against a black-box moderation API score to reduce harmfulness in aligned model outputs. No derivation chain, first-principles result, or prediction is claimed; the work is purely empirical and relies on external benchmark evaluations and an independent API signal. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the described approach. The central claim (neutralization on benchmarks) is tested directly against external benchmarks rather than derived from quantities internal to the method.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Zeroth-order gradient estimation can provide useful directional information for black-box objective functions
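The standard justification for this axiom (a sketch following the usual random gradient-free analysis, not anything stated in the paper): the two-point estimate is, in expectation, the exact gradient of a Gaussian-smoothed version of the objective. For a harm score f over embeddings e and smoothing radius μ,

$$
f_\mu(e) = \mathbb{E}_{u \sim \mathcal{N}(0, I)}\!\left[f(e + \mu u)\right],
\qquad
\mathbb{E}_{u}\!\left[\frac{f(e + \mu u) - f(e - \mu u)}{2\mu}\, u\right] = \nabla f_\mu(e),
$$

so on average the estimate points along a descent direction for a smoothed surrogate of the API score, even though no true gradient of f is ever observed.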