Test-Time Safety Alignment
Pith reviewed 2026-05-07 16:02 UTC · model grok-4.3
The pith
Optimizing input word embeddings at test time can neutralize every safety-flagged response from aligned language models on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses: the gradient of a black-box text-moderation API's harm score is estimated with zeroth-order methods and used for gradient descent on the embeddings, and this process neutralizes every safety-flagged response on standard safety benchmarks.
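As a worked sketch of the kind of update this implies (the abstract does not pin down the estimator, so the standard two-point random gradient-free form is used here): let f(e) be the moderation API's harm score for the response generated from input embeddings e, u a random Gaussian direction, μ > 0 a smoothing radius, and η a step size. Then

$$
\hat{g} \;=\; \frac{f(e + \mu u) - f(e - \mu u)}{2\mu}\, u,
\qquad
e \;\leftarrow\; e - \eta\, \hat{g},
$$

and only evaluations of f are required, which is what makes a black-box moderation API usable as the steering signal.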
What carries the argument
Zeroth-order estimation of the gradient of a black-box moderation API's harm score with respect to the input word embeddings, followed by gradient descent on those embeddings to reduce the score.
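A minimal sketch of that loop, assuming hypothetical callables generate_fn (decodes a response from the aligned model given an embedding array) and score_fn (queries the black-box moderation API and returns a scalar harm score); the number of perturbation directions, step sizes, and stopping threshold below are illustrative, not taken from the paper:

```python
import numpy as np

def zeroth_order_harm_descent(embeddings, generate_fn, score_fn,
                              steps=50, mu=1e-2, lr=1e-2, n_dirs=8,
                              flag_threshold=0.5):
    """Lower a black-box harm score by gradient descent on the input
    word embeddings, using two-point zeroth-order gradient estimates.

    generate_fn: (seq_len, dim) embedding array -> response text
    score_fn:    response text -> scalar harm score from the moderation API
    """
    e = embeddings.copy()
    for _ in range(steps):
        grad_est = np.zeros_like(e)
        for _ in range(n_dirs):
            u = np.random.randn(*e.shape)              # random search direction
            f_plus = score_fn(generate_fn(e + mu * u))
            f_minus = score_fn(generate_fn(e - mu * u))
            grad_est += (f_plus - f_minus) / (2.0 * mu) * u
        grad_est /= n_dirs
        e = e - lr * grad_est                          # descend on the estimated gradient
        if score_fn(generate_fn(e)) < flag_threshold:  # stop once no longer flagged
            break
    return e
```

Every estimate costs one full generation plus one API call per perturbation, so the direction count and step budget are the main cost knobs; nothing requires gradients from the model or the API.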
Load-bearing premise
That sub-lexical changes to input embeddings will reliably lower the moderation API's harm signal without degrading model utility or creating new unsafe outputs.
What would settle it
Finding even one safety-flagged response after optimization on a standard benchmark, or measuring a clear drop in performance on non-safety tasks with the optimized embeddings.
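A sketch of how that check could be run, with flagged (a predicate backed by the moderation API or an independent judge) and the utility numbers supplied externally; the names and the two-point tolerance are illustrative assumptions, not details from the paper:

```python
def claim_survives(prompts, optimize_fn, generate_fn, flagged,
                   utility_before, utility_after, tol=0.02):
    """Falsification check for the core claim: a single flagged response
    after optimization, or a clear drop on non-safety tasks, settles it
    in the negative."""
    remaining_flags = sum(flagged(generate_fn(optimize_fn(p))) for p in prompts)
    utility_drop = utility_before - utility_after
    return remaining_flags == 0 and utility_drop <= tol
```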
Original abstract
Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes optimizing the input word embeddings of aligned LLMs at test time via zeroth-order gradient estimation with respect to a black-box text-moderation API score. The goal is to steer the model's bimodal refuse/comply output distribution toward lower semantic harmfulness. The experiments are reported to neutralize all safety-flagged responses on standard safety benchmarks.
Significance. If the optimization reliably reduces actual semantic harm (rather than merely evading the specific API) while preserving utility, the approach would offer a practical test-time control mechanism for safety without retraining. The extension from prior sub-lexical profanity reduction to aligned models' refuse/comply regime is a natural next step, but the significance hinges on validation that the API proxy is faithful and that no new risks or capability losses are introduced.
major comments (3)
- [Abstract] Abstract: The central claim that the method 'can neutralize every safety-flagged response on standard safety benchmarks' is asserted without details on benchmark composition, number of test cases, statistical significance, or any measurement of side effects such as utility degradation or the introduction of new unsafe modes.
- [Method] Method section: The optimization targets the black-box API score directly; no experiments are described that validate whether reductions in the API score correspond to reductions in human-judged semantic harmfulness (as opposed to exploiting the API's particular decision boundary or training artifacts).
- [Experiments] Experiments section: No evaluation is reported of post-optimization performance on standard utility benchmarks (e.g., MMLU, GSM8K, or instruction-following tasks), which is required to establish that safety gains do not come at the expense of model capability.
minor comments (1)
- [Abstract] The phrase 'imbalanced bimodal refuse-or-comply output distribution' in the abstract would benefit from a brief definition or citation to prior work characterizing this behavior in aligned models.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important areas for strengthening the presentation of our test-time embedding optimization approach for safety alignment. We address each major comment below and commit to a major revision that incorporates additional details, analyses, and evaluations.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the method 'can neutralize every safety-flagged response on standard safety benchmarks' is asserted without details on benchmark composition, number of test cases, statistical significance, or any measurement of side effects such as utility degradation or the introduction of new unsafe modes.
Authors: We agree that the abstract is overly concise and should better contextualize the claims. In the revised manuscript we will expand the abstract to specify the benchmark composition (standard safety datasets such as those commonly used for refusal/compliance evaluation), the total number of test cases evaluated, and a brief statement on statistical reliability across runs. We will also note that side-effect measurements, including utility and new unsafe modes, are addressed in the experiments section of the revision. revision: yes
-
Referee: [Method] Method section: The optimization targets the black-box API score directly; no experiments are described that validate whether reductions in the API score correspond to reductions in human-judged semantic harmfulness (as opposed to exploiting the API's particular decision boundary or training artifacts).
Authors: This is a fair and substantive point. While the black-box moderation API serves as a standard, reproducible proxy in the safety literature, we acknowledge that explicit validation against human semantic harm judgments would increase confidence. In the revision we will add a targeted analysis (new subsection or appendix) that includes manual inspection of a representative sample of pre- and post-optimization responses, along with a discussion of the API's known alignment with human harm categories. If feasible within the revision timeline, we will also report a small human annotation study on a subset of cases. revision: partial
-
Referee: [Experiments] Experiments section: No evaluation is reported of post-optimization performance on standard utility benchmarks (e.g., MMLU, GSM8K, or instruction-following tasks), which is required to establish that safety gains do not come at the expense of model capability.
Authors: We concur that demonstrating preservation of general capabilities is essential for a complete evaluation of test-time interventions. The current manuscript focuses on safety neutralization; we will revise the experiments section to include before-and-after results on standard utility benchmarks (MMLU, GSM8K, and instruction-following tasks) using the same models and optimization procedure. This will quantify any capability impact and support the claim that the sub-lexical embedding adjustments are localized. revision: yes
Circularity Check
No circularity: empirical black-box optimization with external API
full rationale
The paper presents a test-time method that optimizes input embeddings via zeroth-order gradient estimation against a black-box moderation API score to reduce harmfulness in aligned model outputs. No derivation chain, first-principles result, or prediction is claimed; the work is purely empirical and relies on external benchmark evaluations and an independent API signal. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the described approach. The central claim (neutralization on benchmarks) is tested directly against external benchmarks rather than derived from quantities internal to the method.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Zeroth-order gradient estimation can provide useful directional information for black-box objective functions
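The standard justification for this axiom (a sketch following the usual random gradient-free analysis, not anything stated in the paper): the two-point estimate is, in expectation, the exact gradient of a Gaussian-smoothed version of the objective. For a harm score f over embeddings e and smoothing radius μ,

$$
f_\mu(e) = \mathbb{E}_{u \sim \mathcal{N}(0, I)}\!\left[f(e + \mu u)\right],
\qquad
\mathbb{E}_{u}\!\left[\frac{f(e + \mu u) - f(e - \mu u)}{2\mu}\, u\right] = \nabla f_\mu(e),
$$

so on average the estimate points along a descent direction for a smoothed surrogate of the API score, even though no true gradient of f is ever observed.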