ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

Chengcan Wu; Meng Sun; Zeming Wei

arxiv: 2506.01770 · v2 · submitted 2025-06-02 · cs.CR · cs.AI · cs.LG · cs.SE

Pith reviewed 2026-05-19 11:29 UTC · model grok-4.3

keywords: LLM safety · representation engineering · model-based abstraction · jailbreaking · AI security · safety monitoring · harm detection · scalable safeguards

The pith

LLMs contain low-dimensional safety-critical representations that allow construction of scalable abstract models for harm detection and jailbreak resistance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ReGA, a framework that guides abstraction of LLM hidden states using safety-critical directions to create a compact model for monitoring safety. This tackles the scalability barrier that prevents traditional model-based analysis from working on the huge feature spaces of large language models. A sympathetic reader would care because it offers an interpretable and efficient way to check prompts and ongoing conversations for harmful content without retraining the LLM itself. The evaluation reports strong separation between safe and unsafe cases plus resistance to attacks and consistency across safety viewpoints.

Core claim

ReGA is a model-based analysis framework with Representation-Guided Abstraction to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are key directions in hidden states that indicate safety-related concepts, ReGA effectively narrows the scalability gap when developing the abstract model for safety modeling. Comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability.

What carries the argument

Representation-Guided Abstraction, which extracts key safety directions from LLM hidden states to reduce the full feature space into a smaller abstract model that still carries safety information for detection.
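To make the idea concrete, here is a minimal sketch of direction-guided abstraction on toy vectors. It is not the paper's procedure: the difference-of-class-means direction, the uniform binning into abstract states, and every name in the code (`safety_direction`, `fit_abstract_model`, `harm_score`, the toy "hidden states") are assumptions chosen for illustration. The paper passage quoted elsewhere on this page mentions PCA and K-Means for these steps; simpler stand-ins are used here to keep the example dependency-free.

```python
from collections import Counter

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def safety_direction(safe_states, harmful_states):
    """Unit vector from the safe class mean to the harmful class mean (assumed stand-in)."""
    mu_safe = mean_vector(safe_states)
    mu_harm = mean_vector(harmful_states)
    diff = [h - s for h, s in zip(mu_harm, mu_safe)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(state, direction):
    """Scalar coordinate of a hidden state along the direction."""
    return sum(x * d for x, d in zip(state, direction))

def abstract_state(value, lo, hi, n_bins):
    """Map a projected value to one of n_bins abstract states by uniform binning."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def fit_abstract_model(safe_states, harmful_states, n_bins=4):
    """Build a tiny abstract model: a direction plus a harm rate per abstract state."""
    direction = safety_direction(safe_states, harmful_states)
    proj_safe = [project(s, direction) for s in safe_states]
    proj_harm = [project(s, direction) for s in harmful_states]
    lo, hi = min(proj_safe + proj_harm), max(proj_safe + proj_harm)
    harm_counts = Counter(abstract_state(p, lo, hi, n_bins) for p in proj_harm)
    all_counts = Counter(abstract_state(p, lo, hi, n_bins) for p in proj_safe + proj_harm)
    harm_rate = {b: harm_counts[b] / all_counts[b] for b in all_counts}
    return {"direction": direction, "lo": lo, "hi": hi,
            "n_bins": n_bins, "harm_rate": harm_rate}

def harm_score(model, state):
    """Harm rate of the abstract state a new hidden state falls into (0.5 if unseen)."""
    p = project(state, model["direction"])
    b = abstract_state(p, model["lo"], model["hi"], model["n_bins"])
    return model["harm_rate"].get(b, 0.5)

# Toy 3-dimensional "hidden states"; real ones would have thousands of dimensions.
safe = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.1], [0.2, 1.1, 0.0]]
harmful = [[1.0, 1.0, 0.1], [0.9, 0.8, 0.2], [1.1, 1.2, 0.0]]
model = fit_abstract_model(safe, harmful)
print(harm_score(model, [0.95, 1.0, 0.1]), harm_score(model, [0.05, 1.0, 0.1]))
```

The point of the sketch is the shape of the reduction: a high-dimensional state collapses to one coordinate and then to a handful of discrete states, so the monitor only has to reason over a small state space.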

Load-bearing premise

Low-dimensional safety-critical representations exist inside LLMs and can be extracted and used to build an abstract model that keeps the safety information needed while avoiding too many missed threats or scaling problems.

What would settle it

Running ReGA on a fresh collection of jailbreaking prompts and finding its prompt-level AUROC drops below 0.9 or that it misses many harmful cases on a different LLM family would show the abstraction lost critical safety details.
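The check described above reduces to computing AUROC on fresh scores and comparing it with the 0.9 bar. A minimal sketch, assuming the monitor emits a higher score for more harmful inputs and that labels are 1 for harmful and 0 for safe; the function names and the toy scores are hypothetical and are not data from the paper.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random harmful item outscores
    a random safe item, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one harmful and one safe example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def passes_check(scores, labels, threshold=0.9):
    """True if the measured AUROC stays at or above the threshold."""
    return auroc(scores, labels) >= threshold

# Hypothetical monitor scores on five fresh prompts (1 = harmful, 0 = safe).
scores = [0.95, 0.80, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0]
print(auroc(scores, labels), passes_check(scores, labels))
```

The pairwise form is quadratic in the number of examples, which is fine for a spot check; a rank-based implementation would be the usual choice for large evaluation sets.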

Figures

Figures reproduced from arXiv: 2506.01770 by Chengcan Wu, Meng Sun, Zeming Wei.

**Figure 1.** The outline of ReGA. LLMs for constructing the abstract model to overcome the scalability issues for safeguarding LLMs, which is a set of directions in hidden states that indicate specific concepts. In the context of LLMs, the term representation is fundamentally different from features, where the latter usually indicates the overall hidden states. In particular, the safety-critical representations [34, 35…

**Figure 2.** Safety score distributions rated by ReGA with different LLMs. The first row is for the prompt inputs, and the second […]

Original abstract

Large Language Models (LLMs) have achieved tremendous success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks of generating harmful content and are vulnerable to jailbreaking attacks, creating unaddressed security issues regarding their deployments. In the context of software engineering for artificial intelligence (SE4AI) techniques, model-based analysis has demonstrated notable potential for analyzing and monitoring machine learning models, particularly in stateful deep neural networks. However, it suffers from scalability issues when extended to LLMs due to their vast feature spaces. In this paper, we aim to address the scalability issue of model-based analysis techniques for safeguarding LLM-scale models. Motivated by the recent discovery of low-dimensional safety-critical representations that emerged in LLMs, we propose ReGA, a model-based analysis framework with Representation-Guided Abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are key directions in hidden states that indicate safety-related concepts, ReGA effectively narrows the scalability gap when developing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ReGA uses safety directions from linear probes to abstract LLM hidden states for scalable model-based safety checks, delivering strong AUROC numbers on held-out data but depending heavily on how those directions are chosen.

The main point is that ReGA takes the recent idea of low-dimensional safety representations in LLMs and uses them to shrink the state space so model-based analysis can run at LLM scale. They pick layers, run linear probes on hidden states to find the safety directions, reduce the model accordingly, and then monitor for harmful prompts or generations. This is a direct response to the scalability complaint that has kept model-based methods out of the LLM safety conversation so far. The full text spells out the extraction steps and reports held-out results at both prompt and conversation levels, which is more than the abstract alone gave us. The AUROCs of 0.975 and 0.985 plus the robustness and generalization claims are the concrete outputs, and the work does better than some post-hoc filters on interpretability and scalability. That combination of representation engineering and state abstraction is the actual new piece here.

The soft spots are mostly around the representation step itself. How sensitive the whole thing is to layer choice, probe training data, or the exact safety perspectives used is not fully stress-tested in the numbers shown, and false-negative rates on edge cases would matter a lot in deployment. Since the method stays empirical rather than formally verified, the guarantees stop at the tested distributions. No obvious circularity or internal contradiction shows up once you see the procedure.

This paper is aimed at people already working on LLM safety who want something more structured than fine-tuning or simple classifiers. Anyone following representation engineering or trying to bring model-checking ideas into large models would get something usable from it. It has enough concrete method and evaluation to deserve a serious referee rather than a desk reject, even if the reviewers will want tighter ablations on the probe stage.

Referee Report

2 major / 2 minor

Summary. The paper proposes ReGA, a model-based safeguard for LLMs that uses representation-guided abstraction. It identifies low-dimensional safety-critical directions in hidden states via linear probes, constructs a reduced state-space abstract model from those directions, and evaluates the resulting monitor for distinguishing safe versus harmful prompts and conversations. The central empirical claims are AUROC 0.975 (prompt level) and 0.985 (conversation level), plus robustness to real-world attacks and generalization across safety perspectives, while addressing the scalability limitations of prior model-based analysis techniques for LLMs.

Significance. If the reported performance and robustness hold under the held-out evaluations described, the work would demonstrate a practical integration of representation engineering with model-based abstraction that narrows the scalability gap for LLM safety monitoring. The approach offers improved interpretability over black-box safeguards and supplies concrete extraction and abstraction procedures, which are strengths for reproducibility in the SE4AI and AI safety literature.

major comments (2)

[§4.1, Representation Extraction] The linear-probe procedure for identifying safety-critical directions is presented with a concrete layer-selection heuristic and held-out validation, but the manuscript does not report the number of positive/negative examples used to train the probes or any ablation on probe regularization; without these, it is difficult to rule out that the high AUROC partly reflects overfitting to the probe training distribution rather than robust safety concepts.
[Table 3, Conversation-level results] The reported AUROC of 0.985 is given without confidence intervals or a statistical comparison to the strongest baseline; this weakens the claim that ReGA generalizes across safety perspectives and outperforms existing paradigms, especially since the central scalability argument rests on these performance numbers remaining high after abstraction.

minor comments (2)

[§2] The abstract and §2 could more explicitly cite the specific prior works on low-dimensional safety representations that motivate the approach, rather than referring only to 'recent discovery'.
[Figure 2, abstraction diagram] The figure would benefit from an explicit legend indicating which dimensions are retained versus discarded after the representation-guided reduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the empirical rigor of the manuscript. We address each major comment below and will revise the paper accordingly.

Referee: [§4.1, Representation Extraction] The linear-probe procedure for identifying safety-critical directions is presented with a concrete layer-selection heuristic and held-out validation, but the manuscript does not report the number of positive/negative examples used to train the probes or any ablation on probe regularization; without these, it is difficult to rule out that the high AUROC partly reflects overfitting to the probe training distribution rather than robust safety concepts.

Authors: We agree that reporting the training set sizes and regularization ablations would improve transparency and help rule out overfitting concerns. In the revised manuscript we will explicitly state the number of positive and negative examples used to train the probes (drawn from the safety datasets described in §4.1) and add a regularization ablation study showing that the extracted directions remain stable across reasonable hyper-parameter choices. Revision: yes

Referee: [Table 3, Conversation-level results] The reported AUROC of 0.985 is given without confidence intervals or a statistical comparison to the strongest baseline; this weakens the claim that ReGA generalizes across safety perspectives and outperforms existing paradigms, especially since the central scalability argument rests on these performance numbers remaining high after abstraction.

Authors: We acknowledge that confidence intervals and statistical comparisons would make the performance claims more robust. We will add 95% bootstrap confidence intervals to all AUROC entries in Table 3 and include a statistical comparison (DeLong test) against the strongest baseline to support the generalization and outperformance statements. Revision: yes
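For readers unfamiliar with the promised interval, the following is a minimal sketch of a percentile bootstrap for AUROC. It assumes binary labels (1 = harmful, 0 = safe) and scores where higher means more harmful; the function names, the resample count, and the toy data are illustrative assumptions, not the authors' evaluation code. The DeLong test mentioned in the response is a separate procedure and is not covered here.

```python
import random

def auroc(scores, labels):
    """Pairwise AUROC: share of (harmful, safe) pairs ranked correctly, ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one harmful and one safe example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(scores, labels, n_resamples=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    point = auroc(scores, labels)  # also rejects single-class input up front
    rng = random.Random(seed)      # fixed seed keeps the sketch reproducible
    n = len(scores)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lo, hi

# Hypothetical monitor scores; not results from the paper.
scores = [0.95, 0.90, 0.85, 0.60, 0.55, 0.40, 0.30, 0.20, 0.15, 0.10]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(bootstrap_auroc_ci(scores, labels, n_resamples=500))
```

With only a few examples the interval is wide, which is the practical reason a referee asks for it: a point AUROC says little about how stable the ranking is under resampling.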

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper frames ReGA as an empirical method that extracts safety-critical directions via linear probes on hidden states and constructs an abstract state-space model from them, then reports AUROC metrics (0.975 prompt-level, 0.985 conversation-level) on held-out evaluations. These metrics are presented as measured outcomes rather than quantities forced by the extraction procedure itself. No equations or steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the central claim remains independent of the reported numbers and is consistent with external benchmarks for representation engineering and model abstraction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of usable low-dimensional safety-critical representations and on the assumption that abstraction guided by them preserves sufficient safety signal for reliable detection.

axioms (1)

domain assumption: Low-dimensional safety-critical representations exist in LLMs and indicate safety-related concepts.
Explicitly invoked in the abstract as the motivation and enabling factor for the abstraction technique.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "By leveraging safety-critical representations, which are low-dimensional directions emerging in hidden states that indicate safety-related concepts, ReGA effectively addresses the scalability issue when constructing the abstract model for safety modeling."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: "we apply principal component analysis (PCA) reduction to construct the safety representations... K-Means to fit the concrete states"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
cs.SE · 2026-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

Secure LLM Fine-Tuning via Safety-Aware Probing
cs.LG · 2025-05 · unverdicted · novelty 6.0

SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · cited by 2 Pith papers · 18 internal anchors

[1]

GPT-4 Technical Report

OpenAI, “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Pretraining language models with human preferences,

T. Korbak et al. , “Pretraining language models with human preferences,” in ICML, 2023

work page 2023
[3]

Imani, L

S. Imani et al. , “Mathprompter: Mathematical rea- soning using large language models,” arXiv preprint arXiv:2303.05398, 2023

work page arXiv 2023
[4]

Large language models for mathematical reasoning: Progresses and challenges

J. Ahn et al., “Large language models for mathematical reasoning: Progresses and challenges,” arXiv preprint arXiv:2402.00157, 2024

work page arXiv 2024
[5]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

D. Guo et al., “Deepseek-coder: When the large language model meets programming–the rise of code intelligence,” arXiv preprint arXiv:2401.14196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

A performance study of llm- generated code on leetcode,

T. Coignion et al. , “A performance study of llm- generated code on leetcode,” in EASE, 2024

work page 2024
[7]

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

I. Bouzenia et al. , “Repairagent: An autonomous, llm-based agent for program repair,” arXiv preprint arXiv:2403.17134, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Inferfix: End-to-end program repair with llms,

M. Jin et al., “Inferfix: End-to-end program repair with llms,” in FSE, 2023, pp. 1646–1656

work page 2023
[9]

Foundational challenges in assuring alignment and safety of large language models,

U. Anwar et al., “Foundational challenges in assuring alignment and safety of large language models,” Trans- actions on Machine Learning Research, 2024

work page 2024
[10]

The fusion of large language models and formal methods for trustworthy ai agents: A roadmap,

Y. Zhanget al., “The fusion of large language models and formal methods for trustworthy ai agents: A roadmap,” arXiv preprint arXiv:2412.06512, 2024

work page arXiv 2024
[11]

A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,

Y. Yao et al., “A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,” High-Confidence Computing, 2024

work page 2024
[12]

Combating misinformation in the age of llms: Opportunities and challenges,

C. Chen et al., “Combating misinformation in the age of llms: Opportunities and challenges,” AI Magazine, pp. 354–368, 2024

work page 2024
[13]

Harmbench: A standardized evalua- tion framework for automated red teaming and robust refusal,

M. Mazeika et al., “Harmbench: A standardized evalua- tion framework for automated red teaming and robust refusal,” in ICML, 2024

work page 2024
[14]

Decodingtrust: A comprehensive as- sessment of trustworthiness in gpt models

B. Wang et al. , “Decodingtrust: A comprehensive as- sessment of trustworthiness in gpt models.” in NeurIPS, 2023

work page 2023
[15]

TrustLLM: Trustworthiness in Large Language Models

L. Sun et al. , “Trustllm: Trustworthiness in large lan- guage models,” arXiv preprint arXiv:2401.05561, vol. 3, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Multitrust: A comprehensive benchmark towards trustworthy multimodal large language mod- els,

Y. o. Zhang, “Multitrust: A comprehensive benchmark towards trustworthy multimodal large language mod- els,” NeurIPS, 2024

work page 2024
[17]

”do anything now

X. Shen et al. , “”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” in CCS, 2023

work page 2023
[18]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zou et al., “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Jailbroken: How does llm safety training fail?

A. Wei et al., “Jailbroken: How does llm safety training fail?” in NeurIPS, 2023

work page 2023
[20]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

Y. Liu et al., “Jailbreaking chatgpt via prompt engineer- ing: An empirical study,”arXiv preprint arXiv:2305.13860, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Constitutional AI: Harmlessness from AI Feedback

Y. Bai et al., “Constitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Safe rlhf: Safe reinforcement learning from human feedback,

J. Dai et al., “Safe rlhf: Safe reinforcement learning from human feedback,” in ICLR, 2024

work page 2024
[23]

Deepstellar: Model-based quantitative analysis of stateful deep learning systems,

X. Du et al. , “Deepstellar: Model-based quantitative analysis of stateful deep learning systems,” in FSE, 2019

work page 2019
[24]

Rnnrepair: Automatic rnn repair via model- based analysis,

X. Xie et al., “Rnnrepair: Automatic rnn repair via model- based analysis,” in ICML, 2021

work page 2021
[25]

Marble: Model-based robustness analysis of stateful deep learning systems,

X. Du et al., “Marble: Model-based robustness analysis of stateful deep learning systems,” in ASE, 2020

work page 2020
[26]

Deeparc: Modularizing neural networks for the model maintenance,

X. Ren et al., “Deeparc: Modularizing neural networks for the model maintenance,” in ICSE, 2023

work page 2023
[27]

Mosaic: Model-based safety analysis frame- work for ai-enabled cyber-physical systems,

X. Xie et al., “Mosaic: Model-based safety analysis frame- work for ai-enabled cyber-physical systems,” arXiv preprint arXiv:2305.03882, 2023

work page arXiv 2023
[28]

Archrepair: Block-level architecture- oriented repairing for deep neural networks,

H. Qi et al. , “Archrepair: Block-level architecture- oriented repairing for deep neural networks,” ACM Transactions on Software Engineering and Methodology , vol. 32, no. 5, pp. 1–31, 2023

work page 2023
[29]

Weighted automata extraction and ex- planation of recurrent neural networks for natural language tasks,

Z. Wei et al. , “Weighted automata extraction and ex- planation of recurrent neural networks for natural language tasks,” Journal of Logical and Algebraic Methods in Programming, 2024

work page 2024
[30]

A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversar- ial attack and defence, and interpretability,

X. Huang et al., “A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversar- ial attack and defence, and interpretability,” Computer Science Review, vol. 37, p. 100270, 2020

work page 2020
[31]

Software engineering for ai-based systems: a survey,

S. Mart ´ınez-Fern´andez et al. , “Software engineering for ai-based systems: a survey,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 2, pp. 1–59, 2022

work page 2022
[32]

Luna: A model-based universal analysis framework for large language models,

D. Song et al., “Luna: A model-based universal analysis framework for large language models,” IEEE Transac- tions on Software Engineering, 2024

work page 2024
[33]

Representation Engineering: A Top-Down Approach to AI Transparency

A. Zou et al. , “Representation engineering: A top- down approach to ai transparency,” arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

arXiv preprint arXiv:2402.05162 , year=

B. Wei et al., “Assessing the brittleness of safety align- ment via pruning and low-rank modifications,” arXiv preprint arXiv:2402.05162, 2024

work page arXiv 2024
[35]

Zheng, F

C. Zheng et al., “Prompt-driven llm safeguarding via directed representation optimization,” arXiv preprint arXiv:2401.18018, 2024

work page arXiv 2024
[36]

Adversarial representation engineering: A general model editing framework for large language models,

Y. Zhang et al., “Adversarial representation engineering: A general model editing framework for large language models,” arXiv preprint arXiv:2404.13752, 2024

work page arXiv 2024
[37]

Attention is all you need,

A. Vaswani et al., “Attention is all you need,” in NeurIPS, 2017

work page 2017
[38]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo et al. , “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Evaluation of openai o1: Opportunities and challenges of agi,

T. Zhong et al., “Evaluation of openai o1: Opportunities and challenges of agi,” arXiv preprint arXiv:2409.18486, 2024

work page arXiv 2024
[40]

Improving language understanding by generative pre-training,

A. Radford et al., “Improving language understanding by generative pre-training,” 2018

work page 2018
[41]

Reinforcement Learning for LLM Post-Training: A Survey

Z. Wang et al., “A comprehensive survey of llm align- ment techniques: Rlhf, rlaif, ppo, dpo and more,” arXiv preprint arXiv:2407.16216, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Towards the worst-case robustness of large language models,

H. Chen et al., “Towards the worst-case robustness of large language models,” arXiv preprint arXiv:2501.19040, 2025

work page arXiv 2025
[43]

Position: Agent-specific trustworthiness risk as a research priority,

Z. Wei et al., “Position: Agent-specific trustworthiness risk as a research priority,” OpenReview preprint, 2025

work page 2025
[44]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Y. Bai et al., “Training a helpful and harmless assistant 12 with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Secure LLM Fine-Tuning via Safety-Aware Probing

C. Wu et al. , “Mitigating fine-tuning risks in llms via safety-aware probing optimization,” arXiv preprint arXiv:2505.16737, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Safety alignment should be made more than just a few tokens deep,

X. Qi et al. , “Safety alignment should be made more than just a few tokens deep,” in ICLR, 2024

work page 2024
[47]

Understanding pre-training and fine-tuning from loss landscape perspectives,

H. Chen et al., “Understanding pre-training and fine- tuning from loss landscape perspectives,” arXiv preprint arXiv:2505.17646, 2025

work page arXiv 2025
[48]

Mistral 7b,

A. Q. Jiang et al., “Mistral 7b,” 2023

work page 2023
[49]

How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms,

Y. Zeng et al. , “How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms,” in ACL, 2024

work page 2024
[50]

Guard: Role-playing to gener- ate natural-language jailbreakings to test guide- line adherence of large language models.arXiv preprint arXiv:2402.03299, 2024

H. Jin et al., “Guard: Role-playing to generate natural- language jailbreakings to test guideline adherence of large language models,” arXiv preprint arXiv:2402.03299, 2024

work page arXiv 2024
[51]

Detecting Language Model Attacks with Perplexity

G. Alon et al., “Detecting language model attacks with perplexity,” arXiv preprint arXiv:2308.14132, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

N. Jain et al. , “Baseline defenses for adversarial at- tacks against aligned language models,” arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

A theoretical understanding of self- correction through in-context alignment,

Y. Wang et al. , “A theoretical understanding of self- correction through in-context alignment,” in NeurIPS, 2024

work page 2024
[54]

Defending chatgpt against jailbreak attack via self-reminders,

Y. Xie et al., “Defending chatgpt against jailbreak attack via self-reminders,” Nature Machine Intelligence, 2023

work page 2023
[55]

Jailbreak and guard aligned language mod- els with only few in-context demonstrations,

Z. Wei et al. , “Jailbreak and guard aligned language models with only few in-context demonstrations,” arXiv preprint arXiv:2310.06387, 2023

work page arXiv 2023
[56]

OR- Bench: An over-refusal benchmark for large language models

J. Cui et al., “Or-bench: An over-refusal benchmark for large language models,” arXiv preprint arXiv:2405.20947, 2024

work page arXiv 2024
[57]

Scalable defense against in-the-wild jailbreaking attacks with safety context retrieval,

T. Chen et al. , “Scalable defense against in-the-wild jailbreaking attacks with safety context retrieval,” arXiv preprint arXiv:2505.15753, 2025

work page arXiv 2025
[58]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

K. Simonyan et al., “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[59]

Grad-cam: Visual explanations from deep networks via gradient-based localization,

R. R. Selvaraju et al., “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017

work page 2017
[60]

Linguistic regularities in continuous space word representations,

T. Mikolov et al., “Linguistic regularities in continuous space word representations,” in NAACL, 2013

work page 2013
[61]

Interpretable convolutional neural networks,

Q. Zhang et al. , “Interpretable convolutional neural networks,” in CVPR, 2018, pp. 8827–8836

work page 2018
[62]

Does representation matter? exploring intermediate layers in large language models,

O. Skean et al., “Does representation matter? exploring intermediate layers in large language models,” arXiv preprint arXiv:2412.09563, 2024

work page arXiv 2024
[63]

Improving steering vectors by tar- geting sparse autoencoder features,

S. Chalnev et al. , “Improving steering vectors by tar- geting sparse autoencoder features,” arXiv preprint arXiv:2411.02193, 2024

work page arXiv 2024
[64]

Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877,

A. Stolfo et al. , “Improving instruction-following in language models through activation steering,” arXiv preprint arXiv:2410.12877, 2024

work page arXiv 2024
[65]

Advanc- ing llm safe alignment with safety representation ranking

T. Du et al., “Advancing llm safe alignment with safety representation ranking,” arXiv preprint arXiv:2505.15710, 2025

work page arXiv 2025
[66]

The hidden dimensions of llm alignment: A multi-dimensional safety analysis,

W. Pan et al., “The hidden dimensions of llm alignment: A multi-dimensional safety analysis,” arXiv preprint arXiv:2502.09674, 2025

work page arXiv 2025
[67]

Stanford alpaca: An instruction-following llama model,

R. Taoriet al., “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford alpaca, 2023

work page 2023
[68]

Decision-guided weighted automata extraction from recurrent neural networks,

X. Zhang et al., “Decision-guided weighted automata extraction from recurrent neural networks,” in AAAI, 2021

work page 2021
[69]

Class-based n-gram models of natural language,

P . F. Brownet al., “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–480, 1992

[70] M. Phute et al., “Llm self defense: By self examination, llms know they are being tricked,” arXiv preprint arXiv:2308.07308, 2023

[71] L. Zheng et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” in NeurIPS, 2023

[72] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

[73] J. Bai et al., “Qwen technical report,” https://qwenlm.github.io/blog/qwen3/, 2023

[74] X. Geng et al., “Koala: A dialogue model for academic research,” 2023

[75] A. Yang et al., “Baichuan 2: Open large-scale language models,” arXiv preprint arXiv:2309.10305, 2023

[76] P. Chao et al., “Jailbreakbench: An open robustness benchmark for jailbreaking large language models,” arXiv preprint arXiv:2404.01318, 2024

[77] L. Zheng et al., “Lmsys-chat-1m: A large-scale real-world llm conversation dataset,” 2023

[78] M. Samvelyan et al., “Rainbow teaming: Open-ended generation of diverse adversarial prompts,” in NeurIPS, 2024

[79] T. Xie et al., “Sorry-bench: Systematically evaluating large language model safety refusal,” in ICLR, 2025

[80] L. Jiang et al., “Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,” 2024


Showing first 80 references.
