PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
Improving question answering model robustness with synthetic adversarial data generation
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
representative citing papers
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
citing papers explorer
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.