Adaptive Instruction Composition for Automated LLM Red-Teaming
Pith reviewed 2026-05-09 23:29 UTC · model grok-4.3
The pith
Adaptive Instruction Composition trains a contextual bandit via reinforcement learning to combine crowdsourced texts into instructions that yield more effective and diverse LLM jailbreaks than random selection or prior adaptive methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adaptive Instruction Composition uses a lightweight neural contextual bandit, pretrained with a contrastive objective and trained with reinforcement learning signals, to choose how crowdsourced harmful queries and tactics are combined into instructions for an attacker LLM. The bandit steers the attacker toward generations that exploit target-specific weaknesses while promoting diversity across attempts; the framework outperforms random composition on effectiveness and diversity metrics, even under model transfer, and surpasses other recent adaptive red-teaming approaches on Harmbench.
What carries the argument
A lightweight neural contextual bandit, pretrained with a contrastive objective, that uses reinforcement learning to balance exploration and exploitation when selecting combinations from the combinatorial space of instructions.
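To make the load-bearing mechanism concrete, the sketch below shows how a contextual bandit of this general kind operates: a value head scores candidate instruction embeddings and selection trades exploration against exploitation. The epsilon-greedy rule, the linear value head, and the toy environment are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class EpsilonGreedyBandit:
    """Minimal contextual bandit: a linear value head scores candidate
    instruction embeddings; selection is greedy with epsilon exploration."""

    def __init__(self, dim, epsilon=0.1, lr=0.05):
        self.w = np.zeros(dim)  # value-head weights over embedding features
        self.epsilon = epsilon  # exploration rate
        self.lr = lr            # SGD step size

    def select(self, embeddings):
        """embeddings: (n_candidates, dim) array, one row per candidate."""
        if rng.random() < self.epsilon:
            return int(rng.integers(len(embeddings)))   # explore
        return int(np.argmax(embeddings @ self.w))      # exploit

    def update(self, embedding, reward):
        """One SGD step on squared error between prediction and reward."""
        error = reward - embedding @ self.w
        self.w += self.lr * error * embedding

# Toy environment: reward depends only on the first embedding feature,
# standing in for an attack-effectiveness signal.
bandit = EpsilonGreedyBandit(dim=4)
for _ in range(500):
    candidates = rng.normal(size=(8, 4))
    i = bandit.select(candidates)
    reward = candidates[i, 0] + 0.1 * rng.normal()
    bandit.update(candidates[i], reward)
```

After training, the value head concentrates weight on the informative feature, which is the sense in which the bandit "exploits" structure in the embedding space rather than searching combinations blindly.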
If this is right
- Red-teaming can generate a broader set of successful attacks that better reveal model weaknesses.
- The learned bandit transfers its performance gains to new target models without retraining.
- Contrastive pretraining enables the bandit to handle the scale of instruction combinations effectively.
- The framework surpasses multiple existing adaptive red-teaming techniques on standard benchmarks.
Where Pith is reading between the lines
- The same adaptive composition mechanism could extend to other prompt-optimization tasks where both quality and variety matter.
- If new crowdsourced data arrives, the bandit might incorporate it with less retraining than approaches that must be retrained from scratch.
- This style of bandit could apply to combinatorial search in related areas like automated testing or content generation.
Load-bearing premise
The lightweight neural contextual bandit with contrastive pretraining can rapidly generalize and scale to the massive combinatorial space of instructions while jointly optimizing effectiveness and diversity via RL.
What would settle it
A head-to-head evaluation in which the adaptive bandit fails to exceed random combination on the same effectiveness and diversity metrics, and on the Harmbench benchmark, would falsify the claimed performance advantage.
Original abstract
Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on Harmbench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.
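The abstract's joint optimization of effectiveness with diversity can be illustrated with a toy composite reward: a success signal scaled up by a novelty bonus measured against an archive of prior successful attacks. The additive form, the novelty term, and the weight `lam` are assumptions for illustration, not the paper's actual reward shaping.

```python
import numpy as np

def unit(v):
    """Normalize to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v)

def composite_reward(embedding, success, archive, lam=0.5):
    """Hypothetical joint reward: attack success scaled by a novelty bonus.

    embedding : unit-norm vector for the new attack attempt
    success   : 1.0 if the target model was jailbroken, else 0.0
    archive   : unit-norm embeddings of prior successful attacks
    lam       : weight on the diversity term (assumed value)
    """
    if not archive:
        novelty = 1.0
    else:
        sims = np.stack(archive) @ embedding  # cosine similarities
        novelty = 1.0 - float(sims.max())     # distance to nearest success
    return success * (1.0 + lam * novelty)

archive = [unit(np.array([1.0, 0.0]))]
redundant = composite_reward(unit(np.array([1.0, 0.1])), 1.0, archive)
novel = composite_reward(unit(np.array([0.0, 1.0])), 1.0, archive)
print(redundant < novel)  # → True: a novel success earns more reward
```

A reward of this shape pushes the policy away from repeatedly rediscovering one jailbreak family, which is the failure mode the abstract attributes to trial-and-error attackers.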
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Adaptive Instruction Composition, a framework that uses reinforcement learning via a lightweight neural contextual bandit with contrastive pretraining to adaptively combine crowdsourced harmful queries and tactics into instructions for attacking LLMs. It claims to jointly optimize for attack effectiveness and diversity, substantially outperforming random composition on effectiveness/diversity metrics (including under model transfer) and surpassing recent adaptive baselines on Harmbench.
Significance. If the results hold with full experimental details, the work could meaningfully advance automated LLM red-teaming by enabling more scalable discovery of diverse jailbreaks through adaptive composition rather than random or purely trial-and-error methods. The contrastive embedding approach for generalization in combinatorial spaces and the provided ablations represent a potentially useful technical contribution to balancing exploration/exploitation in this domain.
major comments (2)
- [Abstract] Abstract: The central claim of substantial outperformance over random combination and prior adaptive methods on effectiveness/diversity metrics and Harmbench lacks specific quantitative results, error bars, statistical significance tests, or full experimental setup details (e.g., number of runs, exact baselines, reward shaping for the joint objective). This prevents verification of the support for the scaling and generalization claims.
- [Methods] Methods (contextual bandit description): The paper does not specify how the action space of instruction compositions is represented or handled (e.g., whether compositions are enumerated from a fixed pool, sampled via a generative head, pruned, or otherwise restricted), which is load-bearing for the claim that contrastive pretraining enables rapid generalization and scaling to the 'massive combinatorial space' without sample-complexity blowup.
minor comments (2)
- [Abstract] Abstract and experiments: Include standard deviations or confidence intervals alongside reported metrics to allow assessment of result stability.
- [Ablations] Ablations: Clarify the exact contrastive pretraining objective and how it interacts with the RL policy to prevent mode collapse in the diversity-effectiveness tradeoff.
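The request for confidence intervals in the first minor comment amounts to a standard computation over independent runs; a minimal sketch, assuming per-run scores are available (the attack-success-rate values below are hypothetical):

```python
import numpy as np

def mean_ci(scores, z=1.96):
    """Mean with a normal-approximation 95% confidence interval
    computed across independent runs."""
    scores = np.asarray(scores, dtype=float)
    m = float(scores.mean())
    half = z * float(scores.std(ddof=1)) / np.sqrt(len(scores))
    return m, (m - half, m + half)

# Hypothetical attack-success rates from five independent runs.
asr_runs = [0.61, 0.58, 0.64, 0.60, 0.57]
mean, (lo, hi) = mean_ci(asr_runs)
print(f"ASR = {mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # → ASR = 0.60 (95% CI 0.58-0.62)
```

With few runs, a bootstrap or a t-distribution critical value would be more defensible than the normal approximation used here.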
Simulated Author's Rebuttal
We are grateful to the referee for their careful reading and valuable suggestions. We respond to each major comment in turn and outline the changes we will implement in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of substantial outperformance over random combination and prior adaptive methods on effectiveness/diversity metrics and Harmbench lacks specific quantitative results, error bars, statistical significance tests, or full experimental setup details (e.g., number of runs, exact baselines, reward shaping for the joint objective). This prevents verification of the support for the scaling and generalization claims.
Authors: We agree that including specific quantitative results and experimental details in the abstract would strengthen the presentation of our central claims. While the body of the manuscript contains tables with performance metrics including means and standard deviations across multiple runs, comparisons to the listed baselines, and a full description of the reward function used to jointly optimize effectiveness and diversity, the abstract itself is currently more qualitative. We will revise the abstract to incorporate key quantitative findings, such as the reported improvements on the metrics and Harmbench, along with a note on the number of runs and the structure of the reward. This will better support the claims regarding scaling and generalization. revision: yes
-
Referee: [Methods] Methods (contextual bandit description): The paper does not specify how the action space of instruction compositions is represented or handled (e.g., whether compositions are enumerated from a fixed pool, sampled via a generative head, pruned, or otherwise restricted), which is load-bearing for the claim that contrastive pretraining enables rapid generalization and scaling to the 'massive combinatorial space' without sample-complexity blowup.
Authors: Thank you for highlighting this important omission. The manuscript describes the use of a combinatorial space but does not provide sufficient detail on its construction and management. In our approach, the action space is constructed by enumerating feasible compositions from a fixed pool of crowdsourced texts, with restrictions on the number of components per instruction and pruning of low-quality or duplicate compositions based on embedding similarity. The neural contextual bandit then uses contrastive embeddings of the full instructions as context to learn a policy that generalizes without needing to explore the entire space. We will expand the Methods section with a clear description of the action space representation, the enumeration and pruning procedures, and how the contrastive pretraining contributes to efficient learning in this space. This will better substantiate the claims about generalization and scaling. revision: yes
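The enumeration-and-pruning scheme the authors describe can be sketched as follows. The pool contents, the cap of two tactics per instruction, and the similarity threshold are illustrative assumptions, not values from the paper.

```python
import itertools
import numpy as np

def enumerate_actions(queries, tactics, max_tactics=2):
    """Enumerate candidate instructions as (query, tactic-subset) pairs,
    capping the number of tactics combined into one instruction."""
    actions = []
    for q in queries:
        for k in range(1, max_tactics + 1):
            for combo in itertools.combinations(tactics, k):
                actions.append((q, combo))
    return actions

def prune_duplicates(actions, embed, threshold=0.95):
    """Greedy dedup: drop any action whose embedding has cosine
    similarity above `threshold` with one already kept."""
    kept, kept_embs = [], []
    for action in actions:
        e = embed(action)
        e = e / np.linalg.norm(e)
        if all(float(e @ k) < threshold for k in kept_embs):
            kept.append(action)
            kept_embs.append(e)
    return kept

queries = ["query-1", "query-2"]
tactics = ["roleplay", "obfuscation", "persona"]
actions = enumerate_actions(queries, tactics)
print(len(actions))  # → 12: 2 queries x (3 single tactics + 3 pairs)
```

Even this toy pool shows why restriction matters: the action count grows with the number of tactic subsets per query, so the cap and the embedding-based pruning jointly keep the space the bandit must explore tractable.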
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Contrastive pretraining enables the network to rapidly generalize and scale to the massive combinatorial space of instructions.