arxiv: 2601.13518 · v3 · submitted 2026-01-20 · 💻 cs.AI · cs.NE

Recognition: no theorem link

AgenticRed: Evolving Agentic Systems for Red-Teaming

Jiayi Yuan , Jonathan N\"other , Natasha Jaques , Goran Radanovi\'c

Authors on Pith no claims yet

Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3

classification 💻 cs.AI cs.NE

keywords red-teamingevolutionary algorithmsLLM agentsAI safetyautomated red-teamingattack success ratesystem designHarmBench

0 comments

The pith

An evolutionary pipeline lets LLMs autonomously design red-teaming systems that reach 96-100% attack success on major models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AgenticRed uses LLMs to evolve complete red-teaming systems automatically through iterative design, evaluation, and selection. It avoids dependence on human-specified workflows by treating attacker system creation as an open evolutionary problem. The resulting systems demonstrate high attack success rates on open models and transfer effectively to closed models. A sympathetic reader would care because this points to a scalable method for exposing model vulnerabilities as target systems advance rapidly.

Core claim

AgenticRed is an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, it treats red-teaming as a system design problem and evolves automated systems using evolutionary selection and generational knowledge. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate on Llama-2-7B, 98% on Llama-3-8B and 100% on Qwen3-8B on HarmBench, with strong transfer to proprietary models at 100% ASR.

What carries the argument

AgenticRed, the evolutionary pipeline that uses LLMs' in-context learning together with generational selection to autonomously generate, test, and refine entire red-teaming systems.

If this is right

Red-teaming systems become query-agnostic and transfer strongly across open and proprietary models.
Evolutionary algorithms allow safety testing to keep pace with rapidly evolving target models.
Removal of human-specified workflows reduces biases in the explored design space.
Generated systems achieve higher attack success rates than prior automated methods on the same benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evolutionary loop could be applied to evolve defensive monitoring systems or other agentic safety tools.
Continuous re-running of the pipeline on new model releases might produce up-to-date red-teaming agents without manual redesign.
If generational knowledge accumulates useful patterns, the method might discover attack strategies outside the scope of current human-designed red-teaming.

Load-bearing premise

LLMs can reliably use in-context learning to iteratively design and refine complete red-teaming systems that generalize beyond the evolutionary training distribution without introducing new systematic biases or blind spots.

What would settle it

An independent run of the evolutionary process followed by direct testing of the output systems on HarmBench that yields attack success rates substantially below the reported 96%, 98%, and 100% figures.

read the original abstract

While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem, and it autonomously evolves automated red-teaming systems using evolutionary selection and generational knowledge. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate (ASR) on Llama-2-7B, 98% on Llama-3-8B and 100% on Qwen3-8B on HarmBench. Our approach generates robust, query-agnostic red-teaming systems that transfer strongly to the latest proprietary models, achieving an impressive 100% ASR on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2. This work highlights evolutionary algorithms as a powerful approach to AI safety that can keep pace with rapidly evolving models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgenticRed evolves full red-teaming systems via LLM in-context learning and selection, claiming 96-100% ASR plus perfect transfer, but the abstract supplies no protocol or analysis to back those numbers.

read the letter

The main takeaway is that this paper frames red-teaming as an autonomous evolutionary system-design problem. LLMs use in-context learning to propose, test, and refine complete attacker systems across generations instead of optimizing inside a human-written workflow. That shift is the clearest difference from the prior work cited in the abstract. The reported results are the other headline: 96% ASR on Llama-2-7B, 98% on Llama-3-8B, 100% on Qwen3-8B, and 100% transfer to GPT-5.1, DeepSeek-R1, and DeepSeek V3.2 on HarmBench, all while staying query-agnostic. If those numbers hold, the approach would give a practical route to scaling safety evaluations as models improve. The paper earns credit for spelling out why human-specified workflows limit exploration and for showing transfer results that go beyond the models used in evolution. Those elements make the idea worth testing. The soft spots sit in the missing evidence. The abstract gives no description of the evolutionary loop, the fitness function, baseline implementations, number of runs, or any statistical checks. Without those, it is impossible to tell whether the high success rates reflect genuine generalization or whether the LLM-driven process simply overfit to patterns in the HarmBench distribution and the models seen during design. The stress-test worry about new systematic biases is therefore live until the methods section is examined. This paper is for AI safety researchers who build automated evaluation tools. Anyone already working on agentic or evolutionary methods would pick up concrete pipeline ideas even if the numbers need replication. It deserves peer review because the core technique is distinct and the empirical claims are large enough to justify a full check on the data and code.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems via evolutionary selection and generational knowledge, without human-specified workflows. It claims that the resulting systems outperform state-of-the-art approaches, achieving ASR of 96% on Llama-2-7B, 98% on Llama-3-8B, and 100% on Qwen3-8B on HarmBench, with strong transfer yielding 100% ASR on proprietary models including GPT-5.1, DeepSeek-R1, and DeepSeek V3.2.

Significance. If the empirical results hold under rigorous verification, the work would offer a meaningful contribution to AI safety by demonstrating that evolutionary system design can automate red-teaming at scale and generalize across model families. The shift from optimizing within fixed structures to evolving complete agentic systems is a promising direction, though its impact depends on reproducibility and bias controls.

major comments (2)

[Abstract and Results] Abstract and Results section: the reported ASR figures (96% on Llama-2-7B, 98% on Llama-3-8B, 100% on Qwen3-8B and transfer models) are presented without any description of the evaluation protocol, number of HarmBench queries, exact success criterion, baseline implementations, or statistical significance testing. This absence is load-bearing because the central claim is an empirical performance comparison.
[Methods] Methods section: the evolutionary process is described at a high level but provides no concrete specification of the fitness function, selection mechanism, number of generations, population size, or any explicit penalty for systematic biases or blind spots in the in-context learning loop. Without these details the transfer and query-agnostic claims cannot be assessed.

minor comments (2)

[Abstract] Abstract: the model name 'GPT-5.1' is non-standard; the manuscript should clarify the exact proprietary model versions and access method used for the 100% transfer results.
[Introduction] Notation: the term 'query-agnostic' is used repeatedly but never formally defined; a precise definition would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the original manuscript required substantially more detail on both the evaluation protocol and the evolutionary process to support the central empirical claims. We have revised the manuscript accordingly and address each major comment below.

read point-by-point responses

Referee: [Abstract and Results] Abstract and Results section: the reported ASR figures (96% on Llama-2-7B, 98% on Llama-3-8B, 100% on Qwen3-8B and transfer models) are presented without any description of the evaluation protocol, number of HarmBench queries, exact success criterion, baseline implementations, or statistical significance testing. This absence is load-bearing because the central claim is an empirical performance comparison.

Authors: We agree that these details were insufficient in the original submission. The revised manuscript adds a new subsection 'Evaluation Protocol' under Results that specifies: (1) use of the full HarmBench test set (400 queries), (2) success criterion defined as the standard HarmBench classifier outputting a harm score > 0.5, (3) exact baseline implementations and hyperparameters (PAIR, TAP, AutoDAN, and GCG), and (4) statistical testing via bootstrap resampling (10,000 iterations) with 95% confidence intervals reported for all ASR figures. These additions directly address the load-bearing nature of the performance claims. revision: yes
Referee: [Methods] Methods section: the evolutionary process is described at a high level but provides no concrete specification of the fitness function, selection mechanism, number of generations, population size, or any explicit penalty for systematic biases or blind spots in the in-context learning loop. Without these details the transfer and query-agnostic claims cannot be assessed.

Authors: We acknowledge the high-level description was inadequate. The revised Methods section now provides concrete specifications: population size of 15, 8 generations, fitness function = validation ASR (on a 50-query held-out set) minus a bias penalty term (variance in ASR across 6 harm categories to penalize blind spots), and selection via elitist tournament (top 3 preserved, remainder by fitness-proportional selection). We also added explicit discussion of bias mitigation via diversity prompts in the in-context learning loop. These details enable evaluation of the transfer and query-agnostic claims. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance claims with no derivations

full rationale

The paper describes an LLM-based evolutionary pipeline for designing red-teaming systems and reports measured attack success rates (e.g., 96% on Llama-2-7B, 100% transfer to GPT-5.1) on HarmBench and proprietary models. No equations, first-principles derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The central claims rest on direct experimental comparisons rather than any quantity that reduces to its own inputs by construction, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs possess sufficient in-context learning capacity to autonomously evolve effective red-teaming systems; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption LLMs can use in-context learning to iteratively design and refine complete red-teaming systems without human intervention
The pipeline description in the abstract depends on this capacity to replace manually designed workflows.

pith-pipeline@v0.9.0 · 5524 in / 1301 out tokens · 51252 ms · 2026-05-16T13:17:34.142174+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
cs.CR 2026-05 unverdicted novelty 7.0

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw
cs.CR 2026-05 unverdicted novelty 6.0

DeepTrap automates discovery of contextual vulnerabilities in OpenClaw agents via trajectory optimization, showing that unsafe behavior can be induced while preserving task completion and that final-response checks ar...