Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

2025 · cs.CR · arXiv 2512.20677

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories: reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify 47 vulnerabilities, including 21 high-severity failures and 12 previously undocumented attack patterns. Compared with manual red-teaming under matched query budgets, our method achieves a 3.9× higher discovery rate with 89% detection accuracy, demonstrating superior coverage, efficiency, and reproducibility for large-scale robustness evaluation.

representative citing papers

A3M: Adaptive, Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions

cs.CL · 2026-06-27 · unverdicted · novelty 5.0

A3M integrates adaptive DRL, adversarial opponent modeling, and multi-objective rewards to cut regret 30-40% versus baselines while remaining robust to strategy shifts in repeated auctions.

EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control

cs.CL · 2026-06-27 · unverdicted · novelty 4.0

EVLA combines a Unified Co-State Encoder and Electro-aware Structured Reasoning Chain with physics-guided training to produce energy-optimal driving decisions, reporting +5.6% accuracy gains over fine-tuned VLM baselines on a driving QA benchmark.

citing papers explorer

A3M: Adaptive, Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions · cs.CL · 2026-06-27 · unverdicted · none · ref 18 · internal anchor

EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control · cs.CL · 2026-06-27 · unverdicted · none · ref 54 · internal anchor
