Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Jifan Chen; Sam Mayers; Sanjay Krishna Gouda; Shoumik Saha; Varun Kumar; Zijian Wang

arxiv: 2510.01359 · v2 · pith:QUVXACU5new · submitted 2025-10-01 · 💻 cs.CR · cs.AI


Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar


keywords: attack, code, agents, compliance, across, agent, attacks, deployable



Code-capable large language model (LLM) agents are embedded in software engineering workflows where they can read, write, and execute code, raising "jailbreak" stakes beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents compile and run malicious programs. We present JAWS-Bench (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes mirroring attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, to measure deployable harm. Across seven LLM backends from five families, prompt-only attacks in JAWS-0 achieve 61% compliance; 58% are harmful, 52% parse, and 27% run end-to-end. In JAWS-1, compliance reaches ~100% for stronger models, with a mean attack success rate (ASR) of ~71%; JAWS-M raises mean ASR to ~75%, with 32% runnable attack code. Wrapping an LLM in an agent increases ASR by 1.6× by overturning initial refusals during planning and tool use. Similar trends hold for OpenHands, SWE-Agent, and OpenAI Codex, suggesting that JAWS-Bench is agent-agnostic. Category analyses identify which attack classes are most vulnerable and deployable, motivating execution-aware defenses and refusal-preserving agent designs.
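The four judge stages in the abstract read as a funnel: an output counts at a later stage only if it passed the earlier ones. The sketch below illustrates that kind of tally and is not the authors' implementation. The `Verdict` record, the `parses` helper, and `funnel_rates` are hypothetical names. The sketch also assumes that every rate is a fraction of all prompts, that the generated code is Python (so `ast.parse` can stand in for the syntax check), and that the compliance, harm, and execution verdicts come from external judges and a sandbox.

```python
import ast
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical per-prompt record; the first three fields come from external judges."""
    complied: bool  # the agent did not refuse
    harmful: bool   # the output was judged to realize the attack
    ran: bool       # the code ran end-to-end in a sandbox
    code: str = ""  # generated source, assumed to be Python in this sketch


def parses(code: str) -> bool:
    """Syntax check only: the code is parsed, never executed."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def funnel_rates(verdicts):
    """Return the fraction of all prompts that survive each successive stage."""
    counts = {"compliance": 0, "attack_success": 0, "parses": 0, "runs": 0}
    for v in verdicts:
        if not v.complied:
            continue
        counts["compliance"] += 1
        if not v.harmful:
            continue
        counts["attack_success"] += 1
        if not parses(v.code):
            continue
        counts["parses"] += 1
        if v.ran:
            counts["runs"] += 1
    total = len(verdicts)
    return {k: (c / total if total else 0.0) for k, c in counts.items()}
```

Because each stage is nested inside the previous one, the rates returned by this sketch can only decrease from left to right, which matches the shape of the JAWS-0 figures (61%, 58%, 52%, 27%).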



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
cs.CR 2026-05 conditional novelty 8.0

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and ...

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
cs.CR 2026-05 accept novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, a...

Do Coding Agents Understand Least-Privilege Authorization?
cs.CR 2026-05 unverdicted novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15...

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
cs.CR 2026-05 unverdicted novelty 7.0

MOSAIC-Bench demonstrates that nine production coding agents achieve 53-86% end-to-end attack success rates on staged innocuous tickets across 10 web substrates and 31 CWE classes, far higher than the 0-20.4% rates se...