pith. machine review for the scientific record.

arxiv: 2604.24020 · v1 · submitted 2026-04-27 · 💻 cs.CR · cs.AI

Recognition: unknown

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Bin Sun, Jian Chang, Jiaqi Li, Lidong Zhai, Yang Yu, Yang Zhao

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:27 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords autonomous AI agents · endogenous training · self-play · security awareness · prompt injection · threat taxonomy · weakest-first scheduling · memory accumulation

The pith

Autonomous AI agents can raise their threat recognition scores from 80.9 to 96.9 by running internal attacker-defender-evaluator self-play loops without any model changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that autonomous AI agents facing attacks such as prompt injection and memory poisoning can build their own security awareness through endogenous training at inference time. The approach uses a self-play loop in which the agent alternates roles under a weakest-first curriculum to target its own gaps across a 12-dimension taxonomy. Experiments demonstrate that this method outperforms uniform random scheduling and retains gains via persistent memory accumulation across sessions. It also surfaces a calibration tradeoff in which heavily trained agents begin to misclassify legitimate queries as threats.

Core claim

ClawdGo establishes that an unmodified AI agent can improve its security judgement by cycling through attacker, defender, and evaluator roles in a self-play loop scheduled by weakest-first curriculum, raising average scores on the Three-Layer Domain Taxonomy from 80.9 to 96.9 over 16 sessions while covering 11 of 12 dimensions, with cross-session memory accumulation preserving the full improvement and a cold-start condition recovering only a small fraction of the gain.

What carries the argument

The ASAT self-play loop with weakest-first curriculum scheduling, which directs the agent to alternate attacker, defender, and evaluator roles to address its current weakest dimensions on the TLDT taxonomy.
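A minimal sketch of that loop, assuming a 0-100 score per dimension and an opaque agent(role, dimension, context) call standing in for the LLM agent; the abstract does not specify how the evaluator role produces a numeric score, so everything below is illustrative rather than the authors' implementation.

    # Minimal sketch of an ASAT-style session; not the authors' code.
    # Assumes 12 TLDT dimensions scored 0-100 and an opaque
    # agent(role, dimension, context) call standing in for the LLM agent.

    def weakest_first(scores: dict[str, float]) -> str:
        """Weakest-first curriculum: train the lowest-scoring dimension next."""
        return min(scores, key=scores.get)

    def asat_session(agent, scores: dict[str, float]) -> str:
        dim = weakest_first(scores)
        attack = agent(role="attacker", dimension=dim, context=None)
        defence = agent(role="defender", dimension=dim, context=attack)
        # The same agent, in evaluator role, scores its own exchange:
        scores[dim] = agent(role="evaluator", dimension=dim, context=(attack, defence))
        return dim

    # e.g. the reported 16 sessions: for _ in range(16): asat_session(agent, scores)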

Load-bearing premise

That the agent's internal self-play reasoning can generate genuine improvements in threat recognition without external ground truth or model changes, and that the 12 TLDT dimensions provide a complete and reliable measure of security awareness.

What would settle it

If a cold-start agent trained only with ASAT and no prior memory sessions closes most of the 13.6-point gap to the 96.9 score observed with full CSMA accumulation, the claimed necessity of cross-session memory for sustained gains would be disproved.
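The arithmetic behind that test, using only figures quoted from the abstract:

    # Numbers from the abstract; the decision rule is the editor's reading.
    baseline, full = 80.9, 96.9
    gain = full - baseline                # 16.0 points with CSMA accumulation
    cold_start_recovery = 2.4             # reported cold-start ablation
    gap = gain - cold_start_recovery      # 13.6 points attributed to memory
    # The necessity claim fails if a memory-free agent closes most of this gap.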

read the original abstract

Autonomous AI agents deployed on platforms such as OpenClaw face prompt injection, memory poisoning, supply-chain attacks, and social engineering, yet existing defences address only the platform perimeter, leaving the agent's own threat judgement entirely untrained. We present ClawdGo, a framework for endogenous security awareness training: we teach the agent to recognise and reason about threats from the inside, at inference time, with no model modification. Four contributions are introduced: TLDT (Three-Layer Domain Taxonomy) organises 12 trainable dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers; ASAT (Autonomous Security Awareness Training) is a self-play loop where the agent alternates attacker, defender, and evaluator roles under weakest-first curriculum scheduling; CSMA (Cross-Session Memory Accumulation) compounds skill gains via a four-layer persistent memory architecture and Axiom Crystallisation Promotion (ACP); and SACP (Security Awareness Calibration Problem) formalises the precision-recall tradeoff introduced by endogenous training. Live experiments show weakest-first ASAT raises average TLDT score from 80.9 to 96.9 over 16 sessions, outperforming uniform-random scheduling by 6.5 points and covering 11 of 12 dimensions. CSMA retains the full gain across sessions; cold-start ablation recovers only 2.4 points, leaving a 13.6-point gap. E-mode generates 32 TLDT-conformant scenarios covering all 12 dimensions. SACP is observed when a heavily trained agent classifies a legitimate capability assessment as prompt injection (30/160).
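For concreteness, the 30/160 figure cast as a calibration check; the confusion-matrix framing is an assumption, since the abstract reports only the false-positive count on legitimate queries.

    # Sketch of the SACP tradeoff implied by the 30/160 observation.
    legit_total, legit_flagged = 160, 30    # benign queries flagged as prompt injection
    false_positive_rate = legit_flagged / legit_total   # 0.1875

    # With true positives tp, false positives fp, and false negatives fn on a
    # mixed benign/threat set, the tradeoff SACP formalises would read:
    def precision(tp, fp): return tp / (tp + fp)
    def recall(tp, fn): return tp / (tp + fn)
    # Endogenous training pushes recall up; per SACP, fp rises with it,
    # dragging precision down.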

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ClawdGo, a framework for endogenous security awareness training of autonomous AI agents without model modification. It defines TLDT (Three-Layer Domain Taxonomy) organizing 12 dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers; ASAT (Autonomous Security Awareness Training) as a self-play loop alternating attacker/defender/evaluator roles under weakest-first curriculum scheduling; CSMA (Cross-Session Memory Accumulation) with four-layer persistent memory and Axiom Crystallisation Promotion; and SACP (Security Awareness Calibration Problem) formalizing precision-recall tradeoffs. Live experiments claim weakest-first ASAT raises average TLDT score from 80.9 to 96.9 over 16 sessions (outperforming uniform-random scheduling by 6.5 points and covering 11 of 12 dimensions), with CSMA retaining gains, cold-start ablation recovering only 2.4 points, E-mode generating 32 TLDT-conformant scenarios, and SACP observed in 30/160 misclassifications of legitimate checks.

Significance. If the empirical claims hold under rigorous validation, the work addresses a genuine gap in AI agent security by enabling internal threat judgment at inference time. The self-play endogenous approach, weakest-first scheduling, and cross-session memory are novel contributions that could influence defenses beyond perimeter controls. Concrete numerical results, ablation comparisons, and formalization of SACP provide a starting point for falsifiable follow-up, though the absence of external benchmarks limits immediate impact.

major comments (2)
  1. Live experiments section: the headline TLDT gains (80.9 to 96.9) and the outperformance claim are reported as direct session measurements, but without any description of the scoring rubric, inter-evaluator agreement, statistical variance, confidence intervals, or full protocol (including how scenarios are generated and scored). This prevents assessment of whether the 6.5-point margin and the 11/12 dimension coverage are reproducible or robust; a sketch of one such missing check follows this list.
  2. ASAT and SACP descriptions: the central claim that self-play produces genuine threat-recognition improvements rests on evaluator scores generated inside the same agent loop (alternating roles), yet the SACP observation of 30/160 legitimate capability checks being flagged as prompt injection already demonstrates imperfect self-calibration; no external oracle, held-out human labels, or independent benchmark is described to rule out the agent learning to satisfy its own evaluator rather than improving real-world detection.
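The kind of check comment 1 asks for, sketched under the assumption that per-session TLDT scores for both schedulers were released as lists of floats; nothing here is from the paper.

    # Hypothetical robustness check: bootstrap CI on the scheduling margin.
    import random
    from statistics import mean

    def bootstrap_margin_ci(wf_scores, ur_scores, n_boot=10_000, alpha=0.05):
        """CI on the mean margin of weakest-first over uniform-random scheduling."""
        diffs = sorted(
            mean(random.choices(wf_scores, k=len(wf_scores)))
            - mean(random.choices(ur_scores, k=len(ur_scores)))
            for _ in range(n_boot)
        )
        return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot)]

    # An interval excluding 0 would support the reported 6.5-point advantage;
    # the manuscript reports no variance at all.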
minor comments (2)
  1. Abstract: acronyms TLDT, ASAT, CSMA, SACP, and ACP are introduced without prior expansion on first use, reducing immediate readability.
  2. E-mode and ACP are referenced as generating scenarios and promoting crystallization but receive no implementation details or ablation isolating their contribution to the reported gains.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our poster manuscript. We address each major comment below and indicate planned revisions to improve clarity and transparency.

read point-by-point responses
  1. Referee: Live experiments section: the headline TLDT gains (80.9 to 96.9) and outperformance claims are reported as direct session measurements but without any description of the scoring rubric, inter-evaluator agreement, statistical variance, confidence intervals, or full protocol (including how scenarios are generated and scored), preventing assessment of whether the 6.5-point margin and 11/12 dimension coverage are reproducible or robust.

    Authors: The poster format constrained space, resulting in an abbreviated Live Experiments section. In revision we will expand it to describe the TLDT-based scoring rubric (binary success/failure per dimension on threat recognition in generated scenarios), inter-evaluator consistency via repeated ASAT role alternations, observed variance across the 16 sessions, and the full protocol for E-mode scenario generation followed by evaluator scoring. This will enable assessment of the reported 6.5-point margin and 11/12 dimension coverage (a sketch of the described aggregation appears after these responses). revision: yes

  2. Referee: ASAT and SACP descriptions: the central claim that self-play produces genuine threat-recognition improvements rests on evaluator scores generated inside the same agent loop (alternating roles), yet the SACP observation of 30/160 legitimate capability checks being flagged as prompt injection already demonstrates imperfect self-calibration; no external oracle, held-out human labels, or independent benchmark is described to rule out the agent learning to satisfy its own evaluator rather than improving real-world detection.

    Authors: ASAT is deliberately endogenous, using self-play to build internal threat judgment at inference time without external data or model changes. SACP is formalized precisely to capture the resulting precision-recall tradeoff, and the 30/160 cases are presented as direct evidence of imperfect calibration. The cold-start ablation (only 2.4 points recovered) and CSMA retention of gains indicate that improvements arise from axiom accumulation rather than evaluator gaming (a speculative sketch of that mechanism appears after the standing objections below). We will add explicit discussion of this limitation and future external-validation plans. revision: partial
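The rubric described in response 1, rendered as a sketch; the binary aggregation is taken from the response, while the field names and the flat mean across dimensions are assumptions.

    # Sketch of the stated rubric: binary success/failure per dimension,
    # aggregated to a 0-100 average TLDT score. Not the authors' code.

    def tldt_score(outcomes: dict[str, list[bool]]) -> float:
        """Per-dimension pass rate on a 0-100 scale, then a plain mean
        over the 12 dimensions."""
        per_dim = [100 * sum(trials) / len(trials) for trials in outcomes.values()]
        return sum(per_dim) / len(per_dim)

    # e.g. tldt_score({"prompt_injection": [True, True, False],
    #                  "memory_poisoning": [True], ...})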

standing simulated objections not resolved
  • Absence of external oracles, held-out human labels, or independent benchmarks to validate that self-play gains reflect genuine real-world detection rather than internal loop calibration artifacts.
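And the axiom-accumulation mechanism from response 2, sketched speculatively: the four layer names and the promotion threshold are invented here, since neither the abstract nor the rebuttal specifies them.

    # Speculative sketch of CSMA-style memory with an ACP promotion rule.
    from collections import Counter

    class CsmaMemory:
        def __init__(self, promote_after: int = 3):
            self.layers = {"session": [], "episodic": [], "semantic": [], "axioms": []}
            self.sightings = Counter()
            self.promote_after = promote_after

        def record(self, lesson: str) -> None:
            self.layers["session"].append(lesson)
            self.sightings[lesson] += 1
            # ACP: a lesson confirmed often enough crystallises into an axiom.
            if (self.sightings[lesson] >= self.promote_after
                    and lesson not in self.layers["axioms"]):
                self.layers["axioms"].append(lesson)

        def end_session(self) -> None:
            """Cross-session persistence: only the session scratchpad is cleared.
            A cold start would instead drop every layer, including axioms."""
            self.layers["session"] = []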

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports empirical results from live experiments measuring TLDT score improvements via the ASAT self-play loop. No mathematical derivations, equations, fitted parameters presented as predictions, or self-referential definitions appear in the abstract or description. The central claims rest on observed session outcomes rather than any reduction to inputs by construction. Self-play roles are a methodological choice, but score gains are not tautological or forced; no self-citation chains or uniqueness theorems load-bear the results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

The central claim rests on several newly introduced constructs and domain assumptions about agent reasoning capabilities; no free parameters are fitted in the reported results, but multiple invented entities and unproven assumptions about self-play effectiveness are required. An illustrative shape of the TLDT taxonomy follows this ledger.

axioms (2)
  • domain assumption An agent can alternate attacker, defender, and evaluator roles in self-play at inference time without model modification and produce useful security reasoning.
    Core premise of the ASAT loop described in the abstract.
  • domain assumption The Three-Layer Domain Taxonomy (TLDT) with its 12 dimensions comprehensively captures relevant security threats for autonomous agents.
    Used to define trainable dimensions and measure progress.
invented entities (4)
  • TLDT (Three-Layer Domain Taxonomy) · no independent evidence
    purpose: Organizes 12 trainable security awareness dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers.
    Newly proposed taxonomy with no independent evidence cited.
  • ASAT (Autonomous Security Awareness Training) · no independent evidence
    purpose: Self-play loop with weakest-first curriculum scheduling for endogenous training.
    New training procedure introduced in the paper.
  • CSMA (Cross-Session Memory Accumulation) · no independent evidence
    purpose: Four-layer persistent memory architecture with Axiom Crystallisation Promotion to retain skill gains.
    New memory mechanism for compounding training effects.
  • SACP (Security Awareness Calibration Problem) · no independent evidence
    purpose: Formalizes the precision-recall tradeoff arising from endogenous training.
    New formal problem statement.
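As promised above, an illustrative shape for TLDT: the three layer names come from the abstract, but the paper's 12 dimension names are not enumerated here, so most entries below are placeholders guessed from the keywords.

    # Illustrative only: layer names from the abstract, dimensions guessed
    # from the keywords or left as placeholders.
    TLDT = {
        "Self-Defence":        ["prompt_injection", "memory_poisoning", "dim_3", "dim_4"],
        "Owner-Protection":    ["social_engineering", "dim_6", "dim_7", "dim_8"],
        "Enterprise-Security": ["supply_chain", "dim_10", "dim_11", "dim_12"],
    }
    assert sum(len(dims) for dims in TLDT.values()) == 12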

pith-pipeline@v0.9.0 · 5597 in / 1752 out tokens · 40921 ms · 2026-05-08T03:27:02.030370+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages

  1. Wikipedia Contributors, “OpenClaw — Wikipedia, The Free Encyclopedia,” https://en.wikipedia.org/wiki/OpenClaw, 2026, accessed March 2026.

  2. Bitdefender Labs, “135K OpenClaw AI agents exposed online,” https://www.bitdefender.com/en-us/blog/hotforsecurity/135k-openclaw-ai-agents-exposed-online, 2026, accessed March 2026.

  3. Snyk Security Research, “ToxicSkills: Malicious AI agent skills found in ClawHub,” https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/, 2026, accessed April 2026.

  4. SecurityScorecard Research, “Beyond the hype: Moltbot’s real risk is exposed infrastructure, not AI superintelligence,” https://securityscorecard.com/blog/beyond-the-hype-moltbots-real-risk-is-exposed-infrastructure-not-ai-superintelligence/, February 2026, accessed March 2026.

  5. MITRE Corporation, “CVE-2026-25253: One-click remote code execution in OpenClaw,” https://www.cve.org/CVERecord?id=CVE-2026-25253, 2026. CVSS 8.8; fixed in OpenClaw v2026.1.29.

  6. MITRE Corporation, “CVE-2026-32922: Privilege escalation to remote code execution in OpenClaw,” https://www.cve.org/CVERecord?id=CVE-2026-32922, 2026, accessed April 2026.

  7. OWASP Foundation, “OWASP top 10 for LLM applications and agentic AI,” https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2024, accessed 2026.

  8. MITRE Corporation, “MITRE ATLAS: Adversarial threat landscape for AI systems,” https://atlas.mitre.org/, 2024, accessed 2026.

  9. Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th International Conference on Machine Learning (ICML), 2009, pp. 41–48.

  10. A. Zhou et al., “ARLAS: Adversarial reinforcement learning for LLM agent safety,” 2025, arXiv:2510.05442.

  11. M. Liu et al., “Self-RedTeam: Online self-play reinforcement learning for safer LLMs,” 2025, arXiv:2506.07468.

  12. L. R. Squire, “Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans,” Psychological Review, vol. 99, no. 2, pp. 195–231, 1992.

  13. D. Campbell et al., “Defensive refusal bias: How safety alignment fails cyber defenders,” March 2026, arXiv:2603.01246.