AI Summary
Autonomous AI agents remain vulnerable to threats such as prompt injection and memory poisoning, while existing defenses are largely confined to the platform perimeter and leave the agent's own threat judgment untrained. This work proposes ClawdGo, a framework for endogenous security-awareness training that improves an agent's ability to recognize and reason about threats at inference time without modifying the underlying model. ClawdGo makes four contributions: a Three-Layer Domain Taxonomy (TLDT) of 12 trainable dimensions; Autonomous Security Awareness Training (ASAT), a self-play loop in which the agent alternates attacker, defender, and evaluator roles under weakest-first curriculum scheduling; Cross-Session Memory Accumulation (CSMA), a four-layer persistent memory architecture with Axiom Crystallisation Promotion; and the Security Awareness Calibration Problem (SACP), which formalizes the precision-recall tradeoff introduced by endogenous training. In live experiments, weakest-first ASAT raised the average TLDT score from 80.9 to 96.9 over 16 sessions, covering 11 of 12 dimensions. CSMA retained the full gain across sessions, whereas a cold-start ablation recovered only 2.4 points, leaving a 13.6-point gap.
Abstract
Autonomous AI agents deployed on platforms such as OpenClaw face prompt injection, memory poisoning, supply-chain attacks, and social engineering, yet existing defences address only the platform perimeter, leaving the agent's own threat judgement entirely untrained. We present ClawdGo, a framework for endogenous security awareness training: we teach the agent to recognise and reason about threats from the inside, at inference time, with no model modification. Four contributions are introduced: TLDT (Three-Layer Domain Taxonomy) organises 12 trainable dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers; ASAT (Autonomous Security Awareness Training) is a self-play loop where the agent alternates attacker, defender, and evaluator roles under weakest-first curriculum scheduling; CSMA (Cross-Session Memory Accumulation) compounds skill gains via a four-layer persistent memory architecture and Axiom Crystallisation Promotion (ACP); and SACP (Security Awareness Calibration Problem) formalises the precision-recall tradeoff introduced by endogenous training. Live experiments show weakest-first ASAT raises average TLDT score from 80.9 to 96.9 over 16 sessions, outperforming uniform-random scheduling by 6.5 points and covering 11 of 12 dimensions. CSMA retains the full gain across sessions; cold-start ablation recovers only 2.4 points, leaving a 13.6-point gap. E-mode generates 32 TLDT-conformant scenarios covering all 12 dimensions. SACP is observed when a heavily trained agent classifies a legitimate capability assessment as prompt injection (30/160).
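The weakest-first curriculum scheduling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `weakest_first_schedule`, the dimension names, the starting scores, and the fixed-gain `toy_session` are all hypothetical, chosen only to show the selection rule (always train the currently lowest-scoring dimension next).

```python
from typing import Callable, Dict, List


def weakest_first_schedule(
    scores: Dict[str, float],
    train_session: Callable[[str, float], float],
    sessions: int,
) -> List[str]:
    """Run `sessions` rounds, each time training the lowest-scoring dimension.

    `scores` is updated in place; the returned list records which
    dimension was selected in each round.
    """
    order: List[str] = []
    for _ in range(sessions):
        # Pick the dimension with the lowest current score
        # (ties go to the first one in dict order).
        weakest = min(scores, key=scores.get)
        scores[weakest] = train_session(weakest, scores[weakest])
        order.append(weakest)
    return order


def toy_session(dimension: str, score: float) -> float:
    # Hypothetical stand-in for one training session:
    # a fixed gain of 5 points, capped at 100.
    return min(100.0, score + 5.0)


# Hypothetical dimensions and starting scores, for illustration only.
scores = {"dim_a": 78.0, "dim_b": 80.0, "dim_c": 90.0}
order = weakest_first_schedule(scores, toy_session, sessions=4)
print(order)
print(scores)
```

In this toy run the scheduler alternates between the two weakest dimensions and never spends a session on `dim_c`, which already scores highest. A uniform-random scheduler would instead spread sessions across all dimensions regardless of their current scores.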