🤖 AI Summary
This study addresses a critical gap: the lack of systematic evaluation of the safety, resilience, and trustworthiness of large language model (LLM)-driven drone agents operating in adversarial environments over 6G networks. To this end, we propose α³-SecBench, the first comprehensive security evaluation framework, built as a large-scale benchmark spanning seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. The framework incorporates 20,000 validated attack scenarios and combines adversarial task augmentation, cross-layer attack modeling, and automated metric quantification (covering safety detection, resilience degradation, and policy compliance) to evaluate 23 mainstream LLMs on episodes sampled from a corpus of 113,475 missions spanning 175 threat types. Normalized overall scores range from only 12.9% to 57.1%, revealing a significant gap between anomaly detection and security-aware decision-making, and establishing a much-needed basis for trustworthy evaluation of autonomous systems under adversarial conditions.
📝 Abstract
Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents on reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce $\alpha^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $\alpha^{3}$-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $\alpha^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $\alpha^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench
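As a rough illustration of how per-episode scores along the three orthogonal dimensions (security, resilience, trust) could be combined into a normalized overall percentage like those reported above, here is a minimal sketch. The abstract does not specify the actual aggregation or any dimension weighting, and the `EpisodeResult` class and `overall_score` function are hypothetical, not part of the released benchmark:

```python
# Hypothetical scoring sketch -- NOT the official alpha^3-SecBench code.
# Assumes each evaluated episode yields three scores in [0, 1], one per
# dimension, and that the overall score is an unweighted mean reported
# as a percentage.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    security: float    # attack detection / vulnerability attribution
    resilience: float  # safe degradation behavior
    trust: float       # policy-compliant tool usage

def overall_score(results: list[EpisodeResult]) -> float:
    """Mean over dimensions, then over episodes, scaled to [0, 100].

    The real benchmark may weight dimensions, layers, or threat types
    differently; this is the simplest plausible aggregation.
    """
    if not results:
        return 0.0
    per_episode = [(r.security + r.resilience + r.trust) / 3 for r in results]
    return 100.0 * sum(per_episode) / len(per_episode)
```

For example, a model that detects attacks well but mitigates poorly (security 0.9, resilience 0.4, trust 0.3) would land near the middle of the reported 12.9%–57.1% range under this scheme, which matches the paper's observation that detection alone does not yield security-aware decision-making.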