🤖 AI Summary
This study addresses a critical gap: the lack of systematic evaluation of the safety, resilience, and trustworthiness of large language model (LLM)-driven drone agents operating in adversarial environments over 6G networks. To this end, we propose α³-SecBench, the first comprehensive security evaluation framework, built as a large-scale benchmark spanning seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. The framework incorporates 20,000 validated attack scenarios and combines adversarial task augmentation, cross-layer attack modeling, and automated metric quantification (covering safety detection, resilience degradation, and policy compliance) to evaluate 23 mainstream LLMs on episodes sampled from a corpus of 113,475 missions spanning 175 threat types. Normalized overall scores range from only 12.9% to 57.1%, revealing a significant gap between anomaly detection and security-aware decision-making, and establishing a much-needed basis for trustworthy evaluation of autonomous systems under adversarial conditions.
📝 Abstract
Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents on reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce $\alpha^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $\alpha^{3}$-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $\alpha^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $\alpha^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench
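As a rough illustration of how per-episode scores along the three orthogonal dimensions (security, resilience, trust) could be combined into a normalized overall percentage like those reported above, here is a minimal sketch. The abstract does not specify the actual aggregation or any dimension weighting, and the `EpisodeResult` class and `overall_score` function are hypothetical, not part of the released benchmark:

```python
# Hypothetical scoring sketch -- NOT the official alpha^3-SecBench code.
# Assumes each evaluated episode yields three scores in [0, 1], one per
# dimension, and that the overall score is an unweighted mean reported
# as a percentage.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    security: float    # attack detection / vulnerability attribution
    resilience: float  # safe degradation behavior
    trust: float       # policy-compliant tool usage

def overall_score(results: list[EpisodeResult]) -> float:
    """Mean over dimensions, then over episodes, scaled to [0, 100].

    The real benchmark may weight dimensions, layers, or threat types
    differently; this is the simplest plausible aggregation.
    """
    if not results:
        return 0.0
    per_episode = [(r.security + r.resilience + r.trust) / 3 for r in results]
    return 100.0 * sum(per_episode) / len(per_episode)
```

For example, a model that detects attacks well but mitigates poorly (security 0.9, resilience 0.4, trust 0.3) would land near the middle of the reported 12.9%–57.1% range under this scheme, which matches the paper's observation that detection alone does not yield security-aware decision-making.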