🤖 AI Summary
This study addresses a gap in AI safety research, which has predominantly emphasized alignment and ex ante prevention while neglecting post-incident response and resilience. It proposes a foundational framework for responding to AI loss-of-control incidents, built on a taxonomy with three dimensions: controllability, intent, and severity. The taxonomy first distinguishes scenarios where regaining control is “extremely costly” from those where it is “impossible”, then divides the manageable (extremely costly) incidents into accidental and adversarial failures. Accidental incidents call for automated circuit-breaker responses, and adversarial incidents call for graduated escalatory measures. By mapping three severity classes onto these scenarios, the framework aims to give policymakers and developers proportionate guidance for incident management.
📝 Abstract
Recent research demonstrating that AI systems can exhibit deception and shutdown resistance suggests that AI loss of control (LOC) is an urgent policy concern, yet current literature focuses almost exclusively on alignment and prevention. To address this gap, this paper introduces a foundational framework and taxonomy for managing catastrophic AI LOC incidents. The taxonomy's first level distinguishes between scenarios where regaining control is 'extremely costly' and those where it is 'impossible'. While impossible scenarios demand immediate resilience investments to fundamentally restrict an AI's attack surface, extremely costly scenarios require active incident management via Containment and Threat Neutralization. The framework further categorizes these manageable events into accidental LOC (requiring automated circuit-breaker responses) and adversarial LOC (requiring graduated escalatory measures). By mapping three severity classes to specific scenario matrices, this paper provides a concrete, proportional guide for managing unprecedented AI risks.
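The branching described in the abstract can be sketched as a small decision function. This is only an illustrative reading of the abstract, not the paper's own scenario matrices: the names `Controllability`, `Intent`, `Incident`, and `response_category` are hypothetical, and the three severity classes are represented as the placeholder integers 1 to 3 because the abstract does not name them or say how they change the response.

```python
from dataclasses import dataclass
from enum import Enum


class Controllability(Enum):
    # First taxonomy level from the abstract.
    EXTREMELY_COSTLY = "extremely costly"
    IMPOSSIBLE = "impossible"


class Intent(Enum):
    # Second level, applied only to manageable (extremely costly) incidents.
    ACCIDENTAL = "accidental"
    ADVERSARIAL = "adversarial"


@dataclass(frozen=True)
class Incident:
    controllability: Controllability
    intent: Intent
    severity_class: int  # Placeholder for the three severity classes (1, 2, or 3).


def response_category(incident: Incident) -> str:
    """Return the response family the abstract associates with an incident.

    Assumption: severity is validated but not used for branching, since the
    abstract does not specify how severity classes map to responses.
    """
    if incident.severity_class not in (1, 2, 3):
        raise ValueError("severity_class must be 1, 2, or 3")
    if incident.controllability is Controllability.IMPOSSIBLE:
        # No active incident management; the abstract points to resilience.
        return "resilience investment"
    if incident.intent is Intent.ACCIDENTAL:
        return "automated circuit-breaker response"
    return "graduated escalatory measures"
```

For example, an `Incident` with `Controllability.EXTREMELY_COSTLY` and `Intent.ADVERSARIAL` maps to graduated escalatory measures, while any incident marked `Controllability.IMPOSSIBLE` maps to resilience investment regardless of intent.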