🤖 AI Summary
This work addresses the critical challenges faced by Security Operations Centers (alert overload, heterogeneous data sources, and limited analyst expertise) by systematically evaluating the reliability of large language models (LLMs) in security incident analysis. We introduce SIABENCH, the first LLM-oriented security-analysis benchmark, comprising 25 in-depth analytical scenarios and 135 alert-triage tasks. Accompanying the benchmark is an intelligent agent framework capable of autonomously performing network and memory forensics, malware analysis, phishing detection, log parsing, and false-positive identification. Using this platform, we conduct a comprehensive evaluation of 11 leading open- and closed-source LLMs, establishing a standardized and scalable foundation for assessing LLM deployment in real-world security operations.
📝 Abstract
Security incident analysis (SIA) poses a major challenge for security operations centers, which must cope with overwhelming alert volumes, large and diverse data sources, complex toolchains, and limited analyst expertise. These difficulties intensify because incidents evolve dynamically and demand multi-step, multifaceted reasoning. Although organizations are eager to adopt Large Language Models (LLMs) to support SIA, the absence of rigorous benchmarking leaves no sound basis for assessing their effectiveness or guiding design decisions. Benchmarking is further complicated by: (i) the lack of an LLM-ready dataset covering a wide spectrum of SIA tasks; (ii) the continual emergence of new tasks reflecting the diversity of analyst responsibilities; and (iii) the rapid release of new LLMs that must be incorporated into evaluations. In this paper, we address these challenges by introducing SIABENCH, an agentic evaluation framework for security incident analysis. First, we construct a first-of-its-kind dataset comprising two major SIA task categories: (i) deep analysis workflows for security incidents (25 scenarios) and (ii) alert-triage tasks (135 scenarios). Second, we implement an agent capable of autonomously performing a broad spectrum of SIA tasks, including network and memory forensics, malware analysis across binary/code/PDF formats, phishing email and kit analysis, log analysis, and false-alert detection. Third, we benchmark 11 major LLMs (spanning both open- and closed-weight models) on these tasks, with extensibility to support emerging models and newly added analysis scenarios.
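To make the framework's extensibility concrete, the sketch below shows one way such an evaluation harness could be organized: scenarios and models live in plain lists, so new LLMs or newly added analysis tasks slot in without changing the scoring loop. This is a minimal illustrative sketch only; the names (`Scenario`, `run_agent`, `benchmark`) and the scoring scheme are assumptions, not SIABENCH's actual API.

```python
# Hypothetical sketch of an extensible SIA evaluation harness, in the spirit
# of the SIABENCH design described above. All identifiers are illustrative
# assumptions, not the paper's real interfaces.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Scenario:
    """One benchmark task: a deep-analysis workflow or an alert-triage case."""
    scenario_id: str
    category: str          # e.g. "deep_analysis" or "alert_triage"
    artifacts: List[str]   # e.g. pcap, memory image, binary, email, logs
    expected_verdict: str  # ground-truth label used for scoring


def run_agent(model: str, scenario: Scenario) -> str:
    """Stand-in for the autonomous agent. In a real framework this would
    drive the LLM through tool calls (forensics, malware analysis, ...)."""
    # Stub agent: always flags the alert as a true positive.
    return "true_positive"


def benchmark(models: List[str], scenarios: List[Scenario]) -> Dict[str, float]:
    """Score each model by verdict accuracy. New models or scenarios are
    added by extending the input lists; the loop itself never changes."""
    scores: Dict[str, float] = {}
    for model in models:
        correct = sum(
            run_agent(model, s) == s.expected_verdict for s in scenarios
        )
        scores[model] = correct / len(scenarios)
    return scores


scenarios = [
    Scenario("triage-001", "alert_triage", ["alerts.json"], "true_positive"),
    Scenario("triage-002", "alert_triage", ["alerts.json"], "false_positive"),
]
print(benchmark(["model-a"], scenarios))  # {'model-a': 0.5}
```

In this layout, supporting a newly released LLM or an added analysis scenario reduces to appending an entry to a list, which mirrors the extensibility goal stated in the abstract.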