CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods inadequately assess large language models' (LLMs) capabilities in real-world security operations center (SOC) settings, particularly for malware analysis and threat intelligence reasoning, two core defensive tasks. Method: CyberSOCEval, a suite of open-source benchmarks within CyberSecEval 4, evaluates LLMs on these two tasks using scenarios built from real-world malware samples and multi-source threat intelligence. Contribution/Results: Larger, more modern LLMs tend to perform better, consistent with training scaling laws, yet no current model comes close to saturating the benchmarks. Notably, reasoning models that leverage test-time scaling do not achieve the boost seen on math and coding benchmarks, suggesting they have not been trained to reason about cybersecurity analysis and highlighting the need for domain-specific training data and reasoning mechanisms. CyberSOCEval establishes a reproducible, extensible evaluation standard to foster community-driven advancement in AI-powered cyber defense.

📝 Abstract
Today's cyber defenders are overwhelmed by a deluge of security alerts, threat intelligence signals, and shifting business context, creating an urgent need for AI systems to enhance operational security work. While Large Language Models (LLMs) have the potential to automate and scale Security Operations Center (SOC) operations, existing evaluations do not fully assess the scenarios most relevant to real-world defenders. This lack of informed evaluation impacts both AI developers and those applying LLMs to SOC automation. Without clear insight into LLM performance in real-world security scenarios, developers lack a north star for development, and users cannot reliably select the most effective models. Meanwhile, malicious actors are using AI to scale cyber attacks, highlighting the need for open source benchmarks to drive adoption and community-driven improvement among defenders and model developers. To address this, we introduce CyberSOCEval, a new suite of open source benchmarks within CyberSecEval 4. CyberSOCEval includes benchmarks tailored to evaluate LLMs in two tasks: Malware Analysis and Threat Intelligence Reasoning--core defensive domains with inadequate coverage in current benchmarks. Our evaluations show that larger, more modern LLMs tend to perform better, confirming the training scaling laws paradigm. We also find that reasoning models leveraging test time scaling do not achieve the same boost as in coding and math, suggesting these models have not been trained to reason about cybersecurity analysis, and pointing to a key opportunity for improvement. Finally, current LLMs are far from saturating our evaluations, showing that CyberSOCEval presents a significant challenge for AI developers to improve cyber defense capabilities.
Problem

Research questions and friction points this paper is trying to address.

Lack of realistic benchmarks for LLMs in security operations scenarios
Inadequate evaluation of malware analysis and threat intelligence reasoning
Need to measure LLM performance for real-world cyber defense applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CyberSOCEval open source benchmark suite
Evaluates LLMs on malware analysis and threat intelligence
Shows current models underperform in cybersecurity reasoning
Authors

Lauren Deason (Meta)
Adam Bali (Meta)
Ciprian Bejean (CrowdStrike)
Diana Bolocan (CrowdStrike)
James Crnkovich (Meta)
Ioana Croitoru (CrowdStrike)
Krishna Durai (Meta)
Chase Midler (CrowdStrike)
Calin Miron (CrowdStrike)
David Molnar (Meta Platforms)
Brad Moon (CrowdStrike)
Bruno Ostarcevic (CrowdStrike)
Alberto Peltea (CrowdStrike)
Matt Rosenberg (CrowdStrike)
Catalin Sandu (CrowdStrike)
Arthur Saputkin (Meta)
Sagar Shah (Carl H. Lindner College of Business, University of Cincinnati)
Daniel Stan (CrowdStrike)
Ernest Szocs (CrowdStrike)
Shengye Wan (Meta Platforms, Inc.)
Spencer Whitman (Meta)
Sven Krasser (CrowdStrike)
Joshua Saxe (Meta)