Large Language Models are Unreliable for Cyber Threat Intelligence

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the reliability of large language models (LLMs) in cybersecurity threat intelligence (CTI) tasks, exposing critical deficiencies—including unstable performance on real-world reports, inconsistent outputs, and pervasive overconfidence. Addressing CTI’s scarcity of labeled data and stringent trustworthiness requirements, we introduce the first multi-paradigm (zero-shot, few-shot, fine-tuning) reliability evaluation framework, jointly quantifying accuracy, output consistency, and confidence calibration. Empirically validated on 350 authentic threat reports across three state-of-the-art LLMs, our analysis reveals that current LLMs fail to meet the robustness and trustworthiness demands of security operations; few-shot prompting and fine-tuning yield only marginal improvements. This study establishes a foundational benchmark and critical cautionary insights for deploying LLMs in high-assurance security applications.

📝 Abstract
Several recent works have argued that Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that, in addition to testing LLMs on CTI tasks under zero-shot learning, few-shot learning, and fine-tuning, also quantifies their consistency and confidence levels. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports, and present new evidence of potential security risks in relying on LLMs for CTI. We show that LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, casting doubt on the feasibility of using LLMs in CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.
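The paper's code is not reproduced here, but the consistency dimension of its evaluation can be illustrated with a minimal sketch: query the model several times on the same report and measure how often the runs agree. The `runs` list below is hypothetical example output, not data from the paper.

```python
from collections import Counter

def consistency(answers):
    """Fraction of runs that agree with the majority answer.

    1.0 means fully consistent output; values near 1/len(answers)
    mean every run gave a different answer.
    """
    counts = Counter(answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(answers)

# Hypothetical outputs from 5 repeated queries on the same CTI report
runs = ["APT29", "APT29", "APT28", "APT29", "APT29"]
print(consistency(runs))  # 0.8
```

A consistency score like this, averaged over a corpus of reports, is one simple way to quantify the output instability the authors report.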
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' reliability for Cyber Threat Intelligence tasks
Assessing consistency and confidence of LLMs in CTI automation
Identifying security risks in using LLMs for real-size CTI reports
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs on CTI tasks using multiple learning methods
Quantifies LLM consistency and confidence levels
Tests three state-of-the-art LLMs with real-world reports
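The overconfidence finding can be made concrete with a standard calibration metric such as Expected Calibration Error (ECE), which compares a model's stated confidence against its actual accuracy. The sketch below is an illustrative implementation, not the paper's own scoring code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: confidence-accuracy gap, weighted by bin size.

    A large ECE where average confidence exceeds accuracy
    indicates the overconfidence described in the paper.
    """
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bin b holds confidences in (lo, hi]; bin 0 also takes 0.0.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical example: the model claims 90% confidence on four
# answers but only one is correct -> large calibration gap.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))
```

For a well-calibrated model, answers given with 90% confidence should be correct about 90% of the time; the gap measured here is what the paper's confidence-calibration analysis quantifies.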