DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The digital forensics and incident response (DFIR) domain lacks a comprehensive benchmark that jointly evaluates large language models’ (LLMs) theoretical knowledge and practical capabilities, particularly their task understanding when accuracy is near zero. Method: We introduce DFIR-Metric, the first three-component DFIR benchmark, comprising knowledge assessment (700 expert-reviewed multiple-choice questions), CTF-style multi-step reasoning (150 tasks), and disk/memory forensics analysis (500 NIST CFTT cases). We propose the Task Understanding Score (TUS) to evaluate models in scenarios where accuracy alone is uninformative; the benchmark draws on multi-source authoritative question banks, multi-round consistency validation, and fine-grained error attribution. Contribution/Results: Systematic evaluation of 14 mainstream LLMs reveals pervasive hallucination and reasoning fragmentation. All data, evaluation scripts, and results are open-sourced for reproducibility.

📝 Abstract

Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their accuracy and consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to more effectively evaluate models in scenarios where they achieve near-zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at https://github.com/DFIR-Metric.
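The abstract reports accuracy and consistency across trials but does not say how either is computed. The sketch below shows one plausible way to score repeated multiple-choice runs. The function names, the data layout, and the definition of consistency (the share of questions on which every trial gives the same answer) are assumptions for illustration, not the paper's method, and the sketch does not attempt to reproduce TUS, which the abstract leaves undefined.

```python
# Illustrative only: names, data layout, and the consistency definition
# are assumptions, not taken from the DFIR-Metric paper or its scripts.

def mean_accuracy(trials, answer_key):
    """Average, over trials, of the fraction of questions answered correctly."""
    per_trial = [
        sum(1 for qid, gold in answer_key.items() if trial.get(qid) == gold)
        / len(answer_key)
        for trial in trials
    ]
    return sum(per_trial) / len(per_trial)


def consistency(trials, answer_key):
    """Fraction of questions on which all trials agree (right or wrong)."""
    stable = sum(
        1
        for qid in answer_key
        if len({trial.get(qid) for trial in trials}) == 1
    )
    return stable / len(answer_key)


# Hypothetical answer key and three hypothetical trials of one model.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
trials = [
    {"q1": "A", "q2": "C", "q3": "B", "q4": "A"},
    {"q1": "A", "q2": "B", "q3": "B", "q4": "A"},
    {"q1": "A", "q2": "C", "q3": "D", "q4": "A"},
]

print(f"{mean_accuracy(trials, answer_key):.3f}")  # 0.583
print(f"{consistency(trials, answer_key):.3f}")    # 0.500
```

In this toy data the model answers q4 wrongly in the same way on every trial, so q4 counts toward consistency but not accuracy. Reporting the two numbers separately keeps a stable wrong answer distinguishable from an unstable one.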

Problem

Research questions and friction points this paper is trying to address.

No comprehensive benchmark for evaluating LLMs across theoretical and practical DFIR tasks

Need to assess LLM accuracy and consistency in forensic evidence analysis

Risk of LLM errors and hallucinations in high-stakes legal investigations

Innovation

Methods, ideas, or system contributions that make the work stand out.

700 expert-reviewed multiple-choice questions for knowledge assessment

150 CTF-style tasks testing multi-step reasoning and evidence correlation

500 disk and memory forensics cases from NIST CFTT for practical analysis
