SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) lack a systematic, domain-specific evaluation benchmark for cybersecurity. Method: We introduce SecBench, the first dedicated multi-dimensional benchmark for evaluating LLMs in cybersecurity, covering bilingual (Chinese/English) content, diverse question formats (multiple-choice and short-answer questions), multiple capability levels (knowledge retention and logical reasoning), and a broad range of sub-domains. It comprises 47,910 high-quality, expert-annotated questions. The methodology combines a "human contest + LLM collaboration" annotation paradigm, an LLM-based grading agent for the automatic scoring of short-answer questions, and multi-source data collection with domain-specific labeling. Contribution/Results: A comprehensive evaluation of 13 state-of-the-art LLMs demonstrates SecBench's usability and discriminative power. The full dataset and evaluation toolkit are open-sourced, providing standardized infrastructure for assessing AI capabilities in cybersecurity.

📝 Abstract
Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and HumanEval assess general LLM performance but lack focus on specific expert domains such as cybersecurity. Previous attempts to create cybersecurity datasets have faced limitations, including insufficient data volume and a reliance on multiple-choice questions (MCQs). To address these gaps, we propose SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in the cybersecurity domain. SecBench includes questions in various formats (MCQs and short-answer questions (SAQs)), at different capability levels (Knowledge Retention and Logical Reasoning), in multiple languages (Chinese and English), and across various sub-domains. The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest, resulting in 44,823 MCQs and 3,087 SAQs. In particular, we used powerful yet cost-effective LLMs to (1) label the data and (2) construct a grading agent for the automatic evaluation of SAQs. Benchmarking results on 13 SOTA LLMs demonstrate the usability of SecBench, which is arguably the largest and most comprehensive benchmark dataset for LLMs in cybersecurity. More information about SecBench can be found at our website, and the dataset can be accessed via the artifact link.
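The abstract describes an LLM-based grading agent that automatically scores short-answer questions (SAQs) against reference answers. The following is a minimal, hypothetical sketch of how such a pipeline could be structured; it is not the paper's implementation. The function `call_llm` stands in for a real LLM API call and is stubbed here with a trivial keyword-overlap heuristic purely so the sketch runs end to end; `SAQItem` and `grade_saq` are illustrative names, not from SecBench.

```python
# Hypothetical sketch of an LLM-based SAQ grading agent, in the spirit of
# SecBench's automatic short-answer evaluation. NOT the paper's code.

from dataclasses import dataclass


@dataclass
class SAQItem:
    """One short-answer question with its expert reference answer."""
    question: str
    reference_answer: str


def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call. A real grading agent would send the
    # question, reference answer, and candidate answer to an LLM and ask
    # for a 0-10 score. Here we fake the score from word overlap between
    # reference and candidate, purely so the sketch is runnable offline.
    reference, candidate = prompt.split("|||")
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    overlap = len(ref_words & cand_words) / max(len(ref_words), 1)
    return str(round(overlap * 10))


def grade_saq(item: SAQItem, candidate_answer: str) -> int:
    """Score a candidate answer on a 0-10 scale via the grading agent."""
    prompt = f"{item.reference_answer}|||{candidate_answer}"
    return int(call_llm(prompt))


item = SAQItem(
    question="What does a firewall do?",
    reference_answer="A firewall filters network traffic according to security rules",
)
score = grade_saq(item, "It filters network traffic using security rules")
```

In a real deployment the stub would be replaced by a call to a capable but cost-effective model, matching the abstract's point that automatic grading makes large-scale SAQ evaluation feasible where MCQ-only benchmarks fall short.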
Problem

Research questions and friction points this paper is trying to address.

Cybersecurity
Language Models
Evaluation Dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

SecBench
Cybersecurity Evaluation
Large-scale Model Testing
Pengfei Jing
The Hong Kong Polytechnic University, Tencent Security Keen Lab, China
Mengyun Tang
Tencent Zhuque Lab, China
Xiaorong Shi
Tencent Zhuque Lab, China
Xing Zheng
Ph.D. of University of California, Riverside
Sensor fusion · SLAM · VIO
Sen Nie
Tencent Security Keen Lab, China
Shi Wu
Tencent Security Keen Lab, China
Yong Yang
Tencent Security Platform and Department, China
Xiapu Luo
The Hong Kong Polytechnic University
Mobile Security · Smart Contracts · Network Security · Blockchain · Software Engineering