🤖 AI Summary
To identify high-value sources of unstructured cyber threat intelligence (CTI), such as news articles and blogs, this paper proposes the first active CTI crawler framework based on the multi-armed bandit (MAB) paradigm. The method combines SBERT-based semantic matching, adaptive crawling policies, and an online reward feedback mechanism to expand its seed sources automatically and to discover highly relevant, previously unknown pages and domains without supervision, going beyond conventional extraction from a fixed set of sources. In the reported experiments, the crawler achieves a 25.3% harvest rate, grows its seed set by over 300%, maintains strong topical coherence, and identifies numerous high-quality emerging CTI sources absent from existing CTI ecosystems.
📝 Abstract
Public information contains valuable Cyber Threat Intelligence (CTI) that can be used to prevent future attacks. While standards exist for sharing this information, much of it appears in non-standardized news articles or blogs. Monitoring online sources for threats is time-consuming, and it is uncertain which sources to select. Current research focuses on extracting Indicators of Compromise from known sources and rarely addresses the identification of new sources. This paper proposes a CTI-focused crawler that uses a multi-armed bandit (MAB) and various crawling strategies. It employs SBERT to identify relevant documents while dynamically adapting its crawling path. Our system, ThreatCrawl, achieves a harvest rate exceeding 25% and expands its seed set by over 300% while maintaining topical focus. Additionally, the crawler identifies previously unknown but highly relevant overview pages, datasets, and domains.
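The interplay between the bandit, the relevance score, and the harvest rate can be illustrated with a minimal sketch. It is not ThreatCrawl's implementation: the abstract does not state which bandit algorithm, arms, or reward the system uses, so UCB1 is swapped in here as one common choice, and each arm is assumed to be a crawling strategy. The strategy names and their hit probabilities are hypothetical, the pages are simulated rather than fetched, and a bag-of-words cosine similarity stands in for SBERT embedding similarity so that the example needs only the standard library. The harvest rate is computed as relevant pages divided by fetched pages, the usual definition for focused crawlers.

```python
import math
import random
from collections import Counter


def relevance(text: str, topic: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for SBERT similarity)."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class UCB1:
    """UCB1 bandit; each arm is assumed to be one crawling strategy."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}  # running mean reward per arm

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        total = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda arm: self.values[arm]
            + math.sqrt(2 * math.log(total) / self.counts[arm]),
        )

    def update(self, arm, reward):
        # Incremental update of the arm's mean reward.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(steps=500, threshold=0.3, seed=0):
    """Run a simulated crawl and return the bandit and the harvest rate."""
    rng = random.Random(seed)
    topic = "malware ransomware vulnerability exploit threat"
    # Hypothetical strategies with a made-up chance of reaching an on-topic page.
    strategies = {"follow_in_domain": 0.6, "follow_external": 0.3, "random_jump": 0.05}
    on_topic = "new ransomware exploit targets unpatched vulnerability"
    off_topic = "recipe for a quick weeknight pasta dinner"

    bandit = UCB1(strategies)
    relevant = 0
    for _ in range(steps):
        arm = bandit.select()
        # Simulated fetch: the chosen strategy determines what kind of page comes back.
        page = on_topic if rng.random() < strategies[arm] else off_topic
        score = relevance(page, topic)
        bandit.update(arm, score)  # online reward feedback
        relevant += score >= threshold
    return bandit, relevant / steps
```

Calling `bandit, harvest_rate = simulate()` and inspecting `bandit.counts` shows how the pulls are distributed across the three hypothetical strategies. In a real crawler, the simulated fetch would be replaced by an HTTP request and link extraction, and `relevance` by a comparison of sentence embeddings.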