🤖 AI Summary
This paper addresses the inconsistent naming of threat actors (TAs) by cyber threat intelligence (CTI) vendors, which impedes report integration and cross-source correlation. It proposes HiP, a methodology for normalizing, integrating, and clustering TA names that presumably refer to the same entity. HiP is applied to a dataset drawn from 15 sources, comprising 13,371 CTI reports, 17 vendor taxonomies, 3,287 TA names, and 8 mappings between them, from which a TA name graph is built. Analysis of the graph shows that aliases concentrate on a relatively small subset of TAs, traces how this concentration has evolved over the years, and examines factors that could explain name proliferation. The paper also reports mapping errors and methodological pitfalls that inflate certain name clusters, including temporary names for activity clusters, shared tools and infrastructure, and overlapping operations. It concludes that adopting a unified TA naming standard is fundamentally hampered by the need to share highly sensitive telemetry that is private to each CTI vendor.
📝 Abstract
This paper studies the problem of inconsistent Threat Actor (TA) naming conventions across leading Cyber Threat Intelligence (CTI) vendors. The current decentralized and proprietary nomenclature creates confusion and significant obstacles for researchers, including difficulties in integrating and correlating disparate CTI reports and TA profiles. This paper introduces HiP (Hesperus is Phosphorus, a reference to the classic question about the Morning and the Evening Star), a methodology for normalizing, integrating, and clustering TA names presumably corresponding to the same entity. Using HiP, we analyze a large dataset collected from 15 sources and spanning 13,371 CTI reports, 17 vendor taxonomies, 3,287 TA names, and 8 mappings between them. Our analysis of the resulting name graph provides insights into key features of the problem, such as the concentration of aliases on a relatively small subset of TAs, the evolution of this phenomenon over the years, and the factors that could explain TA name proliferation. We also report errors in the mappings and methodological pitfalls that contribute to making certain TA name clusters larger than they should be, including the use of temporary names for activity clusters, the existence of common tools and infrastructure, and overlapping operations. We conclude with a discussion of the inherent difficulties of adopting a TA naming standard, a quest fundamentally hampered by the need to share highly sensitive telemetry that is private to each CTI vendor.
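The idea of clustering names over a name graph can be illustrated with a minimal sketch. This is not the HiP implementation, which the abstract does not detail. It assumes that alias claims arrive as pairs of names, that normalization is a simple lowercase-and-strip-separators rule, and that a cluster is a connected component of the resulting graph. All actor names, and the helpers `normalize` and `cluster_names`, are hypothetical.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Assumed normalization: lowercase and drop spaces, underscores, hyphens."""
    return re.sub(r"[\s_\-]+", "", name.strip().lower())


class UnionFind:
    """Disjoint-set structure used to track connected components."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_names(alias_pairs):
    """Group names linked by alias claims; largest clusters come first."""
    uf = UnionFind()
    for a, b in alias_pairs:
        uf.union(normalize(a), normalize(b))
    clusters = defaultdict(set)
    for name in list(uf.parent):
        clusters[uf.find(name)].add(name)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical alias claims from different vendors (invented names).
pairs = [
    ("Red Heron", "GROUP-17"),
    ("group 17", "Crimson_Owl"),
    ("Blue Lynx", "TEMP-0042"),
]
clusters = cluster_names(pairs)

# One extra claim that ties a temporary activity-cluster name to another
# actor is enough to merge two otherwise separate clusters.
merged = cluster_names(pairs + [("TEMP-0042", "Red Heron")])
```

In this sketch, `clusters` holds two groups, while `merged` collapses into a single group of five names. That mirrors, in miniature, how a single questionable link involving a temporary name can make a cluster larger than it should be.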