🤖 AI Summary
A severe scarcity of real-world cryptocurrency address labels critically hinders research on cybercrime detection. To address this, we introduce Real-CATS, the first large-scale, high-fidelity open-source dataset, comprising about 209,000 addresses (roughly 103,000 illicit and 106,000 benign), annotated via on-chain behavioral modeling fused with multi-source authoritative reports, and validated through exchange-user sampling and manual verification. We further propose C3R, a novel evaluation framework that assesses comprehensiveness, classifiability, customizability, and real-world transferability, to bridge interdisciplinary data gaps. Real-CATS is publicly released on GitHub, substantially lowering the entry barrier for AI and statistics researchers. It enables robust development, fair benchmarking, and generalization testing of detection models under realistic operational conditions.
📝 Abstract
Cybercriminals pose a significant threat to blockchain trading security, causing $40.9 billion in losses in 2024. However, the lack of an effective real-world address dataset hinders the advancement of cybercrime detection research. The anti-cybercrime efforts of researchers from broader fields, such as statistics and artificial intelligence, are blocked by data scarcity. In this paper, we present Real-CATS, a Real-world dataset of Cryptocurrency Addresses with Transaction profileS, serving as a practical training ground for developing and assessing detection methods. Real-CATS comprises 103,203 criminal addresses from real-world reports and 106,196 benign addresses from exchange customers. It satisfies the C3R characteristics (Comprehensiveness, Classifiability, Customizability, and Real-world Transferability), which are fundamental for practical detection of cryptocurrency cybercrime. The dataset provides three main functions: 1) effective evaluation of detection methods, 2) support for feature extensions, and 3) a new evaluation scenario for real-world deployment. Real-CATS also offers opportunities to expand cybercrime measurement studies. It is particularly beneficial for researchers without cryptocurrency-related knowledge who wish to engage in this emerging research field. We hope that this versatile data platform will draw an increasing number of cross-disciplinary researchers to the study of cryptocurrency cybercrime detection. All datasets are available at https://github.com/sjdseu/Real-CATS