Detecting Data Exfiltration through I2P Anonymity Networks: A Two-Phase Machine Learning Approach

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the detection of data exfiltration over the I2P anonymity network. Existing network security measures often fail to detect I2P traffic, and prior work has focused on protocol-level traffic identification without assessing behavioral threats. The authors propose a two-phase machine learning approach: Phase 1 identifies I2P traffic, and Phase 2 classifies the identified traffic as either exfiltration or legitimate activity, which enables threat prioritization. The approach combines traffic identification with behavioral analysis using tree-based ensemble models (Random Forest in Phase 1, XGBoost in Phase 2), with packet timing and flow duration emerging as the most discriminative features. On the SafeSurf Darknet 2025 dataset, Phase 1 reaches 99.96% accuracy and Phase 2 reaches 91.11% accuracy, and the tree-based ensembles substantially outperform deep neural network and support vector machine baselines.
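The control flow of such a two-step cascade can be sketched in a few lines of Python. This is only an illustration of the routing logic, not the authors' implementation: the two classifiers are passed in as plain callables, and the feature names (`mean_iat_ms`, `duration_s`), the threshold rules, and the label strings are all hypothetical placeholders standing in for trained Random Forest and XGBoost models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One network flow as a mapping of feature name -> value (hypothetical schema).
Flow = Dict[str, float]


@dataclass
class Verdict:
    is_i2p: bool
    label: str  # "non-i2p", "i2p-legitimate", or "i2p-exfiltration"


def two_phase_classify(
    flow: Flow,
    is_i2p: Callable[[Flow], bool],
    is_exfiltration: Callable[[Flow], bool],
) -> Verdict:
    """Run the behavioral classifier only on flows the first step flags as I2P."""
    if not is_i2p(flow):
        return Verdict(False, "non-i2p")
    if is_exfiltration(flow):
        return Verdict(True, "i2p-exfiltration")
    return Verdict(True, "i2p-legitimate")


# Placeholder rules standing in for trained models.
# The features and thresholds are invented for illustration only.
def toy_phase1(flow: Flow) -> bool:
    return flow["mean_iat_ms"] < 50.0


def toy_phase2(flow: Flow) -> bool:
    return flow["duration_s"] > 300.0


flows: List[Flow] = [
    {"mean_iat_ms": 120.0, "duration_s": 12.0},
    {"mean_iat_ms": 20.0, "duration_s": 45.0},
    {"mean_iat_ms": 15.0, "duration_s": 900.0},
    {"mean_iat_ms": 200.0, "duration_s": 900.0},  # long-lived but not I2P
]

labels = [two_phase_classify(f, toy_phase1, toy_phase2).label for f in flows]
print(labels)
```

The last flow shows why the ordering matters: it would trip the second rule, but it is never passed to the behavioral classifier because the first step does not flag it as I2P. In practice the two callables would wrap the `predict` methods of whatever trained models are used.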
📝 Abstract
The Invisible Internet Project (I2P) provides strong anonymity through garlic routing and distributed network architecture, making it attractive for legitimate privacy needs. Nevertheless, the same properties can be exploited by malicious actors to steal sensitive information from corporate networks without detection. Current network security measures often fail to detect I2P traffic, and existing literature has focused primarily on protocol-level traffic identification without addressing behavioral threat assessment. This paper proposes a two-stage machine-learning model for I2P traffic analysis using the SafeSurf Darknet 2025 dataset comprising 184,548 network flows. Phase 1 achieved 99.96% accuracy in distinguishing I2P traffic from normal network traffic using a Random Forest classifier, with only 2 false positives among 32,318 normal flows. Phase 2 performed behavioral analysis on traffic identified as I2P, classifying it as either exfiltration or legitimate activity, achieving 91.11% accuracy using XGBoost. The system demonstrates that tree-based ensemble methods substantially outperform deep neural networks and support vector machines for this task. Feature importance analysis indicates that the most discriminative features are packet timing and flow duration. These findings establish that accurate I2P traffic detection and threat prioritization are achievable in operational network environments, enabling security teams to focus resources on high-risk events rather than monitoring all encrypted traffic.
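To put the Phase 1 false-positive figure in perspective, the following short calculation derives the false positive rate and specificity implied by the abstract's numbers (2 false positives among 32,318 normal flows). It assumes those 32,318 flows are the complete set of normal flows evaluated in Phase 1; the abstract does not give the remaining confusion-matrix counts, so nothing else is computed.

```python
# Figures taken from the abstract; the variable names are illustrative.
normal_flows = 32_318      # normal (non-I2P) flows evaluated in Phase 1
false_positives = 2        # normal flows incorrectly flagged as I2P

true_negatives = normal_flows - false_positives

# False positive rate: share of normal flows wrongly flagged as I2P.
fpr = false_positives / normal_flows

# Specificity: share of normal flows correctly left unflagged (1 - FPR).
specificity = true_negatives / normal_flows

print(f"True negatives: {true_negatives}")
print(f"False positive rate: {fpr:.4%}")
print(f"Specificity: {specificity:.4%}")
```

Under that assumption, the false positive rate is roughly 0.006%, or about one false alarm per 16,000 normal flows.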
Problem

Research questions and friction points this paper is trying to address.

Data Exfiltration
I2P
Anonymity Networks
Network Security
Threat Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

I2P traffic detection
data exfiltration
two-phase machine learning
behavioral threat assessment
tree-based ensemble methods