AI Summary
To address label scarcity and severe class imbalance in supply chain fraud detection, this paper proposes a two-stage semi-supervised learning framework. In the first stage, Isolation Forest performs coarse unsupervised anomaly screening; in the second stage, a self-training SVM refines detection by incorporating high-confidence pseudo-labels. The method thus combines unsupervised anomaly detection with semi-supervised classification. Evaluated on the real-world DataCo supply chain dataset, it achieves an F1-score of 0.817 at a false positive rate below 3.0%, outperforming conventional supervised and single-stage semi-supervised baselines. The authors present it as an interpretable, accurate, and deployment-friendly approach to supply chain risk control under low supervision and highly imbalanced conditions.
Abstract
Detecting fraud in modern supply chains is a growing challenge, driven by the complexity of global networks and the scarcity of labeled data. Traditional detection methods often struggle with class imbalance and limited supervision, reducing their effectiveness in real-world applications. This paper proposes a novel two-phase learning framework to address these challenges. In the first phase, the Isolation Forest algorithm performs unsupervised anomaly detection to identify potential fraud cases and reduce the volume of data requiring further analysis. In the second phase, a self-training Support Vector Machine (SVM) refines the predictions using both labeled and high-confidence pseudo-labeled samples, enabling robust semi-supervised learning. The proposed method is evaluated on the DataCo Smart Supply Chain Dataset, a comprehensive real-world supply chain dataset with fraud indicators. It achieves an F1-score of 0.817 while maintaining a false positive rate below 3.0%. These results demonstrate the effectiveness and efficiency of combining unsupervised pre-filtering with semi-supervised refinement for supply chain fraud detection under real-world constraints, though we acknowledge limitations regarding concept drift and the need for comparison with deep learning approaches.
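The two-phase idea described in the abstract can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data is synthetic (generated with `make_classification`, not the DataCo dataset), the `contamination` and `threshold` values are arbitrary, and the inference rule (only rows flagged by the Isolation Forest are passed to the SVM, all others are treated as legitimate) is one plausible reading of "pre-filtering" rather than something the abstract specifies. The function names `fit_two_stage`, `predict_two_stage`, and `run_demo` are hypothetical, and the sketch makes no attempt to reproduce the reported F1-score or false positive rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC


def fit_two_stage(X, y_partial, contamination=0.1, threshold=0.9, seed=0):
    """Fit both stages. y_partial: 1 = fraud, 0 = legitimate, -1 = unlabeled."""
    # Stage 1: unsupervised screening; predict() returns -1 for anomalies.
    iso = IsolationForest(contamination=contamination, random_state=seed).fit(X)
    flagged = iso.predict(X) == -1

    # Keep flagged rows plus every labeled row, so the SVM sees both classes
    # (an assumption of this sketch).
    keep = flagged | (y_partial != -1)

    # Stage 2: self-training SVM. Unlabeled rows whose predicted class
    # probability reaches `threshold` are pseudo-labeled and added to training.
    svm = make_pipeline(
        StandardScaler(),
        SVC(probability=True, class_weight="balanced", random_state=seed),
    )
    clf = SelfTrainingClassifier(svm, threshold=threshold)
    clf.fit(X[keep], y_partial[keep])
    return iso, clf


def predict_two_stage(iso, clf, X):
    """Assumed inference rule: only Isolation Forest flags reach the SVM."""
    flagged = iso.predict(X) == -1
    y_pred = np.zeros(len(X), dtype=int)
    if flagged.any():
        y_pred[flagged] = clf.predict(X[flagged])
    return y_pred


def run_demo(seed=0):
    # Synthetic, imbalanced stand-in data (roughly 5% positives).
    X, y = make_classification(
        n_samples=2000, n_features=10, n_informative=5,
        weights=[0.95, 0.05], random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # Simulate label scarcity: reveal only 10% of training labels (stratified).
    _, labeled_idx = train_test_split(
        np.arange(len(y_train)), test_size=0.1, stratify=y_train, random_state=seed
    )
    y_partial = np.full_like(y_train, -1)
    y_partial[labeled_idx] = y_train[labeled_idx]

    iso, clf = fit_two_stage(X_train, y_partial, seed=seed)
    y_pred = predict_two_stage(iso, clf, X_test)

    # F1 = 2PR / (P + R); false positive rate = FP / (FP + TN).
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "f1": float(f1_score(y_test, y_pred, zero_division=0)),
        "fpr": float(fp / (fp + tn)),
        "y_pred": y_pred,
    }


if __name__ == "__main__":
    result = run_demo()
    print(f"F1 = {result['f1']:.3f}, FPR = {result['fpr']:.3f}")
```

On real data, `contamination` controls how much the first stage reduces the volume passed to the SVM, and `threshold` controls how conservative the pseudo-labeling is. Both would need tuning on a validation split.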