🤖 AI Summary
In e-commerce fraud detection, Transformer-based models have historically underperformed gradient-boosted decision trees (GBDTs), and production deployment introduces selection bias because the live system influences which subset of traffic becomes labeled; this is conventionally mitigated by randomly sampling a small control group. Method: We propose a novel Tabular Transformer paradigm grounded in self-supervised pretraining: the model is pretrained on large-scale unlabeled tabular data via masked feature reconstruction, with customized numerical feature embeddings and learnable positional encodings, then fine-tuned on limited labeled data. Contribution/Results: To our knowledge, this is the first validation of such an approach in a real industrial setting. Our method significantly outperforms heavily tuned GBDTs in Average Precision (AP). Remarkably, it reaches GBDT-level performance using only 10% of the labeled data. It substantially mitigates selection bias, reducing reliance on control-group sampling, and demonstrates superior robustness and consistency in low-data regimes.
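As a rough illustration of the masked-feature-reconstruction objective mentioned above (a minimal sketch, not the authors' implementation; the masking ratio, sentinel value, and function names are assumptions for this example):

```python
import numpy as np

def mask_features(X, mask_ratio=0.15, mask_value=0.0, seed=0):
    """Corrupt a tabular batch by replacing a random subset of feature
    cells with a sentinel value; the pretraining task is then to
    reconstruct the original values at the masked positions."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < mask_ratio  # True where a cell is hidden
    X_corrupt = X.copy()
    X_corrupt[mask] = mask_value
    return X_corrupt, mask

def masked_reconstruction_loss(X_true, X_pred, mask):
    """MSE computed only on the masked cells, as in masked-feature SSL."""
    return float(np.mean((X_true[mask] - X_pred[mask]) ** 2))

X = np.arange(12, dtype=float).reshape(3, 4)
X_corrupt, mask = mask_features(X, mask_ratio=0.5)
# Unmasked cells are untouched; a perfect reconstruction incurs zero loss.
assert np.all(X_corrupt[~mask] == X[~mask])
assert masked_reconstruction_loss(X, X, mask) == 0.0
```

In the full setup, a Transformer would consume the corrupted rows (via the numerical feature embeddings) and be trained to predict the original values at the masked positions, so no labels are needed for pretraining.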
📝 Abstract
Transformer-based neural networks, empowered by Self-Supervised Learning (SSL), have demonstrated unprecedented performance across various domains. However, related literature suggests that tabular Transformers may struggle to outperform classical Machine Learning algorithms, such as Gradient Boosted Decision Trees (GBDT). In this paper, we aim to challenge GBDTs with tabular Transformers on a typical e-commerce task, namely fraud detection. Our study is additionally motivated by the problem of selection bias, which often occurs in real-life fraud detection systems: the production system affects which subset of traffic becomes labeled. This issue is typically addressed by randomly sampling a small part of the whole production traffic, referred to as a Control Group. This subset follows the target distribution of production data and is therefore usually preferred for training classification models with standard ML algorithms. Our methodology leverages the capability of Transformers to learn transferable representations from all available data by means of SSL, giving them an advantage over classical methods. Furthermore, we conduct large-scale experiments, pre-training tabular Transformers on vast amounts of data instances and fine-tuning them on smaller target datasets. The proposed approach outperforms heavily tuned GBDTs by a considerable margin in the Average Precision (AP) score. Pre-trained models show more consistent performance than models trained from scratch when fine-tuning data is limited. Moreover, they require noticeably less labeled data to reach performance comparable to a GBDT competitor that utilizes the whole dataset.
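For reference, the Average Precision (AP) score used to compare the models can be sketched with a minimal rank-based implementation (not the paper's evaluation code; the function name and toy data are illustrative):

```python
def average_precision(labels, scores):
    """Rank-based Average Precision: precision is accumulated at the
    rank of each retrieved positive, then averaged over all positives."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / n_pos

# Toy example: positives ranked 1st and 3rd out of four.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6])
# AP = (1/1 + 2/3) / 2 = 5/6 ≈ 0.833
```

AP summarizes the precision-recall trade-off in a single number, which is why it is favored over accuracy for heavily imbalanced problems such as fraud detection.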