🤖 AI Summary
This study addresses performance overestimation in network intrusion detection research, caused by data leakage and by temporal models being evaluated on inputs that are not genuine sequences. The authors reformulate CIC-IDS2017 as a temporal intrusion-detection task by building ordered flow sequences from network conversations, and evaluate under a random split, two leakage-free splits, and a padding-scheme ablation. They benchmark nine classical and deep learning architectures, including Transformer, LSTM, GRU, 1D-CNN, and Random Forest, and find that split protocol and padding convention affect reported performance more than the choice of architecture. On genuinely sequential (non-padded) windows the Transformer achieves the highest macro-F1 of any model (0.89), but it drops by 0.24 under zero-padding with masking, while LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation, Random Forest is the most robust model, and the Transformer's false-alarm rate rises from 0.04% to 2.7%, a 67-fold increase. Random splits with repeat-last padding can therefore overestimate model robustness by up to 0.24 macro-F1.
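To make the idea of a leakage-free partition concrete, here is a minimal sketch of a group-wise split. It is not the authors' partitioning scheme, which this summary does not specify. The function name `group_split`, the `test_fraction` parameter, and the use of a conversation identifier as the group key are all assumptions made for illustration. The only property it demonstrates is that every sample from a given group lands on one side of the split.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split sample indices so that no group appears in both train and test.

    `groups[i]` is the (hypothetical) group key of sample i, for example an
    identifier of the network conversation the sample was drawn from.
    """
    # Deterministic ordering before shuffling keeps the split reproducible.
    unique = sorted(set(groups), key=repr)
    rng = random.Random(seed)
    rng.shuffle(unique)

    # Hold out whole groups, never individual samples.
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

A plain random split assigns samples independently, so windows from the same conversation can end up in both partitions. Splitting on the group key rules that out by construction. scikit-learn's `GroupShuffleSplit` offers the same kind of split for projects that already depend on it.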
📝 Abstract
Recent deep learning approaches for network intrusion detection increasingly incorporate temporal architectures such as recurrent networks and Transformers, often reporting near-perfect performance on CIC-IDS2017. However, many existing studies neither supply their temporal modules with genuine sequence inputs nor evaluate under realistic, leakage-free conditions, making it unclear whether reported gains arise from true sequence-modeling capability. In this work, we reformulate CIC-IDS2017 as a temporal intrusion-detection task by constructing ordered flow sequences from network conversations and benchmarking nine classical and deep learning architectures under a random split, two leakage-free splits, and a padding-scheme ablation. The central finding is that padding convention, not architecture, determines the Transformer's performance: on genuinely sequential (non-padded) windows the Transformer achieves the highest macro-F1 of any model in the experiment (0.89); under zero-pad+mask evaluation it drops markedly (-0.24 macro-F1), while LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation the Random Forest is the most robust model (+0.009), while the Transformer's false-alarm rate grows from 0.04% to 2.7%, a 67-fold increase invisible under conventional protocols. These findings demonstrate that evaluation methodology -- specifically padding convention and split protocol -- has a larger effect on reported performance than architectural choice, and that widely used random splits with repeat-last padding can overestimate model robustness by up to 0.24 macro-F1. We advocate leakage-free splits, explicit padding disclosure, and sequence-aware benchmarking as standard practice in future IDS research. Code and implementation details are available at https://github.com/zachmocz/temporal-ids-bench.
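The two padding conventions contrasted in the abstract can be illustrated with a small sketch. It is not taken from the authors' repository. The function names, the list-of-feature-vectors layout, and the mask convention (`True` marks a real flow) are assumptions chosen for readability.

```python
from typing import List, Tuple

Flow = List[float]  # one flow as a feature vector (hypothetical layout)


def pad_repeat_last(window: List[Flow], length: int) -> List[Flow]:
    """Pad a short window to `length` by repeating its final flow."""
    if not window:
        raise ValueError("window must contain at least one flow")
    kept = [list(f) for f in window[:length]]
    # Every padded position is a copy of the last real flow.
    return kept + [list(kept[-1]) for _ in range(length - len(kept))]


def pad_zero_mask(window: List[Flow], length: int) -> Tuple[List[Flow], List[bool]]:
    """Pad with all-zero flows and return a mask (True = real flow)."""
    if not window:
        raise ValueError("window must contain at least one flow")
    kept = [list(f) for f in window[:length]]
    n_pad = length - len(kept)
    width = len(kept[0])
    padded = kept + [[0.0] * width for _ in range(n_pad)]
    # The mask lets a model ignore padded positions instead of reading zeros.
    mask = [True] * len(kept) + [False] * n_pad
    return padded, mask
```

For a two-flow window padded to length four, `pad_repeat_last` yields four positions that all carry real-looking features, while `pad_zero_mask` yields two real positions, two zero positions, and a mask marking which is which. A model evaluated under one convention sees quite different inputs than under the other, which is why the abstract asks for padding to be disclosed explicitly.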