🤖 AI Summary
This work addresses the problem of relating Internet scanning sources to one another when semantic annotations are scarce. The authors propose a transformer model that embeds minimally preprocessed sequences of network flow records and is trained by contrastive learning, without pretraining and without annotations. The pairwise similarities obtained from the model define a correlation clustering problem, which is solved locally to group related sources. Experiments show that learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and that this property generalizes to unseen sequences of unseen sources. The resulting clusters are consistent with scanner labels.
📝 Abstract
Understanding the activities of Internet scanners is challenging; it often requires identifying relationships between sources, a task for which semantic annotations are scarce. This work investigates whether semantically meaningful pairwise relationships between sequences of network flow records can be estimated by contrastive learning, without pretraining and without annotations. To this end, we propose a transformer model that embeds minimally preprocessed sequences of network flow records and train it using contrastive learning. With the similarities obtained from this model, we state a correlation clustering problem and solve it locally. Experimentally, we show that learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and that this property generalizes to unseen sequences of unseen sources. Moreover, correlation clustering yields clusters consistent with scanner labels. The complete source code for the algorithms and for reproducing the experiments is publicly available.
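To make the clustering step concrete, the following is a minimal sketch of correlation clustering solved by a greedy local search. It is not the authors' implementation. It rests on several assumptions: that each sequence has already been embedded as a vector, that similarity is the cosine of two embeddings, and that subtracting a hypothetical `threshold` turns similarities into signed weights (positive favours the same cluster, negative favours different clusters). The function names `cosine`, `signed_weights`, and `local_correlation_clustering` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def signed_weights(embeddings, threshold=0.5):
    """Shift pairwise similarities by an assumed threshold to get signed weights."""
    n = len(embeddings)
    return [
        [0.0 if i == j else cosine(embeddings[i], embeddings[j]) - threshold
         for j in range(n)]
        for i in range(n)
    ]


def local_correlation_clustering(weights, max_rounds=100):
    """Greedy node-move local search.

    Maximises the sum of weights over all pairs placed in the same cluster.
    Every move strictly increases that sum, so the search stops at a local optimum.
    """
    n = len(weights)
    labels = list(range(n))  # start with every node in its own cluster
    for _ in range(max_rounds):
        changed = False
        for i in range(n):
            # Total weight between node i and each cluster (excluding i itself).
            gain = {}
            for j in range(n):
                if j != i:
                    gain[labels[j]] = gain.get(labels[j], 0.0) + weights[i][j]
            current_gain = gain.get(labels[i], 0.0)
            best_label, best_gain = max(
                gain.items(), key=lambda kv: kv[1], default=(labels[i], 0.0)
            )
            if best_gain <= 0.0:
                # No cluster is attractive; isolate i if its own cluster hurts.
                if current_gain < 0.0:
                    labels[i] = max(labels) + 1
                    changed = True
            elif best_gain > current_gain:
                labels[i] = best_label
                changed = True
        if not changed:
            break
    return labels


# Toy embeddings: two pairs of nearly parallel vectors.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = local_correlation_clustering(signed_weights(embeddings))
```

In this toy input, the first two vectors end up in one cluster and the last two in another. Unlike k-means, the number of clusters is not fixed in advance; it follows from the signs of the weights, which is why the choice of threshold (an assumption here) matters.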