🤖 AI Summary
Existing URL understanding methods directly adopt general-purpose language models (e.g., BERT), overlooking the domain-specific characteristics of URLs, namely their loose syntactic structure and sparse semantics, and are typically confined to single-task learning, which limits generalization. This paper introduces URLBERT, the first pre-trained language model specifically designed for URL understanding. URLBERT uses a URL tokenizer trained on a corpus of billions of URLs, applies self-supervised contrastive learning over variants of the same URL to model URL structure, and employs virtual adversarial training to make semantic feature extraction more robust. It is evaluated on three downstream tasks (phishing URL detection, web page classification, and ad filtering) and achieves state-of-the-art performance on them. Its multi-task variant performs on par with independently fine-tuned task-specific models, which suggests that a single model can serve several tasks. The code is publicly available.
📝 Abstract
URLs play a crucial role in understanding and categorizing web content, particularly in tasks related to security control and online recommendations. While pre-trained models currently dominate many fields, URL analysis still lacks specialized pre-trained models. To address this gap, this paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification and detection tasks. We first train a URL tokenizer on a corpus of billions of URLs to handle URL tokenization. We then propose two novel pre-training tasks: (1) self-supervised contrastive learning, which strengthens the model's understanding of URL structure and its ability to capture category differences by distinguishing different variants of the same URL; and (2) virtual adversarial training, which aims to improve the model's robustness in extracting semantic features from URLs. Finally, we evaluate the proposed methods on phishing URL detection, web page classification, and ad filtering, achieving state-of-the-art performance. Importantly, we also explore multi-task learning with URLBERT, and experimental results demonstrate that a multi-task learning model based on URLBERT is as effective as independently fine-tuned models, showing that URLBERT handles complex task requirements in a simple way. The code for our work is available at https://github.com/Davidup1/URLBERT.
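To make the contrastive pre-training idea more concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss, in which the embedding of a URL should be closer to the embedding of its own variant than to the variants of other URLs in the batch. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the exact loss, so InfoNCE is swapped in as a common choice, and the names `cosine`, `info_nce`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed loss, not taken from the paper).

    anchors[i] is the embedding of URL i and positives[i] is the embedding
    of a variant of the same URL. All other positives[j] act as negatives.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching variant.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: each anchor is near its own variant.
anchors = [[1.0, 0.0], [0.0, 1.0]]
variants = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(anchors, variants)
shuffled_loss = info_nce(anchors, list(reversed(variants)))
```

In this toy batch, `aligned_loss` is smaller than `shuffled_loss`, because pairing each URL with its own variant is exactly what the loss rewards. In a real setup the embeddings would come from the encoder and the loss would be minimized by gradient descent.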