A Study of Effectiveness of Brand Domain Identification Features for Phishing Detection in 2025

📅 2025-03-09
🤖 AI Summary
This study addresses a core phishing detection challenge, domain-name impersonation: the mismatch between the domain a site claims and the authentic brand domain. We evaluate five Brand Domain Identification (BDI) features, including CN information and logo-domain alignment, on a dataset of more than 9,000 real-world websites. Weka-based attribute ranking indicates that none of the five features is redundant, and Random Forest is the best-performing classifier, attaining 99.7% accuracy with an average response time of 0.08 seconds. In a comparison of five machine learning models, including Random Forest and XGBoost, combinations of just three features reach 99.8% accuracy. The findings offer practical guidance for building lightweight, real-time, scalable phishing detection systems.

📝 Abstract
Phishing websites continue to pose a significant security challenge, making the development of robust detection mechanisms essential. Brand Domain Identification (BDI) serves as a crucial step in many phishing detection approaches. This study systematically evaluates the effectiveness of features employed over the past decade for BDI, focusing on their weighted importance in phishing detection as of 2025. The primary objective is to determine whether the identified brand domain matches the claimed domain, using features that are popular in phishing detection. To validate feature importance and evaluate performance, we conducted two experiments on a dataset comprising 4,667 legitimate sites and 4,561 phishing sites. In Experiment 1, we used the Weka tool, with four of its attribute ranking evaluators, to identify optimized and important feature sets among five features: CN Information (CN), Logo Domain (LD), Form Action Domain (FAD), Most Common Link in Domain (MCLD), and Cookie Domain. The results revealed that none of the features was redundant, and Random Forest emerged as the best classifier, achieving an accuracy of 99.7% with an average response time of 0.08 seconds. In Experiment 2, we trained five machine learning models (Random Forest, Decision Tree, Support Vector Machine, Multilayer Perceptron, and XGBoost) to assess the performance of individual BDI features and their combinations. An accuracy of 99.8% was achieved with combinations of only three features, namely (MCLD, LD, FAD) and (MCLD, CN, LD), with Random Forest again the best classifier. This study underscores the importance of leveraging key domain features for efficient phishing detection and paves the way for the development of real-time, scalable detection systems.
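
To make the search space of Experiment 2 concrete, the following is a minimal sketch that enumerates subsets of the five BDI features. It assumes, hypothetically, that every non-empty subset is a candidate; the abstract does not say which combinations were actually trained. The helper name `feature_subsets` is illustrative and not taken from the paper.

```python
from itertools import combinations

# Feature names as listed in the abstract.
BDI_FEATURES = ["CN", "LD", "FAD", "MCLD", "Cookie Domain"]


def feature_subsets(features, size=None):
    """Return all subsets of the given size, or every non-empty subset
    when size is None. Hypothetical helper, not the authors' code."""
    sizes = [size] if size is not None else range(1, len(features) + 1)
    return [combo for k in sizes for combo in combinations(features, k)]


# Three-feature combinations: C(5, 3) of them.
triples = feature_subsets(BDI_FEATURES, size=3)

# All non-empty subsets: 2**5 - 1 of them.
all_subsets = feature_subsets(BDI_FEATURES)

print(len(triples), len(all_subsets))
```

Under this assumption there are 10 three-feature candidates, and the two combinations reported at 99.8% accuracy, (MCLD, LD, FAD) and (MCLD, CN, LD), are among them.
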
Problem

Research questions and friction points this paper is trying to address.

Evaluates effectiveness of Brand Domain Identification features for phishing detection.
Determines if brand domain matches claimed domain using key features.
Validates feature importance and performance in detecting phishing websites.
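
The brand-domain match check described in the points above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes each feature yields a candidate domain, encodes a match as 1 and a mismatch or missing value as 0, and approximates the registered domain by the last two host labels (a real system would likely consult the Public Suffix List). The function names are hypothetical.

```python
from urllib.parse import urlparse


def registered_domain(url_or_host):
    """Naive registered-domain guess: the last two labels of the host.
    Assumption: this ignores multi-label suffixes such as co.uk."""
    host = urlparse(url_or_host).hostname if "://" in url_or_host else url_or_host
    labels = (host or "").lower().strip(".").split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else (host or "").lower()


def bdi_match_features(claimed_url, identified):
    """Map each BDI feature name to 1 if its identified domain matches the
    claimed domain, else 0. A missing value (None) counts as 0 here, which
    is an assumption about the encoding."""
    claimed = registered_domain(claimed_url)
    return {
        name: int(value is not None and registered_domain(value) == claimed)
        for name, value in identified.items()
    }


# Hypothetical page: most signals agree with the claimed domain, but the
# form posts elsewhere and no dominant link domain was found.
example = bdi_match_features(
    "https://login.example.com/path",
    {
        "CN": "example.com",
        "LD": "www.example.com",
        "FAD": "https://evil.test/post",
        "MCLD": None,
        "Cookie Domain": ".example.com",
    },
)
print(example)
```

A vector like this could then be passed to any of the classifiers the study compares; how the authors actually encode the features is not stated in the abstract.
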
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes Brand Domain Identification for phishing detection.
Employs Random Forest classifier achieving 99.7% accuracy.
Combines key domain features for enhanced detection efficiency.
Rina Mishra
PhD Scholar, Indian Institute of Technology, Jammu
Cybersecurity, Network Security, Antiphishing
Gaurav Varshney
Indian Institute of Technology, Jammu J&K 181221, India