Context-Aware Phishing Email Detection Using Machine Learning and NLP

📅 2026-03-28

🤖 AI Summary

This study addresses a limitation of traditional phishing email detection methods: they rely heavily on URLs and neglect the semantic content of the message body. The authors propose an end-to-end detection approach based on the full text of the email. The method applies standard NLP preprocessing (lowercasing, tokenization, stop-word removal, and lemmatization), followed by TF-IDF feature extraction over both unigrams and bigrams to capture local context. Two classifiers, Naive Bayes and Logistic Regression, are evaluated on a combined corpus of 53,973 labeled emails. Logistic Regression performs better, reaching 95.41% accuracy and a 94.33% F1-score and outperforming Naive Bayes by 1.55 percentage points. The system is deployed as a real-time web service with a FastAPI backend and reports an average response time of 127 ms.
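The preprocessing and feature-extraction steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the example emails, the smoothed IDF variant, and the function names `preprocess`, `ngrams`, and `tfidf` are all hypothetical, and lemmatization is left out because it needs a lexical resource that the summary does not name.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the paper's list is not given).
STOP_WORDS = {"a", "an", "the", "to", "your", "is", "of", "and", "for", "are"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens):
    """Return the unigrams plus the bigrams of adjacent tokens."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(docs):
    """Map each document to an L2-normalized {term: weight} dict."""
    term_lists = [ngrams(preprocess(d)) for d in docs]
    n_docs = len(term_lists)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        # Smoothed IDF, ln((1 + N) / (1 + df)) + 1, one common variant (assumption).
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


# Hypothetical example emails, not taken from the paper's datasets.
emails = [
    "Verify your account now to avoid suspension",
    "Meeting notes for the project are attached",
    "Your account statement is attached",
]
vectors = tfidf(emails)
```

Because stop words are removed before bigrams are formed, a bigram such as "verify account" can span a dropped word. Terms that occur in fewer documents receive a higher weight, which is what lets a linear classifier such as Logistic Regression separate phishing-specific phrasing from common vocabulary.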

📝 Abstract

Phishing attacks remain among the most prevalent cybersecurity threats, causing significant financial losses for individuals and organizations worldwide. This paper presents a machine learning-based phishing email detection system that analyzes email body content using natural language processing (NLP) techniques. Unlike existing approaches that primarily focus on URL analysis, our system classifies emails by extracting contextual features from the entire email content. We evaluated two classification models, Naive Bayes and Logistic Regression, trained on a combined corpus of 53,973 labeled emails from three distinct datasets. Our preprocessing pipeline incorporates lowercasing, tokenization, stop-word removal, and lemmatization, followed by Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction with unigrams and bigrams. Experimental results demonstrate that Logistic Regression achieves 95.41% accuracy with an F1-score of 94.33%, outperforming Naive Bayes by 1.55 percentage points. The system was deployed as a web application with a FastAPI backend, providing real-time phishing classification with an average response time of 127 ms.
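For readers less familiar with the reported metrics, the sketch below shows how accuracy and F1-score are derived from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the paper's results, and the function names are illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all emails that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the phishing class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: tp = phishing caught, fp = legitimate mail flagged,
# fn = phishing missed, tn = legitimate mail passed.
tp, fp, fn, tn = 90, 5, 10, 95

acc = accuracy(tp, fp, fn, tn)  # 185 / 200 = 0.925
f1 = f1_score(tp, fp, fn)       # 180 / 195, about 0.923
```

Accuracy counts both classes equally, while F1 ignores true negatives and is therefore more sensitive to missed phishing emails and false alarms, which is presumably why the abstract reports both.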

Problem

Research questions and friction points this paper is trying to address.

phishing email detection

context-aware

natural language processing

machine learning

email content analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

context-aware phishing detection

natural language processing

machine learning