How Can We Effectively Use LLMs for Phishing Detection?: Evaluating the Effectiveness of Large Language Model-based Phishing Detection Models

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the application of large language models (LLMs) to phishing website detection and target brand identification, focusing on generalization capability and interpretability. The authors systematically evaluate the impact of input modality (screenshot, logo, HTML, and URL), temperature settings, and prompting strategies. Commercial LLMs (GPT-4.1 and Gemini 2.0 Flash) generally outperform open-source models and conventional deep learning baselines on phishing detection, while among open-source models Qwen reaches up to 92% brand identification accuracy. The best configuration, screenshot inputs with zero-temperature decoding, yields 93-95% brand identification accuracy for commercial models; combining multiple input modalities or applying one-shot prompting brings no consistent gains and can degrade results. HTML serves as an effective fallback when screenshot information is insufficient. The evaluation, conducted on 19,131 real-world phishing websites, offers practical guidance for deploying interpretable LLM-based phishing detectors.

📝 Abstract
Large language models (LLMs) have emerged as a promising phishing detection mechanism, addressing the limitations of traditional deep learning-based detectors, including poor generalization to previously unseen websites and a lack of interpretability. However, LLMs' effectiveness for phishing detection remains unexplored. This study investigates how to effectively leverage LLMs for phishing detection (including target brand identification) by examining the impact of input modalities (screenshots, logos, HTML, and URLs), temperature settings, and prompt engineering strategies. Using a dataset of 19,131 real-world phishing websites and 243 benign sites, we evaluate seven LLMs -- two commercial models (GPT-4.1 and Gemini 2.0 Flash) and five open-source models (Qwen, Llama, Janus, DeepSeek-VL2, and R1) -- alongside two deep learning (DL)-based baselines (PhishIntention and Phishpedia). Our findings reveal that commercial LLMs generally outperform open-source models in phishing detection, while DL models demonstrate better performance on benign samples. For brand identification, screenshot inputs achieve optimal results, with commercial LLMs reaching 93-95% accuracy and open-source models, particularly Qwen, achieving up to 92%. However, incorporating multiple input modalities simultaneously or applying one-shot prompts does not consistently enhance performance and may degrade results. Furthermore, higher temperature values reduce performance. Based on these results, we recommend using screenshot inputs with zero temperature to maximize accuracy for LLM-based detectors, with HTML serving as auxiliary context when screenshot information is insufficient.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM effectiveness for phishing website detection
Investigating optimal input modalities and prompt strategies
Comparing commercial and open-source LLM detection performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using screenshot inputs for brand identification
Applying zero temperature to maximize detection accuracy
Utilizing HTML as auxiliary context when needed
Fujiao Ji (University of Tennessee, Knoxville)
Doowon Kim (University of Tennessee, Knoxville)
Computer Security