Phish-Blitz: Advancing Phishing Detection with Comprehensive Webpage Resource Collection and Visual Integrity Preservation

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Phishing attacks continue to evolve, yet existing detection models are held back by scarce, low-quality training data. Short-lived phishing websites are especially hard to capture together with their multimodal features (URL, HTML, screenshots, logos). To address this, we propose Phish-Blitz, a webpage acquisition tool that combines automated crawling, browser-based rendering, and resource dependency analysis. It captures screenshots of live pages and rewrites resource paths automatically, so that a saved page keeps its original visual appearance. Using this tool, we construct PhishWeb, an open-source, multimodal phishing detection dataset comprising 8,809 legitimate and 5,000 phishing webpages, each accompanied by its URL, HTML source, full-screen screenshot, and extracted logo. The dataset is intended to help deep learning models generalize to real-world pages and to support further research on multimodal phishing detection.

📝 Abstract
Phishing attacks are increasingly prevalent, with adversaries creating deceptive webpages to steal sensitive information. Despite advancements in machine learning and deep learning for phishing detection, attackers constantly develop new tactics to bypass detection models. As a result, phishing webpages continue to reach users, particularly those unable to recognize phishing indicators. To improve detection accuracy, models must be trained on large datasets containing both phishing and legitimate webpages, including URLs, webpage content, screenshots, and logos. However, existing tools struggle to collect the required resources, especially given the short lifespan of phishing webpages, limiting dataset comprehensiveness. In response, we introduce Phish-Blitz, a tool that downloads phishing and legitimate webpages along with their associated resources, such as screenshots. Unlike existing tools, Phish-Blitz captures live webpage screenshots and updates resource file paths to maintain the original visual integrity of the webpage. We provide a dataset containing 8,809 legitimate and 5,000 phishing webpages, including all associated resources. Our dataset and tool are publicly available on GitHub, contributing to the research community by offering a more complete dataset for phishing detection.
Problem

Research questions and friction points this paper is trying to address.

Detecting phishing webpages with evolving evasion tactics
Collecting comprehensive webpage resources for accurate detection models
Preserving visual integrity of webpages to improve dataset quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Downloads webpages with all associated resources
Captures live screenshots preserving visual integrity
Updates resource paths to maintain webpage authenticity
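
The path-updating idea in the list above can be illustrated with a short sketch. This is not the Phish-Blitz implementation; it is a minimal, standard-library-only example written under assumptions: the names `RESOURCE_ATTRS`, `local_name`, `PathRewriter`, and `rewrite_resource_paths` are hypothetical, the `resources/<hash>` naming scheme is invented for illustration, and only three tag/attribute pairs are handled. It resolves each resource reference against the page URL and replaces it with a local path, returning the mapping that a downloader would need to fetch the files.

```python
import hashlib
from html import escape
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Tag/attribute pairs treated as resource references (illustrative, not exhaustive;
# note that every <link href> is rewritten, including non-stylesheet links).
RESOURCE_ATTRS = {("img", "src"), ("script", "src"), ("link", "href")}


def local_name(absolute_url: str) -> str:
    """Map a resource URL to a stable local path (hypothetical naming scheme)."""
    suffix = PurePosixPath(urlparse(absolute_url).path).suffix
    digest = hashlib.sha256(absolute_url.encode("utf-8")).hexdigest()[:12]
    return f"resources/{digest}{suffix}"


class PathRewriter(HTMLParser):
    """Re-emit an HTML document with resource attributes pointing at local files."""

    def __init__(self, base_url: str):
        # Keep character references as-is so text content is passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []       # pieces of the rewritten document
        self.mapping = {}   # absolute resource URL -> local path

    def _render(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # boolean attribute such as "async"
                parts.append(name)
                continue
            if (tag, name) in RESOURCE_ATTRS and value:
                absolute = urljoin(self.base_url, value)
                value = self.mapping.setdefault(absolute, local_name(absolute))
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closing + ">")

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_resource_paths(html: str, base_url: str):
    """Return (rewritten_html, mapping of absolute URL -> local path)."""
    parser = PathRewriter(base_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.mapping


if __name__ == "__main__":
    page = '<html><body><img src="img/logo.png"><a href="/login">Sign in</a></body></html>'
    rewritten, mapping = rewrite_resource_paths(page, "https://example.com/path/page.html")
    print(rewritten)
    print(mapping)
```

Ordinary hyperlinks such as `<a href>` are left untouched in this sketch, since they are navigation targets rather than resources needed to render the page. A real collector would also need to handle URLs inside CSS and inline styles, which this example does not attempt.
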
Duddu Hriday
Indian Institute of Technology, Dharwad, India
Aditya Kulkarni
Indian Institute of Technology (IIT) Dharwad
Cybersecurity, Phishing, DNS Security, ML and DL
Vivek Balachandran
Singapore Institute of Technology, Singapore
Tamal Das
Indian Institute of Technology, Dharwad, India