GuardPhish: Securing Open-Source LLMs from Phishing Abuse

📅 2026-04-19

🤖 AI Summary

This study addresses the vulnerability of open-source large language models (LLMs) to adversarial phishing prompts in offline settings, where static safety configurations offer limited protection. The authors introduce GuardPhish, a dataset of 70,015 multi-vector phishing prompts, and systematically uncover a critical enforcement gap between intent recognition and refusal generation in open-source LLMs. To bridge this gap, they propose a modular defense that requires no modification to the underlying model: a Transformer-based pre-classifier that screens prompts before generation. The classifier is trained on prompts spanning web, email, SMS, and voice attack vectors, with labels produced by a five-model ensemble and residual disagreements resolved by expert adjudication. Evaluated on GuardPhish, the classifier achieves up to 98.27% accuracy and reduces phishing content generation success rates from 98.5% to near zero, demonstrating both the efficacy of the proposed method and the inadequacy of current safety protocols.
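The pre-generation filtering idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `guarded_generate`, `toy_classifier`, `toy_generator`, the keyword cues, and the 0.5 threshold are hypothetical stand-ins, not the authors' classifier or any model evaluated in the paper. The point is only the control flow: the classifier runs first, and the unmodified generator is called only when the prompt passes.

```python
from typing import Callable

REFUSAL = "Request declined: the prompt was classified as a phishing attempt."


def guarded_generate(prompt: str,
                     classify: Callable[[str], float],
                     generate: Callable[[str], str],
                     threshold: float = 0.5) -> str:
    """Screen the prompt first; call the generator only if it passes.

    `classify` returns an estimated probability that the prompt is a
    phishing request. The generator itself is never modified.
    """
    if classify(prompt) >= threshold:
        return REFUSAL
    return generate(prompt)


def toy_classifier(prompt: str) -> float:
    """Hypothetical stand-in for a trained Transformer classifier."""
    cues = ("verify your account", "urgent", "password", "gift card")
    hits = sum(cue in prompt.lower() for cue in cues)
    return min(1.0, hits / 2)


def toy_generator(prompt: str) -> str:
    """Hypothetical stand-in for an unmodified open-source LLM."""
    return f"[model output for: {prompt}]"


if __name__ == "__main__":
    for p in ("Write an urgent SMS asking users to confirm their password",
              "Summarize this meeting agenda"):
        print(guarded_generate(p, toy_classifier, toy_generator))
```

Because the filter only wraps the call to the generator, swapping the toy classifier for a trained model would not require touching the generator, which is the sense in which such a defense is modular.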

📝 Abstract

The rapid adoption of open-source Large Language Models (LLMs) in offline and enterprise environments has introduced a largely unexamined security risk: susceptibility to adversarial phishing prompts under static safety configurations. In this work, we systematically investigate this vulnerability through GuardPhish, a large-scale, multi-vector phishing prompt dataset comprising 70,015 samples spanning web, email, SMS, and voice attack scenarios derived from real-world campaigns. Using a deterministic five-model ensemble for labeling, we achieve near-perfect inter-model agreement (Fleiss' kappa = 0.9141), with residual disagreements resolved through expert adjudication. By evaluating eight open-source LLMs under fully offline inference conditions, we uncover a substantial enforcement gap: models that correctly identify phishing intent, with detection rates up to 96%, nevertheless generate actionable phishing content from identical prompts, with attack success rates reaching 98.5% in voice-based scenarios. These findings demonstrate that intent classification alone does not guarantee generative refusal in the absence of dynamic guardrails. To mitigate this risk, we train Transformer-based classifiers on GuardPhish, achieving up to 98.27% accuracy as modular pre-generation filters deployable without modifying the underlying generative model. Our results highlight a critical weakness in current open-source LLM deployments and provide a reproducible foundation for strengthening defenses against phishing and social engineering attacks.
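For readers unfamiliar with the agreement statistic quoted above, the following is a small sketch of how Fleiss' kappa is computed for a fixed number of raters per item. The function name `fleiss_kappa`, the label names, and the three-item example are illustrative assumptions; they are not the authors' labeling pipeline or data.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]],
                 categories: Sequence[str]) -> float:
    """Fleiss' kappa for items each rated by the same number of raters.

    `ratings[i]` holds the labels assigned to item i, one per rater.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]

    # Observed agreement: mean fraction of agreeing rater pairs per item.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters)
              for k in categories]
    p_expected = sum(s * s for s in shares)

    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical labels from five raters (e.g. five labeling models).
    example = [
        ["phish"] * 5,
        ["benign"] * 5,
        ["phish"] * 4 + ["benign"],
    ]
    print(round(fleiss_kappa(example, ["phish", "benign"]), 4))
```

A value of 1 indicates complete agreement and 0 indicates agreement no better than chance, so a kappa above 0.9 means the five labelers disagreed on only a small share of samples.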

Problem

Research questions and friction points this paper is trying to address.

phishing

open-source LLMs

adversarial prompts

security vulnerability

generative refusal

Innovation

Methods, ideas, or system contributions that make the work stand out.

phishing detection

open-source LLMs

adversarial prompts