CyberCane: Neuro-Symbolic RAG for Privacy-Preserving Phishing Detection with Formal Ontology Reasoning

📅 2026-04-26

🤖 AI Summary
This work addresses the stringent requirements of phishing email detection in privacy-sensitive settings (low false positive rates, interpretability, regulatory compliance, and robustness against AI-generated attacks) by proposing CyberCane, a neuro-symbolic framework that combines deterministic symbolic analysis with privacy-preserving retrieval-augmented generation (RAG) and formal ontology-based reasoning. The approach applies lightweight symbolic rules to email metadata and escalates only borderline cases to the RAG stage: after automated redaction of sensitive data, it retrieves evidence from a phishing-only corpus and classifies the email semantically. A companion OWL ontology, PhishOnt, supports verifiable attack classification through formal reasoning chains. The design keeps unredacted sensitive data away from external APIs and offers tunable operating points for different risk tolerances. Evaluated on DataPhish2025 (12.3k emails; mixed human/LLM) and Nazario/SpamAssassin, CyberCane shows a 78.6-point recall gain over symbolic-only detection on AI-generated threats, precision exceeding 98%, and a false positive rate as low as 0.16%, with a projected 542× return on investment in healthcare deployments.
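The pre-screen-then-escalate routing described above can be illustrated with a minimal sketch. The metadata fields, rule weights, and thresholds below are hypothetical and chosen only for illustration; they are not taken from the CyberCane implementation.

```python
from dataclasses import dataclass


@dataclass
class EmailMeta:
    # Hypothetical metadata features; the paper's actual rule set is not given here.
    spf_pass: bool
    dkim_pass: bool
    reply_to_mismatch: bool
    url_count: int


def symbolic_score(meta: EmailMeta) -> int:
    """Sum integer rule weights (assumed values) for each suspicious signal."""
    score = 0
    if not meta.spf_pass:
        score += 3
    if not meta.dkim_pass:
        score += 3
    if meta.reply_to_mismatch:
        score += 3
    if meta.url_count > 3:
        score += 1
    return score


def route(meta: EmailMeta, low: int = 2, high: int = 8) -> str:
    """Decide cheaply when the rules are conclusive; otherwise escalate."""
    score = symbolic_score(meta)
    if score >= high:
        return "phishing"
    if score <= low:
        return "benign"
    return "escalate"  # borderline case: hand off to the RAG stage


print(route(EmailMeta(spf_pass=False, dkim_pass=True, reply_to_mismatch=False, url_count=1)))
```

Moving `low` and `high` changes how many emails reach the more expensive second stage, which is one plausible way to realize an adjustable operating point.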

📝 Abstract
Privacy-critical domains require phishing detection systems that satisfy contradictory constraints: near-zero false positives to prevent workflow disruption, transparent explanations for non-expert staff, strict regulatory compliance prohibiting sensitive data exposure to external APIs, and robustness against AI-generated attacks. Existing rule-based systems are brittle to novel campaigns, while LLM-based detectors violate privacy regulations through unredacted data transmission. We introduce CyberCane, a neuro-symbolic framework integrating deterministic symbolic analysis with privacy-preserving retrieval-augmented generation (RAG). Our dual-phase pipeline applies lightweight symbolic rules to email metadata, then escalates borderline cases to semantic classification via RAG with automated sensitive data redaction and retrieval from a phishing-only corpus. We further introduce PhishOnt, an OWL ontology enabling verifiable attack classification through formal reasoning chains. Evaluation on DataPhish2025 (12.3k emails; mixed human/LLM) and Nazario/SpamAssassin demonstrates a 78.6-point recall gain over symbolic-only detection on AI-generated threats, with precision exceeding 98% and FPR as low as 0.16%. Healthcare deployment projects a 542x ROI; tunable operating points support diverse risk tolerances, with open-source implementation at https://github.com/sbhakim/Cybercane.
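As a rough illustration of the second phase, the sketch below redacts sensitive values before any text leaves the local environment and then retrieves the closest entries from a phishing-only corpus. The regex patterns, placeholder tokens, and token-overlap (Jaccard) retrieval are assumptions made for this example; the paper's own redaction rules and retriever are not specified in the abstract and are likely more sophisticated.

```python
import re

# Hypothetical redaction patterns; a real deployment would need a far broader set.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]


def redact(text: str) -> str:
    """Replace each sensitive match with a fixed placeholder token."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def tokens(text: str) -> set:
    """Lowercased alphabetic tokens, used for a simple overlap score."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Rank corpus entries by Jaccard similarity to the (already redacted) query."""
    q = tokens(query)

    def jaccard(doc: str) -> float:
        d = tokens(doc)
        return len(q & d) / max(1, len(q | d))

    return sorted(corpus, key=jaccard, reverse=True)[:k]


# Toy phishing-only corpus (invented examples).
corpus = [
    "verify your account password now",
    "invoice attached for wire transfer",
]
safe_text = redact("Please verify your password, jane@x.org")
print(safe_text)
print(retrieve(safe_text, corpus, k=1))
```

Only `safe_text` and the retrieved snippets would be passed to a downstream language model in this sketch, so the original address never appears in the prompt.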
Problem

Research questions and friction points this paper is trying to address.

phishing detection
privacy preservation
false positive rate
regulatory compliance
AI-generated attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuro-Symbolic AI
Privacy-Preserving RAG
Formal Ontology Reasoning
Phishing Detection
Sensitive Data Redaction