🤖 AI Summary
This work addresses the challenge of diagnosing generalization failures in fine-tuned large language models (LLMs) on high-stakes tasks such as phishing detection. We propose a multi-layered diagnostic framework that combines SHAP value analysis with mechanistic interpretability techniques, and apply it in a cross-architecture study of Llama 3.1 8B, Gemma 2 9B, and Mistral. Our study uncovers a synergy between model architecture and data diversity, identifies architecture-dependent failure modes, and provides a concrete diagnostic methodology. Experimental results show that Gemma 2 9B achieves an F1 score above 91%, but only when trained on a stylistically diverse "generalist" dataset. Llama 3.1 8B performs well on a narrow domain but degrades significantly because it cannot integrate diverse data. Mistral, in contrast, performs consistently across multiple training paradigms.
📝 Abstract
The practice of fine-tuning Large Language Models (LLMs) has achieved state-of-the-art performance on specialized tasks, yet diagnosing why these models become brittle and fail to generalize remains a critical open problem. To address this, we introduce and apply a multi-layered diagnostic framework to a cross-architectural study. We fine-tune Llama 3.1 8B, Gemma 2 9B, and Mistral models on a high-stakes phishing detection task and use SHAP analysis and mechanistic interpretability to uncover the root causes of their generalization failures. Our investigation reveals three critical findings: (1) Generalization is driven by a powerful synergy between architecture and data diversity. The Gemma 2 9B model achieves state-of-the-art performance (>91% F1), but only when trained on a stylistically diverse "generalist" dataset. (2) Generalization is highly architecture-dependent. We diagnose a specific failure mode in Llama 3.1 8B, which performs well on a narrow domain but cannot integrate diverse data, leading to a significant performance drop. (3) Some architectures are inherently more generalizable. The Mistral model proves to be a consistent and resilient performer across multiple training paradigms. By pinpointing the flawed heuristics responsible for these failures, our work provides a concrete methodology for diagnosing and understanding generalization failures, underscoring that reliable AI requires deep validation of the interplay between architecture, data, and training strategy.