🤖 AI Summary
Existing LLM safety analysis methods struggle to identify and interpret fine-grained safety concepts. Method: This paper introduces the first sparse autoencoder (SAE)-based interpretability framework designed specifically for fine-grained, security-relevant concepts. It decouples and localizes internal model features via SAEs and systematically identifies atomic-level neural features associated with high-risk behaviors, such as toxic generation and safety violations, through a hybrid validation pipeline that integrates concept activation analysis, clustering-based filtering, and human feedback. An extensible automated strategy substantially reduces manual annotation overhead. Contribution/Results: The method extracts a large set of interpretable, safety-related neurons, and the authors release an open-source toolkit containing trained SAE checkpoints and human-readable semantic interpretations. This provides reproducible, scalable interpretability infrastructure for empirical research on LLM safety mechanisms.
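To make the SAE step concrete: an SAE maps a dense LLM hidden activation into a wider, sparse feature vector, and "concept activation analysis" amounts to checking which of those features fire on concept-bearing inputs. The sketch below is a minimal, randomly initialized stand-in (NumPy, toy dimensions) for illustration only; the paper's actual SAEs are trained on real model activations, and all names here (`sae_encode`, `sae_decode`, `d_sae`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: d_model = width of the LLM hidden layer, d_sae = overcomplete SAE width.
# (Illustrative values; real SAEs are far larger and trained, not random.)
d_model, d_sae = 8, 32

# Randomly initialized SAE parameters, standing in for trained weights.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """ReLU encoder: maps a dense activation to sparse feature activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    """Linear decoder: reconstructs the original activation from the features."""
    return f @ W_dec + b_dec

x = rng.normal(size=(d_model,))   # one hidden activation vector from the LLM
f = sae_encode(x)                 # sparse feature vector (many entries zeroed by ReLU)
x_hat = sae_decode(f)             # approximate reconstruction of x

# Simplified concept activation analysis: rank features by how strongly they
# fire on this input; top features are candidates for human/automated labeling.
top_features = np.argsort(f)[::-1][:5]
print("active features:", int((f > 0).sum()), "of", d_sae)
print("top candidate feature indices:", top_features.tolist())
```

In the framework described above, this ranking step would be run over many safety-relevant prompts, with clustering and human feedback filtering the candidate features down to interpretable safety-related neurons.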
📝 Abstract
Increasing deployment of large language models (LLMs) in real-world applications raises significant safety concerns. Most existing safety research focuses on evaluating LLM outputs or specific safety tasks, limiting its ability to address broader, undefined risks. Sparse Autoencoders (SAEs) facilitate interpretability research that clarifies model behavior by explaining single-meaning atomic features decomposed from entangled signals. However, prior applications of SAEs do not interpret features with fine-grained safety-related concepts, and thus inadequately address safety-critical behaviors such as generating toxic responses and violating safety regulations. Rigorous safety analysis requires extracting a rich and diverse set of safety-relevant features that effectively capture these high-risk behaviors, yet this faces two challenges: identifying the SAEs with the greatest potential for generating safety-concept-specific neurons, and the prohibitively high cost of detailed feature explanation. In this paper, we propose Safe-SAIL, a framework for interpreting SAE features within LLMs to advance mechanistic understanding in safety domains. Our approach systematically identifies the SAE with the best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process. We will release a comprehensive toolkit, including SAE checkpoints and human-readable neuron explanations, which supports empirical analysis of safety risks and promotes research on LLM safety.