The Knowledge Microscope: Features as Better Analytical Lenses than Neurons

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of using individual MLP neurons as analytical units in large language models—namely, polysemy in factual knowledge representation, poor interpretability, and weak privacy protection. We propose replacing raw neurons with semantic features extracted via sparse autoencoders (SAEs) as the fundamental unit for knowledge analysis. We systematically demonstrate, for the first time, that SAE features exhibit superior unambiguity, stronger causal influence on factual recall, and enhanced capability for privacy erasure compared to neurons. Building on this, we introduce FeatureEdit—a method enabling efficient and precise knowledge editing and sensitive information removal. Experiments show that feature-level modeling significantly outperforms neuron-level approaches across knowledge representation fidelity, semantic interpretability, and privacy preservation. Notably, our approach achieves substantial gains over state-of-the-art methods in factual knowledge erasure tasks.

Technology Category

Application Category

📝 Abstract
Previous studies primarily utilize MLP neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs); however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAEs) can effectively decompose neurons into features, which serve as alternative analytical units. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Features achieve better privacy protection than neurons, demonstrated through our proposed FeatureEdit method, which significantly outperforms existing neuron-based approaches in erasing privacy-sensitive information from LMs. Code and dataset will be made available.
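The abstract's core move is decomposing a dense MLP activation vector into a sparse, overcomplete set of feature activations via an SAE. The paper does not specify its architecture in this summary, so the following is a minimal NumPy sketch of the standard SAE forward pass (ReLU encoder, linear decoder); the dimensions and weight initialization are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# The feature dictionary is overcomplete: d_feat >> d_model (sizes assumed)
d_model, d_feat = 8, 32
W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_encode(x):
    # ReLU zeroes out negatively projected directions, giving a sparse code
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f):
    # Reconstruct the original activation from the active features
    return f @ W_dec + b_dec

x = rng.normal(size=(1, d_model))   # stand-in for one MLP activation vector
f = sae_encode(x)                   # sparse feature activations
x_hat = sae_decode(f)               # approximate reconstruction of x

sparsity = float((f > 0).mean())    # fraction of features that fired
```

In training, the reconstruction error plus an L1 penalty on `f` drives each feature toward a single interpretable direction, which is what the paper argues makes features more monosemantic than raw neurons.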
Problem

Research questions and friction points this paper is trying to address.

Can features improve knowledge expression in LMs beyond what neurons capture?
Do features offer better interpretability and monosemanticity than neurons?
Can features provide superior privacy protection in LMs?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders decompose neurons
Features enhance interpretability and monosemanticity
FeatureEdit improves privacy protection
Yuheng Chen
Elmore Family School of Electrical and Computer Engineering, Purdue University
Inverse Design · Nanophotonics · Machine Learning · Simulation
Pengfei Cao
The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Kang Liu
The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jun Zhao
The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China