Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neural network interpretability faces two key bottlenecks: insufficient robustness and the restrictive “monosemanticity” assumption—that each neuron encodes only one concept—despite empirical evidence of widespread polysemanticity, leading to incomplete semantic coverage and distorted explanations. To address this, we propose PRISM, the first interpretability framework that explicitly models neurons’ capacity to encode multiple concepts. PRISM abandons the monosemantic description paradigm, enabling fine-grained, multi-label semantic characterization and introducing a quantifiable polysemanticity score. By jointly analyzing feature activation patterns and concept datasets—augmented with a confidence-weighted multi-label generation mechanism—PRISM significantly improves descriptive accuracy and faithfulness in language models. It achieves state-of-the-art performance in both polysemanticity identification and holistic description quality, outperforming existing methods across multiple benchmarks.

📝 Abstract
Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Current feature description methods face two critical challenges: limited robustness and the flawed assumption that each neuron encodes only a single concept (monosemanticity), despite growing evidence that neurons are often polysemantic. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features. We apply PRISM to language models and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).
Problem

Research questions and friction points this paper is trying to address.

Addressing limited robustness in neural network feature descriptions
Overcoming flawed monosemanticity assumption in feature interpretation
Capturing polysemanticity to improve model behavior understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

PRISM captures polysemantic neural network features
Provides nuanced descriptions for multiple concepts
Improves accuracy and faithfulness of feature descriptions
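To make the idea of a quantifiable polysemanticity score concrete, the sketch below assigns a feature a score of 0 when its activation mass concentrates on a single concept and a score near 1 when it spreads evenly over many concepts. This entropy-based formulation, including the function name and the use of concept weights, is a hypothetical illustration, not PRISM's actual scoring method:

```python
import math

def polysemanticity_score(concept_weights):
    """Normalized entropy of a feature's concept weights.

    Returns 0.0 for a purely monosemantic feature (all mass on
    one concept) and approaches 1.0 as activation mass spreads
    evenly over many concepts.
    """
    total = sum(concept_weights)
    probs = [w / total for w in concept_weights if w > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))  # normalize to [0, 1]

# A feature dominated by one concept scores low; a feature
# splitting evenly over four concepts scores 1.0.
print(polysemanticity_score([0.97, 0.01, 0.01, 0.01]))
print(polysemanticity_score([0.25, 0.25, 0.25, 0.25]))
```

Under this toy definition, ranking features by the score surfaces the most polysemantic ones, which are exactly the cases where a single-concept description would be misleading.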
Laura Kopf
TU Berlin, Germany; BIFOLD, Germany
Nils Feldhus
TU Berlin, BIFOLD, DFKI (Guest)
Natural Language Processing · Interpretability · Explainable AI
Kirill Bykov
TU Munich
Machine Learning · Explainable AI · Interpretable ML · Mechanistic Interpretability
P. Bommer
UMI Lab, ATB Potsdam, Germany
Anna Hedström
TU Berlin, Germany; Fraunhofer Heinrich-Hertz-Institute, Germany
M. Höhne
University of Potsdam, Germany
Oliver Eberle
TU Berlin
Explainable AI · Interpretability · Deep Learning · Machine Learning · NLP