Evaluating Sparse Autoencoders for Monosemantic Representation

📅 2025-08-20
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the polysemanticity of neurons in large language models (LLMs) by systematically evaluating sparse autoencoders (SAEs) for enhancing conceptual monosemanticity. To quantify concept separability, we propose a fine-grained metric based on the Jensen-Shannon distance, revealing a non-monotonic relationship between sparsity and separability. We further introduce APP (Attenuation via Posterior Probabilities), a probabilistic intervention method that suppresses target concepts with higher precision, and propose partial neuron suppression to improve controllability. Experiments on Gemma-2-2B across five benchmarks demonstrate that SAEs significantly reduce neuron polysemanticity and improve concept separability; APP outperforms existing intervention methods in targeted concept removal; and partial suppression achieves a favorable trade-off between efficacy and interpretability. Our core contributions include: (1) a novel, information-theoretic metric for quantifying concept separability; (2) a probabilistic, interpretable intervention paradigm; and (3) empirical evidence clarifying the nuanced interplay between sparsity and conceptual disentanglement.

๐Ÿ“ Abstract
A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, there has been no quantitative comparison with their base models. This paper provides the first systematic evaluation of SAEs against base models concerning monosemanticity. We introduce a fine-grained concept separability score based on the Jensen-Shannon distance, which captures how distinctly a neuron's activation distributions vary across concepts. Using Gemma-2-2B and multiple SAE variants across five benchmarks, we show that SAEs reduce polysemanticity and achieve higher concept separability. However, greater sparsity of SAEs does not always yield better separability and often impairs downstream performance. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP outperforms existing approaches in targeted concept removal.
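As a rough illustration of the separability score described in the abstract, the sketch below computes a Jensen-Shannon distance between a neuron's activation histograms under two concepts. The binning scheme, function names, and use of base-2 logarithms are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance: square root of the base-2 JS divergence,
    so values lie in [0, 1]."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(max(jsd, 0.0))  # guard against tiny negative from eps

def separability_score(acts_a, acts_b, bins=32):
    """Hypothetical concept separability for one neuron: JS distance between
    its activation histograms on inputs from concept A vs. concept B."""
    lo = min(acts_a.min(), acts_b.min())
    hi = max(acts_a.max(), acts_b.max())
    p, _ = np.histogram(acts_a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(acts_b, bins=bins, range=(lo, hi))
    return js_distance(p.astype(float), q.astype(float))
```

A score near 0 means the neuron fires indistinguishably across the two concepts (polysemantic with respect to that pair); a score near 1 means the concept-conditioned activation distributions are almost disjoint.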
Problem

Research questions and friction points this paper is trying to address.

Evaluating sparse autoencoders for reducing neuron polysemanticity in language models
Quantitatively comparing SAEs with base models using concept separability metrics
Assessing practical utility of SAEs for precise concept-level intervention control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders transform dense activations into sparse features
Introduces Jensen-Shannon distance based concept separability score
Proposes Attenuation via Posterior Probabilities for targeted suppression
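The posterior-based suppression idea can be made concrete with a toy sketch. Assuming APP scales each activation by the posterior probability that it reflects the target concept (estimated here from two concept-conditioned histograms; the density fit, uniform prior, and `strength` knob are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def posterior_concept(acts, hist_target, hist_other, edges, prior=0.5, eps=1e-12):
    """P(target concept | activation) via Bayes' rule over two fitted
    activation histograms (hypothetical density estimate)."""
    idx = np.clip(np.searchsorted(edges, acts, side="right") - 1,
                  0, len(hist_target) - 1)
    p_t = hist_target[idx] + eps
    p_o = hist_other[idx] + eps
    return (prior * p_t) / (prior * p_t + (1.0 - prior) * p_o)

def partial_suppress(acts, hist_target, hist_other, edges, strength=1.0):
    """Attenuate each activation in proportion to its posterior of
    carrying the target concept; strength < 1 gives partial suppression."""
    post = posterior_concept(acts, hist_target, hist_other, edges)
    return acts * (1.0 - strength * post)
```

With `strength < 1` this behaves like the paper's partial suppression: target-concept activations are attenuated rather than zeroed, trading some removal efficacy for controllability, whereas full neuron masking corresponds to zeroing the feature regardless of the posterior.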