🤖 AI Summary
Features extracted from neural network activations by sparse autoencoders (SAEs) tend to be interpretable mainly at high activation strengths, and the continuous activation values themselves introduce semantic ambiguity and polysemanticity.
Method: We propose the Binary Autoencoder (BAE) and Binary Transcoder (BTC), which enforce hard binary constraints on latent activations—restricting them strictly to {0,1}—thereby yielding discrete, unambiguous feature representations.
Contribution/Results: Binarization increases reconstruction error but substantially improves the monosemanticity and human interpretability of highly activated features, and the analysis suggests that polysemanticity stems partly from the inherent ambiguity of continuous activations. However, binarization also induces a subset of uninterpretable ultra-high-frequency features; once interpretability scores are frequency-adjusted, continuous sparse coders score slightly better than binary ones. This suggests that polysemanticity may be an ineliminable property of neural representations.
📝 Abstract
Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue, we propose to use binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain all activations to be zero or one. We find that binarisation significantly improves the interpretability and monosemanticity of the discovered features, while increasing reconstruction error. By eliminating the distinction between high and low activation strengths, we prevent uninterpretable information from being smuggled in through the continuous variation in feature activations. However, we also find that binarisation increases the number of uninterpretable ultra-high frequency features, and when interpretability scores are frequency-adjusted, the scores for continuous sparse coders are slightly better than those of binary ones. This suggests that polysemanticity may be an ineliminable property of neural activations.
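To make the core idea concrete, here is a minimal NumPy sketch of the forward pass such a binary autoencoder might use. All names, layer sizes, and the zero threshold are illustrative assumptions, not the paper's actual architecture; training the non-differentiable hard threshold would additionally require something like a straight-through estimator, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: model activations of size 8, 32 latent features.
d_model, d_latent = 8, 32

# Randomly initialised weights, purely for illustration.
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)

def bae_forward(x):
    """Binary autoencoder forward pass: latents are hard {0, 1}.

    Unlike a standard SAE (e.g. ReLU latents), the latent code carries
    no magnitude information, only which features fire.
    """
    pre = x @ W_enc + b_enc               # continuous pre-activations
    z = (pre > 0.0).astype(x.dtype)       # hard binarisation (assumed threshold 0)
    x_hat = z @ W_dec + b_dec             # reconstruction from the binary code
    return z, x_hat

x = rng.normal(size=(4, d_model))
z, x_hat = bae_forward(x)
assert set(np.unique(z)) <= {0.0, 1.0}    # activations are strictly binary
```

Because `z` is restricted to {0, 1}, a feature cannot "smuggle in" extra information through its activation strength, which is the mechanism the abstract identifies as a source of uninterpretable continuous variation.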