🤖 AI Summary
Features extracted from neural network activations by sparse autoencoders (SAEs) tend to be interpretable mainly at high activation strengths, and the continuous activation values themselves introduce semantic ambiguity and polysemanticity.
Method: We propose the Binary Autoencoder (BAE) and Binary Transcoder (BTC), which enforce hard binary constraints on latent activations—restricting them strictly to {0,1}—thereby yielding discrete, unambiguous feature representations.
Contribution/Results: Binarization increases reconstruction error but substantially improves the monosemanticity and human interpretability of highly activated features, and the analysis suggests that polysemanticity stems partly from the inherent ambiguity of continuous activations. However, binarization also induces a subset of uninterpretable ultra-high-frequency features; once interpretability scores are frequency-adjusted, continuous sparse coders score slightly better than binary ones. This suggests that polysemanticity may be an ineliminable property of neural representations.
📝 Abstract
Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue, we propose to use binary sparse autoencoders (BAEs) and binary transcoders (BTCs), which constrain all activations to be zero or one. We find that binarisation significantly improves the interpretability and monosemanticity of the discovered features, while increasing reconstruction error. By eliminating the distinction between high and low activation strengths, we prevent uninterpretable information from being smuggled in through the continuous variation in feature activations. However, we also find that binarisation increases the number of uninterpretable ultra-high frequency features, and when interpretability scores are frequency-adjusted, the scores for continuous sparse coders are slightly better than those of binary ones. This suggests that polysemanticity may be an ineliminable property of neural activations.
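To make the core idea concrete, here is a minimal NumPy sketch of the forward pass such a binary autoencoder might use. All names, layer sizes, and the zero threshold are illustrative assumptions, not the paper's actual architecture; training the non-differentiable hard threshold would additionally require something like a straight-through estimator, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: model activations of size 8, 32 latent features.
d_model, d_latent = 8, 32

# Randomly initialised weights, purely for illustration.
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)

def bae_forward(x):
    """Binary autoencoder forward pass: latents are hard {0, 1}.

    Unlike a standard SAE (e.g. ReLU latents), the latent code carries
    no magnitude information, only which features fire.
    """
    pre = x @ W_enc + b_enc               # continuous pre-activations
    z = (pre > 0.0).astype(x.dtype)       # hard binarisation (assumed threshold 0)
    x_hat = z @ W_dec + b_dec             # reconstruction from the binary code
    return z, x_hat

x = rng.normal(size=(4, d_model))
z, x_hat = bae_forward(x)
assert set(np.unique(z)) <= {0.0, 1.0}    # activations are strictly binary
```

Because `z` is restricted to {0, 1}, a feature cannot "smuggle in" extra information through its activation strength, which is the mechanism the abstract identifies as a source of uninterpretable continuous variation.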