AI Summary
This work addresses the interpretability challenge posed by distributed representations in neural networks. We propose Activation Spectroscopy (ActSpec), a method that casts the sub-network between a chosen layer and an output logit as a pseudo-Boolean function over that layer's activation patterns, and quantifies the joint contribution of neuron subsets to the output through the function's Fourier coefficients. To find coefficients that are both high-valued and non-redundant, we design a constrained combinatorial optimization procedure that extends the Goldreich-Levin algorithm. The approach is verified in synthetic settings, compared against existing interpretability benchmarks, and evaluated on an MNIST classifier and a Transformer-based sentiment analysis model. The resulting neuron subsets are used to test how distributed a representation is and to trace the joint influence of neurons, connecting harmonic analysis with neural network interpretability.
Abstract
In the study of neural network interpretability, there is growing evidence that relevant features are encoded across many neurons in a distributed fashion. Making sense of these distributed representations without knowledge of the network's encoding strategy is a combinatorial task that is not guaranteed to be tractable. This work explores one feasible path to both detecting and tracing the joint influence of neurons in a distributed representation. We term this approach Activation Spectroscopy (ActSpec), owing to its analysis of the pseudo-Boolean Fourier spectrum defined over the activation patterns of a network layer. The sub-network defined between a given layer and an output logit is cast as a special class of pseudo-Boolean function. The contribution of each subset of neurons in the specified layer can then be quantified through the function's Fourier coefficients. We propose a combinatorial optimization procedure to search for Fourier coefficients that are simultaneously high-valued and non-redundant. This procedure can be viewed as an extension of the Goldreich-Levin algorithm that incorporates additional problem-specific constraints. The resulting coefficients specify a collection of subsets, which are used to test the degree to which a representation is distributed. We verify our approach in a number of synthetic settings and compare against existing interpretability benchmarks. We conclude with experimental evaluations on an MNIST classifier and a transformer-based network for sentiment analysis.
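To make the notion of a pseudo-Boolean Fourier coefficient concrete, the following is a minimal brute-force sketch, not the search procedure described above. It assumes activation patterns are encoded as vectors in {-1, +1}^n (for example, neuron off/on), and the names `fourier_coefficient`, `spectrum`, and `toy_logit` are hypothetical, introduced only for illustration. The coefficient for a subset S is the average of f(x) times the product of x_i over i in S, taken over all 2^n patterns, so exhaustive enumeration is feasible only for very small n; that cost is what motivates a Goldreich-Levin style search.

```python
from itertools import combinations, product


def fourier_coefficient(f, n, subset):
    """Brute-force Fourier coefficient of f: {-1,+1}^n -> R for one subset.

    Averages f(x) * prod_{i in subset} x[i] over all 2^n input patterns.
    """
    total = 0.0
    for x in product((-1, 1), repeat=n):
        chi = 1  # parity (character) function for this subset
        for i in subset:
            chi *= x[i]
        total += f(x) * chi
    return total / 2 ** n


def spectrum(f, n):
    """Map every subset of {0, ..., n-1} to its Fourier coefficient."""
    return {
        s: fourier_coefficient(f, n, s)
        for k in range(n + 1)
        for s in combinations(range(n), k)
    }


# Hypothetical stand-in for a layer-to-logit sub-network (an assumption):
# neuron 0 acts alone, while neurons 1 and 2 only matter jointly.
def toy_logit(x):
    return 0.5 * x[0] + 2.0 * x[1] * x[2]


coeffs = spectrum(toy_logit, 3)
# Subset with the largest-magnitude coefficient.
top = max(coeffs, key=lambda s: abs(coeffs[s]))
```

In this toy case the only non-zero coefficients belong to the subsets `(0,)` and `(1, 2)`, and the largest one is on `(1, 2)`. This is the kind of joint contribution that a single-neuron attribution would not reveal.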