Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

📅 2026-05-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Although features extracted by sparse autoencoders are more interpretable than individual neurons, characterizing them reliably remains difficult. This work proposes Query Lens, a method that extends Logit Lens by accounting for the indirect effects that arise when downstream modules process a feature, in addition to the direct effect that Logit Lens captures. By jointly analyzing encoder-side key features and decoder-side value features, Query Lens identifies both the inputs that activate a given feature and the outputs it promotes. In the reported experiments, it yields coherent token signatures for sparse features that remain uninterpretable under Logit Lens. The work also proposes the "Subspace Channel Hypothesis", which suggests that downstream modules read features through layer-specific subspaces.
📝 Abstract
While sparse autoencoders provide features more interpretable than individual neurons, reliably characterizing them remains challenging. We propose Query Lens, which extends Logit Lens to enable more comprehensive and faithful interpretations of sparse features. By jointly considering encoder-side key features and decoder-side value features, we identify both the inputs that activate a feature and the outputs it promotes. We also account for indirect, module-mediated effects that arise when the feature is processed by downstream modules, going beyond the direct effect captured by Logit Lens. In experiments, we find that Query Lens yields coherent token signatures for features that remain uninterpretable under Logit Lens. Finally, we propose the Subspace Channel Hypothesis, suggesting that downstream modules read features through layer-specific subspaces.
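
The abstract does not spell out the computation, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a toy setting in which a sparse feature has an encoder direction `w_enc` and a decoder direction `w_dec`, the model has an embedding matrix `W_E` and an unembedding matrix `W_U`, and a downstream module can be approximated by a single linear map `M`. All of these names, shapes, and the linear approximation are hypothetical and chosen for illustration.

```python
import numpy as np

def top_tokens(scores, vocab, k=2):
    """Return the k tokens with the highest scores, best first."""
    idx = np.argsort(scores)[::-1][:k]
    return [vocab[i] for i in idx]

def key_signature(w_enc, W_E, vocab, k=2):
    # Input side (assumed): tokens whose embeddings align most with
    # the encoder direction, i.e. inputs likely to activate the feature.
    return top_tokens(W_E @ w_enc, vocab, k)

def value_signature(w_dec, W_U, vocab, k=2):
    # Direct effect, Logit Lens style: project the decoder direction
    # straight through the unembedding matrix.
    return top_tokens(w_dec @ W_U, vocab, k)

def indirect_signature(w_dec, M, W_U, vocab, k=2):
    # Indirect effect (assumed linear stand-in M for a downstream module):
    # transform the decoder direction first, then unembed.
    return top_tokens((w_dec @ M) @ W_U, vocab, k)

# Toy data: 4 tokens, 4-dimensional residual stream, identity (un)embeddings.
vocab = ["cat", "dog", "car", "bus"]
W_E = np.eye(4)                          # (vocab, d_model)
W_U = np.eye(4)                          # (d_model, vocab)
w_enc = np.array([1.0, 0.5, 0.0, 0.0])   # feature reads "cat"/"dog" inputs
w_dec = np.array([0.0, 0.0, 1.0, 0.5])   # feature writes toward "car"/"bus"

# Hypothetical downstream module that swaps dimensions 2 and 3.
M = np.eye(4)[[0, 1, 3, 2]]

print("key:     ", key_signature(w_enc, W_E, vocab))
print("direct:  ", value_signature(w_dec, W_U, vocab))
print("indirect:", indirect_signature(w_dec, M, W_U, vocab))
```

In this toy case the direct and indirect signatures rank the same two tokens in opposite order, which illustrates why a reading based only on the direct effect could differ from one that also follows a feature through a downstream module.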
Problem

Research questions and friction points this paper is trying to address.

sparse autoencoders
feature interpretability
indirect effects
key-value features
downstream modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Query Lens
sparse autoencoders
indirect effects
feature interpretability
subspace channel hypothesis