Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

📅 2026-06-06

🤖 AI Summary

Existing mechanistic interpretability methods typically explain a single prompt–output pair, which makes it hard to see how mechanisms vary across a language model’s continuation distribution. This work proposes an unsupervised feature discovery approach that clusters sampled continuations using two views of each one: a semantic embedding and a prefix-to-continuation attribution signature. No human-specified target outputs are required, so semantics and mechanisms are aligned at the level of the distribution rather than a single example. The method optimizes a rate–distortion objective that balances semantic coherence, mechanistic consistency, and cluster granularity, and it surfaces continuation modes that single-view baselines miss. Steering experiments provide interventional evidence that the learned cluster signatures correspond to actionable internal factors, supporting scalable auditing of model behavior.

📝 Abstract

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.
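The two-view clustering idea in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the abstract does not give the exact objective, the attribution method, or the optimizer. Everything below is an assumption made for illustration, including the function names (`cluster_two_views`, `farthest_point_init`), the squared Euclidean distortion, the weight `lam` between the semantic and attribution views, the inverse temperature `beta` on an entropy-style rate term, the hard assignments, and the synthetic data standing in for real embeddings and attribution signatures.

```python
import numpy as np


def sq_dists(x, centers):
    """Return the (n, k) matrix of squared Euclidean distances."""
    return ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)


def joint_distortion(sem, attr, sem_c, attr_c, lam):
    """Blend the two views; lam weights the semantic view (assumed form)."""
    return lam * sq_dists(sem, sem_c) + (1.0 - lam) * sq_dists(attr, attr_c)


def farthest_point_init(sem, attr, k, lam):
    """Deterministic seeding: repeatedly pick the point farthest from all seeds."""
    chosen = [0]
    for _ in range(k - 1):
        d = joint_distortion(sem, attr, sem[chosen], attr[chosen], lam).min(axis=1)
        chosen.append(int(d.argmax()))
    return np.array(chosen)


def cluster_two_views(sem, attr, k=6, lam=0.5, beta=2.0, iters=100):
    """Hard-assignment clustering with a rate penalty of -log(prior) / beta.

    Clusters that lose all their points get an infinite penalty and are never
    reused, so the number of active clusters can shrink below k.
    """
    n = sem.shape[0]
    seeds = farthest_point_init(sem, attr, k, lam)
    sem_c, attr_c = sem[seeds].copy(), attr[seeds].copy()
    prior = np.full(k, 1.0 / k)
    labels = np.full(n, -1)
    for _ in range(iters):
        with np.errstate(divide="ignore"):  # log(0) -> -inf for empty clusters
            rate = -np.log(prior) / beta
        cost = joint_distortion(sem, attr, sem_c, attr_c, lam) + rate
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        counts = np.bincount(labels, minlength=k)
        prior = counts / n
        for j in np.flatnonzero(counts):  # update only non-empty clusters
            sem_c[j] = sem[labels == j].mean(axis=0)
            attr_c[j] = attr[labels == j].mean(axis=0)
    return labels


# Synthetic stand-ins (hypothetical): three continuation modes of 30 samples each.
# Modes 0 and 1 share identical semantic vectors but differ in attribution;
# mode 2 differs from mode 0 in semantics only.
rng = np.random.default_rng(0)
m = 30
sem_shared = rng.normal(0.0, 0.1, size=(m, 2))
sem = np.vstack([sem_shared, sem_shared,
                 rng.normal(0.0, 0.1, size=(m, 2)) + [10.0, 0.0]])
attr = np.vstack([rng.normal(0.0, 0.1, size=(m, 2)),
                  rng.normal(0.0, 0.1, size=(m, 2)) + [10.0, 0.0],
                  rng.normal(0.0, 0.1, size=(m, 2))])
mode = np.repeat([0, 1, 2], m)

joint = cluster_two_views(sem, attr, lam=0.5)     # both views
sem_only = cluster_two_views(sem, attr, lam=1.0)  # semantic view only
```

In this toy setup, modes 0 and 1 have the same semantic vectors by construction, so the semantic-only run (`lam=1.0`) necessarily assigns each pair of twins to the same cluster, while the joint run keeps the two mechanisms in separate clusters. That mirrors, in miniature, the abstract's claim that single-view baselines can miss continuation modes; it says nothing about how the method behaves on real model data.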

Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability

circuit analysis

continuation distribution

unsupervised feature discovery

attribution signatures

Innovation

Methods, ideas, or system contributions that make the work stand out.

unsupervised feature discovery

mechanistic interpretability

semantic-mechanistic alignment