I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the black-box nature of large language model (LLM) reasoning mechanisms by proposing a neuron-level interpretability framework based on sparse autoencoders (SAEs). Specifically, it systematically identifies, localizes, and intervenes on general-purpose "reasoning features" that drive complex reasoning in the DeepSeek-R1 family of models, the first such effort for this architecture. Methodologically, the framework integrates SAE-based representation disentanglement, feature interpretability evaluation, cross-layer activation localization, and causal intervention experiments, yielding the first *interventional* reasoning-attribution framework for LLMs. Key contributions include: (i) discovery of multiple semantically consistent reasoning features shared across layers; and (ii) demonstration that targeted amplification of these features improves average accuracy by 9.2% on multi-step reasoning tasks. These results establish a new paradigm for mechanistic understanding and controllable enhancement of LLM reasoning capabilities.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the development of a new class of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method to learn a sparse decomposition of latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate "reasoning features" from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
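To make the SAE idea concrete, here is a minimal sketch of a sparse autoencoder forward pass and its training objective. The dimensions, weight initialization, and `l1_coeff` are illustrative assumptions, not taken from the paper's code; a trained SAE would learn `W_enc`/`W_dec` by minimizing the loss shown.

```python
import numpy as np

# Minimal SAE sketch: decompose a d_model-dimensional hidden activation
# into an overcomplete set of d_feat non-negative sparse features.
rng = np.random.default_rng(0)
d_model, d_feat = 16, 64  # illustrative sizes; real SAEs are far wider

W_enc = rng.normal(0, 0.1, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_feat, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activation x into sparse features f, then reconstruct x."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)          # stand-in for a model activation
f, x_hat = sae_forward(x)

# Training minimizes reconstruction error plus an L1 sparsity penalty,
# which pushes most features to zero so each one stays interpretable.
l1_coeff = 1e-3
loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
print(f"active features: {(f > 0).sum()}/{d_feat}")
```

The L1 term is what makes the learned dictionary sparse: for any given activation, only a handful of features fire, so individual features (such as the paper's candidate "reasoning features") can be inspected in isolation.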
Problem

Research questions and friction points this paper is trying to address.

Exploring internal reasoning mechanisms in Large Language Models
Identifying reasoning features using Sparse Autoencoders
Enhancing reasoning performance by steering identified features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using Sparse Autoencoders for interpretable features
Extracting reasoning features from SAE representations
Steering features to enhance reasoning performance
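The steering step in the last bullet can be sketched as adding a scaled copy of a feature's decoder direction to the model's hidden activation. The feature index and `coeff` below are hypothetical placeholders; the paper selects validated reasoning features and tunes the steering strength.

```python
import numpy as np

# Hedged sketch of SAE feature steering: shift a hidden activation h
# along the (unit-normalized) decoder direction of one chosen feature.
rng = np.random.default_rng(1)
d_model, d_feat = 16, 64
W_dec = rng.normal(0, 0.1, (d_feat, d_model))  # stand-in trained decoder

def steer(h, feature_idx, coeff):
    """Amplify one SAE feature by adding its decoder direction to h."""
    direction = W_dec[feature_idx]
    return h + coeff * direction / np.linalg.norm(direction)

h = rng.normal(size=d_model)                   # stand-in model activation
h_steered = steer(h, feature_idx=7, coeff=4.0)
print(np.linalg.norm(h_steered - h))           # shift magnitude equals coeff
```

In practice this intervention is applied at a chosen layer during generation (e.g. via forward hooks), so that amplifying a reasoning feature changes downstream computation rather than a single static vector.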