Scaling sparse feature circuit finding for in-context learning

📅 2025-04-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the neural mechanisms underlying in-context learning (ICL) in large language models (LLMs), focusing on interpretable, task-aware representations for both task detection and task execution. We propose a cross-layer sparse circuit analysis framework based on sparse autoencoders (SAEs) and, on the 2B-parameter Gemma-1 model, are the first to integrate SAE feature decomposition with causal interventions (e.g., targeted perturbations of attention or MLP sublayers) to identify task-detection features that emerge *prior* to task execution and to trace their causal propagation toward the task vector through multiple attention and MLP sublayers. Key contributions include: (1) demonstrating that ICL task vectors can be sparsely reconstructed using only a handful of SAE latents; (2) identifying human-interpretable feature pairs corresponding to task detection and execution; and (3) constructing the first multi-layer, causally validated sparse feature circuit for ICL.

๐Ÿ“ Abstract
Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility for addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model's knowledge of which task to execute and (ii) have latent vectors that causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to the much larger Gemma-1 2B model, with 30 times as many parameters, and to the more complex task of ICL. Through circuit finding, we discover task-detecting features whose SAE latents activate earlier in the prompt and detect when a task has been performed; these are causally linked to the task-execution features through the attention and MLP sublayers.
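The claim that task vectors are "well approximated by a sparse sum of SAE latents" can be illustrated with a minimal sketch: select the few SAE decoder directions that project most strongly onto a task vector, then fit coefficients for just those directions. This is an illustrative toy with a random decoder and synthetic task vector, not the paper's actual procedure or data; all names here (`sparse_task_vector_approximation`, the matrix shapes) are hypothetical.

```python
import numpy as np

def sparse_task_vector_approximation(task_vector, decoder, k=8):
    """Approximate a task vector as a sparse combination of SAE decoder rows.

    task_vector: (d,) residual-stream vector to reconstruct.
    decoder: (n_latents, d) SAE decoder matrix; each row is one latent's direction.
    k: number of latents retained in the sparse reconstruction.
    Returns (reconstruction, indices of the selected latents).
    """
    # Score each latent by the magnitude of its projection onto the task vector.
    scores = decoder @ task_vector
    top = np.argsort(-np.abs(scores))[:k]
    # Least-squares coefficients for the k selected decoder directions.
    coeffs, *_ = np.linalg.lstsq(decoder[top].T, task_vector, rcond=None)
    reconstruction = decoder[top].T @ coeffs
    return reconstruction, top

# Toy example: a "task vector" built from 3 known latents of a random decoder.
rng = np.random.default_rng(0)
d, n_latents = 256, 1024
decoder = rng.normal(size=(n_latents, d))
true_latents = [3, 17, 42]
task_vector = decoder[true_latents].sum(axis=0)

recon, idx = sparse_task_vector_approximation(task_vector, decoder, k=3)
rel_err = np.linalg.norm(task_vector - recon) / np.linalg.norm(task_vector)
```

In this synthetic setting, the top-k projection scores recover the constituent latents and the reconstruction error is near zero; on real activations the interesting finding is that a small k already suffices.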
Problem

Research questions and friction points this paper is trying to address.

Understanding in-context learning mechanisms using sparse autoencoders
Identifying task-execution features in large language models
Scaling sparse feature circuits for complex tasks in Gemma-1 2B
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using SAE latents to sparsely reconstruct ICL task vectors
Identifying causally linked task-detection and task-execution features
Scaling sparse feature circuit finding to the 30x larger Gemma-1 2B