🤖 AI Summary
To address the prohibitive computational overhead (reaching terabytes of memory) and poor scalability of conventional interpretability methods for large language models (LLMs) processing million-token contexts, this paper proposes Sparse Tracing, an efficient interpretability framework built on dynamic sparse attention. The method introduces three key components: (1) Stream, a hierarchical pruning algorithm achieving near-linear time and linear space complexity in a single forward pass; (2) a binary-search-style refinement strategy for precise path identification; and (3) per-head sparse mask estimation to capture head-specific attention patterns. On long chain-of-thought traces, Sparse Tracing identifies reasoning-chain "thought anchors" while pruning 97–99% of token interactions; on the RULER benchmark, it preserves critical retrieval paths while discarding 90–96% of interactions and exposes cross-layer information propagation pathways from the needle to the output. It is presented as the first approach to enable scalable, fine-grained attention tracing over million-token contexts.
📝 Abstract
As Large Language Models (LLMs) scale to million-token contexts, traditional mechanistic interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long-context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97–99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90–96% of interactions and exposes layer-wise routes from the needle to the output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long-context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.
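To make the core idea concrete, here is a minimal sketch of estimating a per-head sparse mask by keeping only the top-$k$ key blocks per query block. This is an illustration of the general top-$k$ block-selection pattern, not the authors' Stream algorithm: the function name `topk_block_mask`, the mean-pooled block scoring, and the fixed block size are all assumptions for the example; Stream additionally uses hierarchical, binary-search-style refinement that is not shown here.

```python
import numpy as np

def topk_block_mask(q, k, block=4, keep=2):
    """Sketch: score key blocks against query blocks via mean-pooled
    dot products, then keep the top-`keep` key blocks per query block.
    (Illustrative only; not the paper's Stream algorithm.)"""
    Tq, d = q.shape
    Tk = k.shape[0]
    nq, nk = Tq // block, Tk // block
    # Pool queries/keys into per-block summaries (mean pooling is an assumption).
    qb = q.reshape(nq, block, d).mean(axis=1)
    kb = k.reshape(nk, block, d).mean(axis=1)
    # Block-level attention scores: (nq, nk) instead of (Tq, Tk).
    scores = qb @ kb.T / np.sqrt(d)
    # Indices of the `keep` highest-scoring key blocks for each query block.
    top = np.argsort(scores, axis=1)[:, -keep:]
    mask = np.zeros((nq, nk), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    # Expand the block mask back to token resolution.
    return np.kron(mask, np.ones((block, block), dtype=bool)).astype(bool)
```

Because scoring happens at block granularity, the dense $T \times T$ score matrix is never materialized at token resolution; with `keep=2` out of 4 key blocks, each query token attends to only half of the keys, and in long-context settings the kept fraction shrinks toward the 1–10% regime the abstract reports.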