🤖 AI Summary
To address the prohibitive computational overhead (reaching terabytes of memory) and poor scalability of conventional interpretability methods for large language models (LLMs) processing million-token contexts, this paper proposes Sparse Tracing, an efficient interpretability framework built on dynamic sparse attention. The method introduces three key components: (1) Stream, a hierarchical pruning algorithm achieving near-linear time and linear space complexity in a single forward pass; (2) a binary-search-style refinement strategy for precise path identification; and (3) per-head sparse mask estimation to capture head-specific attention patterns. On long chain-of-thought traces, Sparse Tracing identifies reasoning-chain "thought anchors" while pruning 97–99% of token interactions; on the RULER benchmark, it preserves critical retrieval paths while discarding 90–96% of interactions and exposes cross-layer information propagation pathways from the needle to the output. It is presented as the first approach to enable scalable, fine-grained attention tracing over million-token contexts.
📝 Abstract
As Large Language Models (LLMs) scale to million-token contexts, traditional mechanistic interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long-context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97–99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90–96% of interactions and exposes layer-wise routes from the needle to the output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long-context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.
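To make the core idea concrete, here is a minimal sketch of estimating a per-head sparse mask by keeping only the top-$k$ key blocks per query block. This is an illustration of the general top-$k$ block-selection pattern, not the authors' Stream algorithm: the function name `topk_block_mask`, the mean-pooled block scoring, and the fixed block size are all assumptions for the example; Stream additionally uses hierarchical, binary-search-style refinement that is not shown here.

```python
import numpy as np

def topk_block_mask(q, k, block=4, keep=2):
    """Sketch: score key blocks against query blocks via mean-pooled
    dot products, then keep the top-`keep` key blocks per query block.
    (Illustrative only; not the paper's Stream algorithm.)"""
    Tq, d = q.shape
    Tk = k.shape[0]
    nq, nk = Tq // block, Tk // block
    # Pool queries/keys into per-block summaries (mean pooling is an assumption).
    qb = q.reshape(nq, block, d).mean(axis=1)
    kb = k.reshape(nk, block, d).mean(axis=1)
    # Block-level attention scores: (nq, nk) instead of (Tq, Tk).
    scores = qb @ kb.T / np.sqrt(d)
    # Indices of the `keep` highest-scoring key blocks for each query block.
    top = np.argsort(scores, axis=1)[:, -keep:]
    mask = np.zeros((nq, nk), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    # Expand the block mask back to token resolution.
    return np.kron(mask, np.ones((block, block), dtype=bool)).astype(bool)
```

Because scoring happens at block granularity, the dense $T \times T$ score matrix is never materialized at token resolution; with `keep=2` out of 4 key blocks, each query token attends to only half of the keys, and in long-context settings the kept fraction shrinks toward the 1–10% regime the abstract reports.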