🤖 AI Summary
Standard Transformers entangle all computation in a single residual stream, functionally coupling components and limiting interpretability. This work proposes the Dual-Stream Transformer, which explicitly decouples the residual stream into a token stream, updated by attention, and a context stream, updated by the feed-forward network. A hierarchy of mixing strategies (independent, Kronecker, and dense) regulates information flow across attention heads. On language modeling at 29M parameters, the Kronecker mixing variant incurs only a 2.5% increase in validation loss over the dense baseline, and all configurations remain capable of generation even when attention logits are amplified by up to 16× at inference. These results suggest the models learn discrete algorithms and that internal mechanisms can be made interpretable by design without substantial performance degradation.
📝 Abstract
Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. Information flow between attention heads is controlled through a hierarchy of mixing strategies, from fully independent (maximum interpretability) to dense (standard transformer behavior). This design exposes a tunable tradeoff between interpretability and performance. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5%. All configurations maintain functional generation under attention amplification (scaling logits by factors up to 16 at inference time), with degradation ranging from 16% to 27%. This robustness suggests the architectures learn discrete algorithms that operate independently of soft probabilistic mixing. The architecture provides a foundation for interpretable language models where internal structure is exposed by design. (This work was partially supported by DARPA Contract HR001125C0302.)
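The mixing hierarchy can be pictured as structural constraints on the attention output projection. Below is a minimal NumPy sketch of one plausible reading, not the paper's implementation: the sizes (4 heads, 8 dims per head), the block-diagonal form for independent mixing, and the Kronecker-with-identity form (`M ⊗ I`, giving one free scalar per head pair) are all assumptions for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
H, d_head = 4, 8
d_model = H * d_head
rng = np.random.default_rng(0)

head_out = rng.standard_normal((H, d_head))  # per-head attention outputs
x = head_out.reshape(-1)                     # concatenated heads, shape (d_model,)

# Independent: block-diagonal projection; each head writes only its own slice.
W_ind = np.zeros((d_model, d_model))
for h in range(H):
    s = slice(h * d_head, (h + 1) * d_head)
    W_ind[s, s] = rng.standard_normal((d_head, d_head))

# Kronecker (assumed form): an H x H matrix of scalars mixes heads, and the
# Kronecker product with I_{d_head} preserves within-head structure.
M = rng.standard_normal((H, H))
W_kron = np.kron(M, np.eye(d_head))          # only H*H free cross-head scalars

# Dense: unconstrained projection, as in a standard transformer.
W_dense = rng.standard_normal((d_model, d_model))

y_kron = W_kron @ x
# Under this form, head h's output slice is a scalar-weighted sum of heads:
#   y_kron[h] == sum_g M[h, g] * head_out[g]
```

The free-parameter counts make the hierarchy concrete: the block-diagonal case has `H * d_head**2` entries, the Kronecker case only `H * H`, and the dense case `d_model**2`.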