State Space Models are Provably Comparable to Transformers in Dynamic Token Selection

📅 2024-05-29
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Theoretical understanding of state space models (SSMs) remains limited, particularly regarding their representational capacity when combined with nonlinear layers and their ability to perform dynamic, input-dependent token selection; existing SSM theory is largely restricted to linear or single-layer settings. Method: This work establishes, for the first time, that a single SSM layer composed with a fully connected nonlinear layer matches the Transformer's expressive power in both dynamic token selection and nonparametric regression. The analysis integrates SSM theory, feedforward neural networks, and nonparametric regression, and two synthetic tasks are designed to validate the mechanism empirically. Results: Experiments demonstrate that the architecture solves the target tasks efficiently; moreover, it is rigorously proven to achieve the same estimation rates as the Transformer for a certain function class, providing the first formal theoretical justification for SSMs as viable Transformer alternatives.

📝 Abstract
Deep neural networks based on state space models (SSMs) are attracting significant attention in sequence modeling since their computational cost is much smaller than that of Transformers. While the capabilities of SSMs have been demonstrated through experiments in various tasks, theoretical understanding of SSMs is still limited. In particular, most theoretical studies discuss the capabilities of SSM layers without nonlinear layers, and there is a lack of discussion on their combination with nonlinear layers. In this paper, we explore the capabilities of SSMs combined with fully connected neural networks, and show that they are comparable to Transformers in extracting the essential tokens depending on the input. As concrete examples, we consider two synthetic tasks, which are challenging for a single SSM layer, and demonstrate that SSMs combined with nonlinear layers can efficiently solve these tasks. Furthermore, we study the nonparametric regression task, and prove that the ability of SSMs is equivalent to that of Transformers in estimating functions belonging to a certain class.
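To make the architecture under study concrete, here is a minimal sketch of the building blocks the abstract describes: a single SSM layer (a linear recurrence scanned over the sequence) followed by a position-wise fully connected network with a nonlinearity. The diagonal state-transition parameterization, all dimensions, and the ReLU choice are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import numpy as np

def ssm_layer(x, A, B, C):
    """Scan a linear SSM over the sequence: h_t = A * h_{t-1} + B x_t, y_t = C h_t.

    Assumption: A is a diagonal transition, stored as a vector of length d_state.
    """
    seq_len = x.shape[0]
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(seq_len):
        h = A * h + B @ x[t]   # elementwise diagonal transition plus input projection
        ys.append(C @ h)       # project the hidden state back to the model dimension
    return np.stack(ys)

def ffn(y, W1, b1, W2, b2):
    """Position-wise fully connected layer with a ReLU nonlinearity."""
    return np.maximum(y @ W1 + b1, 0.0) @ W2 + b2

# Tiny end-to-end demo with random weights (shapes are the only claim here).
rng = np.random.default_rng(0)
d_model, d_state, d_ff, seq_len = 4, 8, 16, 5
A = rng.uniform(0.1, 0.9, d_state)            # stable diagonal transition
B = rng.normal(size=(d_state, d_model))
C = rng.normal(size=(d_model, d_state))
W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
out = ffn(ssm_layer(x, A, B, C), W1, b1, W2, b2)  # shape (seq_len, d_model)
```

The point of the composition is that the SSM layer alone is linear in the sequence; the paper's claim concerns what becomes expressible once the nonlinear fully connected layer is stacked on top.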
Problem

Research questions and friction points this paper is trying to address.

Theoretical understanding of SSMs combined with nonlinear layers is limited; most existing analyses cover SSM layers in isolation.
Can SSMs combined with nonlinear layers match Transformers at input-dependent token selection?
Do SSMs and Transformers have equivalent function estimation capabilities in nonparametric regression?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First expressivity analysis of state space models composed with fully connected neural networks
Proof that SSMs with nonlinear layers are comparable to Transformers in input-dependent token selection
Two synthetic tasks, challenging for a single SSM layer, solved efficiently by SSMs with nonlinear layers