🤖 AI Summary
Existing selective mechanisms in State Space Models (SSMs) lack theoretical grounding and are susceptible to spurious correlations, compromising model optimality and robustness. To address this, we propose the **Predictive Sufficiency Principle**, the first information-theoretic framework that rigorously preserves future predictive capability while compressing historical information into a minimal sufficient statistic. This principle provides a first-principles foundation for selective mechanisms and serves as a general-purpose regularization method applicable across diverse architectures. Our approach jointly optimizes state representations and sufficient statistics via an information-theoretic objective, effectively suppressing non-causal noise and spurious patterns. Empirically, it achieves state-of-the-art performance on multiple benchmarks, with particularly notable gains in long-horizon forecasting and high-noise regimes, demonstrating superior generalization and robustness.
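The "minimal sufficient statistic" idea described above admits a natural information-bottleneck-style formalization. The notation below is an illustrative sketch (symbols $h_t$, $X_{\le t}$, $X_{>t}$, and $\beta$ are our own choices, not taken from the paper): the hidden state $h_t$ should retain all predictive information about the future while discarding as much of the past as possible,

$$
h_t^{\*} \;=\; \arg\min_{h_t} \; I(X_{\le t};\, h_t)
\quad \text{s.t.} \quad I(h_t;\, X_{>t}) \;=\; I(X_{\le t};\, X_{>t}),
$$

which in practice would typically be relaxed into a Lagrangian training objective,

$$
\mathcal{L}_{\mathrm{MPS}} \;=\; -\,I(h_t;\, X_{>t}) \;+\; \beta \, I(X_{\le t};\, h_t),
$$

where the first term preserves predictive sufficiency and the second enforces minimality (compression), with $\beta > 0$ trading off the two.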
📝 Abstract
State Space Models (SSMs), particularly recent selective variants like Mamba, have emerged as a leading architecture for sequence modeling, challenging the dominance of Transformers. However, the success of these state-of-the-art models largely relies on heuristically designed selective mechanisms, which lack a rigorous first-principles derivation. This theoretical gap raises questions about their optimality and robustness against spurious correlations. To address this, we introduce the Principle of Predictive Sufficiency, a novel information-theoretic criterion stipulating that an ideal hidden state should be a minimal sufficient statistic of the past for predicting the future. Based on this principle, we propose the Minimal Predictive Sufficiency State Space Model (MPS-SSM), a new framework in which the selective mechanism is guided by optimizing an objective function derived from our principle. This approach encourages the model to maximally compress historical information without losing predictive power, thereby learning to ignore non-causal noise and spurious patterns. Extensive experiments on a wide range of benchmark datasets demonstrate that MPS-SSM not only achieves state-of-the-art performance, significantly outperforming existing models in long-term forecasting and noisy scenarios, but also exhibits superior robustness. Furthermore, we show that the MPS principle can be extended as a general regularization framework to enhance other popular architectures, highlighting its broad potential.
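The abstract notes that the MPS principle can serve as a general regularization framework for other architectures. One plausible way such a regularizer could look, sketched here under our own assumptions (the function names, the Gaussian state posterior, and the KL-based compression bound are illustrative and not the paper's actual objective), is a standard forecasting loss plus a variational compression penalty on the hidden state:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ): a common variational
    # upper bound on I(past; state), used here as the "minimality" term.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def mps_style_loss(pred, target, state_mu, state_logvar, beta=0.1):
    # Sufficiency term: the state must remain predictive of the future.
    pred_loss = np.mean((pred - target) ** 2)
    # Minimality term: penalize information the state keeps about the past.
    compression = np.mean(kl_to_standard_normal(state_mu, state_logvar))
    # beta trades off compression against predictive power.
    return pred_loss + beta * compression
```

With `beta = 0` this reduces to ordinary forecasting training; increasing `beta` pressures the state toward a minimal sufficient statistic, which is the mechanism the abstract credits for ignoring non-causal noise.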