RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of interpretability in visual Mamba models by systematically investigating their representational properties. Methodologically, we formulate Mamba as a low-rank approximation of Softmax attention, uncovering its intrinsic connections to linear attention and RNN architectures; we further propose a binary-segmentation metric for quantitatively evaluating activation maps and enhance attention visualization quality via DINO self-supervised pretraining. Our contributions are threefold: (1) establishing the first theoretical interpretability framework for Mamba in vision; (2) designing the first explainability evaluation method specifically tailored to visual Mamba models; and (3) achieving 78.5% top-1 accuracy under ImageNet linear probing, demonstrating both strong long-range modeling capability and improved interpretability. The results validate that Mamba's architectural design enables efficient global context aggregation while maintaining human-interpretable attention patterns.
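To make the linear-attention/RNN connection concrete, here is a minimal numerical sketch (our illustration, not the paper's code): causal linear attention computed in its parallel, attention-style form coincides with the same computation unrolled as an RNN over a running state. The ELU+1 feature map is an assumption, a common choice in the linear-attention literature rather than anything specified by this paper.

```python
# Minimal sketch (not the paper's code): causal linear attention in its
# parallel form equals the same computation unrolled as an RNN-style
# recurrence over a running state, the sense in which Mamba-like
# recurrences relate to (linear) attention.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                         # sequence length, head dimension
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

def phi(x):
    # ELU+1 feature map, a common positive feature map for linear attention
    return np.where(x > 0, x + 1.0, np.exp(x) + 1.0)

# Parallel (attention-style) form with a causal mask and row normalization
Qf, Kf = phi(Q), phi(K)
scores = (Qf @ Kf.T) * np.tril(np.ones((T, T)))
out_parallel = (scores / scores.sum(-1, keepdims=True)) @ V

# Recurrent (RNN-style) form: carry a d x d state S and a normalizer z,
# loosely analogous to the hidden state an SSM scan carries across tokens
S = np.zeros((d, d))
z = np.zeros(d)
out_recurrent = np.zeros((T, d))
for t in range(T):
    S += np.outer(Kf[t], V[t])      # accumulate key-value outer products
    z += Kf[t]                      # accumulate keys for normalization
    out_recurrent[t] = (Qf[t] @ S) / (Qf[t] @ z)

assert np.allclose(out_parallel, out_recurrent)
```

The fixed-size state S is what makes the recurrent form O(T) in sequence length, while the parallel form makes the kinship with attention explicit.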

📝 Abstract
Mamba has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba's representational properties and make three primary contributions. First, we theoretically analyze Mamba's relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba's capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba's potential for interpretability. Notably, our model also achieves a 78.5 percent linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.
Problem

Research questions and friction points this paper is trying to address.

Investigating Mamba's representational mechanisms in visual domains
Bridging the representational gap between Softmax and Linear Attention (see the formulas after this list)
Evaluating Mamba's long-range dependency modeling capabilities
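The second point is essentially a statement about attention kernels. For reference, the two forms being bridged can be written as follows (standard formulations in our notation, not the paper's exact derivation):

```latex
% Softmax attention vs. causal linear attention; standard formulations,
% not the paper's exact notation.
\[
\mathrm{Attn}_{\mathrm{softmax}}(Q,K,V)
  = \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad
\mathrm{Attn}_{\mathrm{linear}}(Q,K,V)_t
  = \frac{\phi(q_t)^{\top}\sum_{s\le t}\phi(k_s)\,v_s^{\top}}
         {\phi(q_t)^{\top}\sum_{s\le t}\phi(k_s)}.
\]
% A low-rank view replaces the softmax kernel with a factored surrogate,
% \exp(q^{\top}k/\sqrt{d}) \approx \phi(q)^{\top}\phi(k),
% which is what permits rewriting attention as a recurrence.
```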
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba as a low-rank approximation of Softmax Attention
Binary segmentation metric for activation-map evaluation (see the sketch after this list)
DINO self-supervised pretraining for clearer activation maps
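The paper's exact metric is not spelled out in this summary, so the following is only a sketch of one plausible reading of a "binary segmentation metric": threshold an activation map into a binary mask and score it against a ground-truth foreground mask. The mean-value threshold and the IoU score are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of a binary-segmentation-style activation metric:
# binarize an activation map by a threshold, then score the resulting
# mask against a ground-truth foreground mask. The mean-value threshold
# and the IoU score are assumptions for illustration only.
import numpy as np

def binary_activation_score(act_map: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between a thresholded activation map and a binary GT mask."""
    mask = act_map >= act_map.mean()          # binary partition of the map
    inter = np.logical_and(mask, gt_mask).sum()
    union = np.logical_or(mask, gt_mask).sum()
    return float(inter) / max(float(union), 1.0)

# Toy usage: a centered Gaussian blob scored against a centered square mask.
yy, xx = np.mgrid[0:32, 0:32]
act = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / 50.0)
gt = (np.abs(yy - 16) < 8) & (np.abs(xx - 16) < 8)
print(f"activation IoU: {binary_activation_score(act, gt):.3f}")
```

Whatever the paper's precise formulation, the appeal of such a metric is that it turns qualitative "the activation map looks object-centered" judgments into a single comparable number.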