PerfMamba: Performance Analysis and Pruning of Selective State Space Models

📅 2025-11-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Systematic understanding of runtime behavior, resource consumption, and scalability of selective state space models (SSMs) remains lacking. Method: This paper conducts multi-granularity performance profiling of Mamba-1/2, quantitatively characterizing SSM computational patterns, memory access bottlenecks, and state activity distributions across varying sequence lengths. Based on these insights, we propose a state-activity-driven structured pruning method that dynamically prunes low-activity states during inference to jointly optimize accuracy and efficiency. Contribution/Results: Experiments demonstrate an average 1.14× throughput speedup and 11.50% memory compression across diverse sequence lengths—substantially outperforming baselines. This work establishes a reproducible empirical foundation and a novel paradigm for lightweight SSM design and hardware-aware optimization.

📝 Abstract
Recent advances in sequence modeling have introduced selective SSMs as promising alternatives to Transformer architectures, offering theoretical advantages in computational efficiency and sequence processing. However, a comprehensive understanding of selective SSMs' runtime behavior, resource utilization patterns, and scaling characteristics is still lacking, obstructing their optimal deployment and further architectural improvement. This paper presents a thorough empirical study of Mamba-1 and Mamba-2, systematically profiling their performance to assess the design principles behind their efficiency in state-space modeling. We perform a detailed analysis of computation patterns, memory access, I/O characteristics, and scaling properties for sequence lengths ranging from 64 to 16384 tokens. Our findings show that the SSM component, the central part of the selective SSM architecture, demands a significant portion of computational resources compared to other components in the Mamba block. Based on these insights, we propose a pruning technique that selectively removes low-activity states within the SSM component, achieving measurable throughput and memory gains while maintaining accuracy within a moderate pruning regime. This approach yields performance improvements across varying sequence lengths, achieving a 1.14x speedup and reducing memory usage by 11.50%. These results offer valuable guidance for designing more efficient SSM architectures applicable to a wide range of real-world workloads.
Problem

Research questions and friction points this paper is trying to address.

Analyzes runtime behavior and scaling of selective SSMs
Identifies SSM component as major computational bottleneck
Proposes pruning technique to improve speed and memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Profiles Mamba models for performance and scaling analysis
Proposes pruning low-activity states in SSM component
Improves speed and reduces memory while maintaining accuracy
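The paper does not include implementation details here, but the state-activity-driven pruning it describes can be sketched roughly as follows. This is a hypothetical illustration: the activity metric (mean absolute hidden-state magnitude over time), the function name, and the `keep_ratio` parameter are all assumptions, not the authors' actual method.

```python
# Hypothetical sketch of state-activity-driven structured pruning
# for an SSM, based only on the high-level description above.
# The activity score (mean |h| over time) is an assumed metric.
import numpy as np

def prune_low_activity_states(h, keep_ratio=0.8):
    """Keep only the most active SSM states.

    h          : array of shape (seq_len, num_states), the hidden-state
                 trajectory recorded during inference.
    keep_ratio : fraction of states to retain.
    Returns the sorted indices of retained states and the pruned trajectory.
    """
    activity = np.abs(h).mean(axis=0)            # per-state activity score
    k = max(1, int(keep_ratio * h.shape[1]))     # number of states to keep
    keep = np.sort(np.argsort(activity)[-k:])    # top-k most active states
    return keep, h[:, keep]

# Example: a 128-step trajectory over 16 states with uneven activity
rng = np.random.default_rng(0)
h = rng.normal(size=(128, 16)) * rng.uniform(0.1, 1.0, size=16)
keep, h_pruned = prune_low_activity_states(h, keep_ratio=0.75)
```

Pruning whole state dimensions (rather than individual weights) keeps the remaining computation dense, which is what allows the throughput and memory gains reported above without specialized sparse kernels.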