HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Weak spatial inductive bias, long-range forgetting, and low-resolution output representations hinder visual state space models (VSSMs) in dense prediction tasks such as human pose estimation and semantic segmentation. To address these bottlenecks, this paper proposes the Dynamic Visual State Space (DVSS) block, which integrates multi-scale convolutions to strengthen local inductive bias and employs deformable convolutions to mitigate long-range forgetting through adaptive, input-dependent spatial aggregation. Building on DVSS, the paper introduces HRVMamba, which adopts HRNet's multi-resolution parallel design to preserve high-resolution representations throughout the network while promoting multi-scale feature learning. Without post-processing or auxiliary training tricks, HRVMamba achieves competitive results against existing benchmark models on dense prediction tasks while retaining favorable computational efficiency.

📝 Abstract
Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive field. However, Mamba's performance on dense prediction tasks, including human pose estimation and semantic segmentation, has been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representation. To address these challenges, we introduce the Dynamic Visual State Space (DVSS) block, which utilizes multi-scale convolutional kernels to extract local features across different scales and enhance inductive bias, and employs deformable convolution to mitigate the long-range forgetting problem while enabling adaptive spatial aggregation based on input and task-specific information. By leveraging the multi-resolution parallel design proposed in HRNet, we introduce High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process while promoting effective multi-scale feature learning. Extensive experiments highlight HRVMamba's impressive performance on dense prediction tasks, achieving competitive results against existing benchmark models without bells and whistles. Code is available at https://github.com/zhanghao5201/HRVMamba.
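As a rough illustration of the mechanisms the abstract describes, the sketch below combines the two ingredients of a DVSS-style block: multi-scale local mixing for inductive bias, followed by a linear-time state space scan for a global receptive field. This is a plain-NumPy toy, not the paper's implementation; all function names, kernel sizes, and scan coefficients are hypothetical, and the learned, input-dependent parameters of Mamba (and the deformable convolution) are omitted for brevity.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time state space scan: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    x: (T, D) token sequence; A, B, C are fixed scalars here, whereas
    Mamba makes them input-dependent (selective). Cost is O(T)."""
    T, D = x.shape
    h = np.zeros(D)
    y = np.empty_like(x)
    for t in range(T):
        h = A * h + B * x[t]   # constant-time state update per token
        y[t] = C * h           # readout
    return y

def multi_scale_conv1d(x, kernels=(3, 5, 7)):
    """Average of box filters at several kernel sizes, standing in for the
    multi-scale convolutional branch that supplies local inductive bias."""
    T, D = x.shape
    out = np.zeros_like(x)
    for k in kernels:
        pad = k // 2
        xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        for t in range(T):
            out[t] += xp[t:t + k].mean(axis=0)  # illustrative, unlearned kernel
    return out / len(kernels)

def dvss_block(x):
    """Hypothetical DVSS-style block: local multi-scale mixing, then a
    global linear-time scan, wrapped in a residual connection."""
    local = multi_scale_conv1d(x)
    global_mix = ssm_scan(local, A=0.9, B=0.1, C=1.0)
    return x + global_mix
```

With A = 0.9 the scan decays old state geometrically, which is exactly the "long-range forgetting" the paper counteracts with deformable convolution; the sketch only shows why a remedy is needed, not the remedy itself.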
Problem

Research questions and friction points this paper is trying to address.

Efficient high-resolution visual learning for pose estimation
Overcoming weak spatial bias in State Space Models
Mitigating long-range forgetting in dense prediction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Visual State Space (DVSS) block
Multi-scale convolutional operations
Deformable operation in DVSS
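The HRNet-style multi-resolution parallel design that HRVMamba inherits can be sketched as a fusion step in which every resolution stream receives the resized sum of all streams, so a high-resolution path survives the whole network. The toy NumPy code below is an assumption-laden illustration (nearest-neighbor resizing, no learned fusion weights), not the paper's code:

```python
import numpy as np

def resize(x, shape):
    """Nearest-neighbor resize of a 2-D feature map (illustrative only;
    HRNet uses learned strided convs / upsampling for exchange)."""
    h, w = shape
    ri = np.arange(h) * x.shape[0] // h
    ci = np.arange(w) * x.shape[1] // w
    return x[np.ix_(ri, ci)]

def fuse_streams(streams):
    """HRNet-style exchange: each stream becomes the sum of all streams
    resized to its own resolution. streams: list of (H, W) maps ordered
    from high to low resolution."""
    shapes = [s.shape for s in streams]
    return [sum(resize(s, shp) for s in streams) for shp in shapes]
```

Repeating this fusion after each stage lets fine-grained spatial detail from the high-resolution stream and semantic context from the low-resolution streams inform one another, which is the property the paper relies on for dense prediction.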