🤖 AI Summary
This work addresses three limitations of existing Mamba-based methods in complex traffic scenes: they struggle to capture fine-grained local details and global semantics at the same time, they show limited performance on multi-scale objects (small targets in particular), and they lack hierarchical feature representation and cross-scale interaction. To overcome these limitations, the paper proposes MDDCNet, whose hybrid backbone combines Multi-Scale Deformable Dilated Convolution (MSDDC) blocks with state space model (Mamba) blocks to enhance local perception. It further introduces a Channel-Enhanced Feed-Forward Network (CE-FFN) to strengthen inter-channel interaction and a Mamba-based Attention-Aggregating Feature Pyramid Network (A²FPN) for multi-scale feature fusion. Experiments on public benchmark and real-world traffic datasets show that the method outperforms various advanced detectors.
📝 Abstract
In real-world traffic scenarios, objects of varying scale are usually distributed across cluttered backgrounds, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection. In MDDCNet, a hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. A Channel-Enhanced Feed-Forward Network (CE-FFN) is devised to overcome the limited channel interaction capability of conventional feed-forward networks, while a Mamba-based Attention-Aggregating Feature Pyramid Network (A²FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.
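As a rough illustration of the multi-scale dilated convolution idea mentioned above, the following minimal sketch runs a 1-D signal through parallel branches with different dilation rates and averages them. It is not the MSDDC block from the paper: the deformable offsets are omitted, and the function names (`dilated_conv1d`, `multi_scale_dilated`, `receptive_field`), the averaging fusion, and the chosen dilation rates are all assumptions made for illustration only.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Same-length 1-D dilated cross-correlation with zero padding (hypothetical helper)."""
    half = (len(kernel) // 2) * dilation  # reach of the kernel on each side
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation  # kernel taps are spaced `dilation` apart
            if 0 <= idx < len(x):  # positions outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def multi_scale_dilated(x: Sequence[float], kernel: Sequence[float],
                        dilations: Sequence[int] = (1, 2, 3)) -> List[float]:
    """Average parallel branches with different dilation rates (assumed fusion rule)."""
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    return [sum(vals) / len(branches) for vals in zip(*branches)]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


if __name__ == "__main__":
    impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
    print(dilated_conv1d(impulse, [1, 1, 1], dilation=1))
    print(dilated_conv1d(impulse, [1, 1, 1], dilation=2))
    print(multi_scale_dilated(impulse, [1, 1, 1], dilations=(1, 2)))
```

With a 3-tap kernel, a dilation of 1 covers 3 positions and a dilation of 2 covers 5, so the branches respond to structure at different scales without adding weights. A deformable variant would additionally learn per-position offsets for the sampling locations.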