🤖 AI Summary
To balance global contextual modeling against low inference latency in real-time object detection under resource constraints, this paper proposes MambaNeXt-YOLO, the first YOLO architecture to integrate the linear state space model Mamba. We design the lightweight MambaNeXt block to enable synergistic CNN-Mamba feature modeling, and introduce a Multi-branch Asymmetric Fusion Pyramid Network (MAFPN) to strengthen cross-scale feature interaction. Crucially, MambaNeXt-YOLO requires no pretraining and is optimized end-to-end for edge deployment. On PASCAL VOC, it achieves 66.6% mAP at 31.9 FPS, enabling real-time inference on embedded platforms such as the Jetson Xavier NX and Orin NX. Our core contributions are: (i) the first Mamba-CNN hybrid backbone tailored for real-time object detection; and (ii) an edge-native design that eliminates pretraining, minimizes computational overhead, and delivers high accuracy.
📝 Abstract
Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the adoption of Transformer-based architectures. Nevertheless, Transformers incur high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves detection of objects across a wide range of scales; and (3) Edge-focused Efficiency: our method achieves 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training, and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.
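To make the linear-complexity claim concrete, the sketch below shows the classic discrete linear state space recurrence that Mamba-style models build on: the sequence is processed in a single O(L) pass, in contrast to the O(L²) pairwise interactions of self-attention. This is a minimal NumPy illustration of the general SSM recurrence, not the paper's implementation; the names `A`, `B`, `C`, and `state_dim` are generic SSM notation, and the specific matrices chosen here are arbitrary assumptions for demonstration.

```python
# Minimal sketch of a discrete linear state space model (SSM) scan:
#   h_t = A @ h_{t-1} + B @ x_t
#   y_t = C @ h_t
# One linear-time pass over the sequence, unlike quadratic self-attention.
import numpy as np

def ssm_scan(x, A, B, C):
    """Run the SSM recurrence over a 1-D input sequence x of length L.

    A: (d, d) state transition, B: (d, 1) input map, C: (1, d) output map.
    Returns the (L, 1) array of outputs y_1..y_L.
    """
    state_dim = A.shape[0]
    h = np.zeros(state_dim)          # hidden state h_0 = 0
    ys = []
    for x_t in x:                    # single O(L) sweep
        h = A @ h + B @ np.atleast_1d(x_t)
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
L, d = 8, 4                          # sequence length, state dimension
A = 0.9 * np.eye(d)                  # a stable (contracting) transition
B = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))
y = ssm_scan(rng.standard_normal(L), A, B, C)
print(y.shape)                       # (8, 1)
```

Mamba additionally makes the SSM parameters input-dependent (selective) and uses a hardware-aware parallel scan, but the cost remains linear in sequence length, which is what makes it attractive for the edge-deployment setting described above.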