MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

2025-06-04
AI Summary
Real-time object detection on resource-constrained hardware must balance global context modeling against low inference latency. To address this, the paper proposes MambaNeXt-YOLO, a YOLO architecture that integrates the linear state space model Mamba. The lightweight MambaNeXt block combines CNN and Mamba feature modeling, and a Multi-branch Asymmetric Fusion Pyramid Network (MAFPN) strengthens cross-scale feature interaction. MambaNeXt-YOLO requires no pre-training and is designed for edge deployment. On PASCAL VOC it achieves 66.6% mAP at 31.9 FPS, and it supports deployment on embedded platforms such as the NVIDIA Jetson Xavier NX and Orin NX. The core contributions are: (i) a hybrid Mamba-CNN block tailored for real-time object detection; (ii) the MAFPN feature pyramid for multi-scale detection; and (iii) an edge-focused design that needs no pre-training while keeping computational overhead low.

Abstract
Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves multi-scale object detection across various object sizes; and (3) Edge-focused Efficiency: our method achieved 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.
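
The linear-complexity claim for state space models can be illustrated with a toy example. The sketch below is not the Mamba selective scan and not the MambaNeXt block: it is a scalar, time-invariant recurrence h_t = a*h_(t-1) + b*x_t, y_t = c*h_t, with hypothetical parameter names `a`, `b`, `c` chosen only for illustration. It shows that a single pass over a length-L sequence gives the same output as an explicit sum over all earlier positions, which costs O(L^2).

```python
def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """Scalar linear state space recurrence, one pass over x: O(L) time."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t  # state update carries all past context
        ys.append(c * h)     # readout
    return ys


def ssm_all_pairs(x, a=0.9, b=1.0, c=1.0):
    """Same output written as an explicit sum over past inputs: O(L^2) time."""
    ys = []
    for t in range(len(x)):
        # y_t = sum over k of c * a^k * b * x_(t-k)
        y_t = sum(c * (a ** k) * b * x[t - k] for k in range(t + 1))
        ys.append(y_t)
    return ys


if __name__ == "__main__":
    seq = [1.0, 2.0, 0.0, -1.0, 3.0]
    print(ssm_scan(seq))
    print(ssm_all_pairs(seq))
```

Both functions depend on every earlier input, but only the recurrent form avoids revisiting them. Mamba additionally makes the parameters input-dependent and uses a hardware-aware scan, which this sketch leaves out.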
Problem

Research questions and friction points this paper is trying to address.

Balancing speed and accuracy in real-time object detection
Reducing computational complexity of global context modeling
Enabling efficient deployment on edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid CNN-Mamba design for feature capture
Multi-branch fusion pyramid for multi-scale detection
Edge-optimized efficiency for real-time deployment
Xiaochun Lei
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China; Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China
Siqi Wu
Indiana University Bloomington
Computational social science, Social computing, Algorithmic auditing, Crowdsourcing
Weilin Wu
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China
Zetao Jiang
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China; Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China