🤖 AI Summary
To balance global contextual modeling against low inference latency in real-time object detection under resource constraints, this paper proposes MambaNeXt-YOLO, the first YOLO architecture to integrate the linear state space model Mamba. We design the lightweight MambaNeXt block to enable synergistic CNN-Mamba feature modeling, and introduce a Multi-branch Asymmetric Fusion Pyramid Network (MAFPN) to strengthen cross-scale feature interaction. Crucially, MambaNeXt-YOLO requires no pretraining and is optimized end-to-end for edge deployment. On PASCAL VOC, it achieves 66.6% mAP at 31.9 FPS, enabling real-time inference on embedded platforms such as the Jetson Xavier NX and Orin NX. Our core contributions are: (i) the first Mamba-CNN hybrid backbone tailored for real-time object detection; and (ii) an edge-native design that eliminates pretraining, minimizes computational overhead, and delivers high accuracy.
📝 Abstract
Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the adoption of Transformer-based architectures. Nevertheless, Transformers incur high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves detection of objects across a wide range of scales; and (3) Edge-focused Efficiency: our method achieves 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training, and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.
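To make the linear-complexity claim concrete, the sketch below shows the classic discrete linear state space recurrence that Mamba-style models build on: the sequence is processed in a single O(L) pass, in contrast to the O(L²) pairwise interactions of self-attention. This is a minimal NumPy illustration of the general SSM recurrence, not the paper's implementation; the names `A`, `B`, `C`, and `state_dim` are generic SSM notation, and the specific matrices chosen here are arbitrary assumptions for demonstration.

```python
# Minimal sketch of a discrete linear state space model (SSM) scan:
#   h_t = A @ h_{t-1} + B @ x_t
#   y_t = C @ h_t
# One linear-time pass over the sequence, unlike quadratic self-attention.
import numpy as np

def ssm_scan(x, A, B, C):
    """Run the SSM recurrence over a 1-D input sequence x of length L.

    A: (d, d) state transition, B: (d, 1) input map, C: (1, d) output map.
    Returns the (L, 1) array of outputs y_1..y_L.
    """
    state_dim = A.shape[0]
    h = np.zeros(state_dim)          # hidden state h_0 = 0
    ys = []
    for x_t in x:                    # single O(L) sweep
        h = A @ h + B @ np.atleast_1d(x_t)
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
L, d = 8, 4                          # sequence length, state dimension
A = 0.9 * np.eye(d)                  # a stable (contracting) transition
B = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))
y = ssm_scan(rng.standard_normal(L), A, B, C)
print(y.shape)                       # (8, 1)
```

Mamba additionally makes the SSM parameters input-dependent (selective) and uses a hardware-aware parallel scan, but the cost remains linear in sequence length, which is what makes it attractive for the edge-deployment setting described above.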