Scaling Vision Mamba Across Resolutions via Fractal Traversal

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Mamba suffers from two limitations: conventional 2D patch serialization disrupts spatial locality, and generalization across input resolutions is weak. To address them, we propose FractalMamba++, a vision backbone with three components. First, a fractal traversal serialization based on the Hilbert curve preserves spatial continuity between neighboring patches and adapts across input resolutions. Second, a Cross-State Routing (CSR) mechanism mitigates long-range dependency decay in state-space modeling through selective state reuse. Third, a Positional-Relation Capture (PRC) module recovers the local adjacency disrupted at curve inflection points. Across image classification, semantic segmentation, object detection, and change detection, FractalMamba++ consistently outperforms previous Mamba-based backbones, particularly under high-resolution settings.

📝 Abstract
Vision Mamba has recently emerged as a promising alternative to Transformer-based architectures, offering linear complexity in sequence length while maintaining strong modeling capacity. However, its adaptation to visual inputs is hindered by challenges in 2D-to-1D patch serialization and weak scalability across input resolutions. Existing serialization strategies such as raster scanning disrupt local spatial continuity and limit the model's ability to generalize across scales. In this paper, we propose FractalMamba++, a robust vision backbone that leverages fractal-based patch serialization via Hilbert curves to preserve spatial locality and enable seamless resolution adaptability. To address long-range dependency fading in high-resolution inputs, we further introduce a Cross-State Routing (CSR) mechanism that enhances global context propagation through selective state reuse. Additionally, we propose a Positional-Relation Capture (PRC) module to recover local adjacency disrupted by curve inflection points. Extensive experiments on image classification, semantic segmentation, object detection, and change detection demonstrate that FractalMamba++ consistently outperforms previous Mamba-based backbones, particularly under high-resolution settings.
Problem

Research questions and friction points this paper is trying to address.

Adapting Vision Mamba to 2D visual inputs effectively
Improving resolution scalability in Vision Mamba architectures
Preserving spatial locality in patch serialization strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fractal-based patch serialization via Hilbert curves
Cross-State Routing for global context propagation
Positional-Relation Capture to recover local adjacency
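
The first item above, Hilbert-curve patch serialization, can be illustrated with a minimal sketch. It assumes a square patch grid whose side is a power of two and uses the standard index-to-coordinate Hilbert construction; the function names (`hilbert_d2xy`, `hilbert_order`, `max_step`) are hypothetical and are not taken from the paper, whose handling of other grid shapes is not described here. The sketch compares the largest spatial jump between consecutive patches under raster scanning and under Hilbert traversal.

```python
def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two (an assumption of this sketch).
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the sub-square so consecutive cells stay adjacent.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return raster indices (row * n + col) of patches in Hilbert order."""
    order = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)  # x is the column, y is the row
        order.append(y * n + x)
    return order


def max_step(order, n):
    """Largest Manhattan distance between consecutive patches in a sequence."""
    worst = 0
    for a, b in zip(order, order[1:]):
        ra, ca = divmod(a, n)
        rb, cb = divmod(b, n)
        worst = max(worst, abs(ra - rb) + abs(ca - cb))
    return worst


if __name__ == "__main__":
    n = 8  # an 8 x 8 patch grid, chosen only for illustration
    raster = list(range(n * n))
    hilbert = hilbert_order(n)
    print("raster  max step:", max_step(raster, n))
    print("hilbert max step:", max_step(hilbert, n))
```

In this sketch, every pair of consecutive patches along the Hilbert order is spatially adjacent, whereas raster scanning jumps across the full grid width at the end of each row. Reordering a patch-token tensor would then amount to indexing its sequence dimension with `hilbert_order(n)`.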
Bo Li
vivo Mobile Communication Co., Ltd, Shanghai, China
Haoke Xiao
vivo Mobile Communication Co., Ltd, Shanghai, China
Lv Tang
University of Alberta. Former researcher @ UCAS/Nanjing University
Computer Vision, MLLM, Video Compression, Image Segmentation