🤖 AI Summary
Small-object detection in UAV imagery is hindered by extremely small object sizes, low signal-to-noise ratios, and cluttered backgrounds; existing multi-scale approaches often sacrifice fine-grained detail or incur excessive computational overhead. To address this, we propose a lightweight multi-scale global–local feature fusion framework built around a novel FusionLock mechanism. This mechanism jointly integrates Token-Statistics self-attention (for long-range semantic modeling), directional convolution with parallel attention (to enhance local structural perception), and dynamic pixel-wise weighting (to suppress background interference), enabling efficient and precise global–local feature coupling. Evaluated on the VisDrone benchmark, the method consistently outperforms state-of-the-art approaches across diverse backbone networks and detector architectures, achieving significant gains in both precision and recall while maintaining real-time inference speed, making it well-suited for resource-constrained onboard UAV platforms.
📝 Abstract
Small-object detection in UAV imagery is crucial for applications such as search-and-rescue, traffic monitoring, and environmental surveillance, but it is hampered by tiny object sizes, low signal-to-noise ratios, and weak feature representations. Existing multi-scale fusion methods help, but they add computational burden and blur fine details, making small-object detection in cluttered scenes difficult. To overcome these challenges, we propose the Multi-scale Global-detail Feature Integration Strategy (MGDFIS), a unified fusion framework that tightly couples global context with local detail to boost detection performance while maintaining efficiency. MGDFIS comprises three synergistic modules: the FusionLock-TSS Attention Module, which marries token-statistics self-attention with DynamicTanh normalization to highlight spectral and spatial cues at minimal cost; the Global-detail Integration Module, which fuses multi-scale context via directional convolution and parallel attention while preserving subtle shape and texture variations; and the Dynamic Pixel Attention Module, which generates pixel-wise weighting maps to rebalance uneven foreground and background distributions and sharpen responses to true object regions. Extensive experiments on the VisDrone benchmark demonstrate that MGDFIS consistently outperforms state-of-the-art methods across diverse backbone architectures and detection frameworks, achieving superior precision and recall with low inference time. By striking a practical balance between accuracy and resource usage, MGDFIS offers an effective solution for small-object detection on resource-constrained UAV platforms.
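To make the pixel-wise weighting idea behind the Dynamic Pixel Attention Module concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the simplest possible form: a 1×1 channel projection produces one logit per pixel, a sigmoid turns it into a gate in [0, 1], and the gate is broadcast over channels to down-weight background pixels. The function name and the projection parameters `w` and `b` are hypothetical names introduced here for illustration.

```python
import numpy as np

def dynamic_pixel_attention(feat, w, b):
    """Hypothetical sketch of a pixel-wise weighting map.

    feat: (C, H, W) feature map
    w:    (C,) projection weights (a 1x1 conv collapsing C channels to 1 logit)
    b:    scalar bias
    Returns the feature map reweighted per pixel, same shape as `feat`.
    """
    # Per-pixel logit: weighted sum over the channel axis -> (H, W)
    logits = np.tensordot(w, feat, axes=([0], [0])) + b
    # Sigmoid gate in (0, 1): large where foreground evidence is strong
    gate = 1.0 / (1.0 + np.exp(-logits))
    # Broadcast the single-channel gate over all channels
    return feat * gate[None, :, :]
```

In the actual module the gate would be learned end-to-end with the detector; the point of the sketch is only that the weighting is resolved per pixel rather than per channel or per scale, which is what lets it rebalance sparse foreground against dominant background.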