🤖 AI Summary
This study addresses the detection of fine, low-contrast cracks in buildings, which are easily obscured by background noise and therefore degrade the performance of existing object detectors. The authors propose YOLO-AMC, an enhanced YOLOv11 architecture that removes the original C2PSA module and integrates attention mechanisms into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature aggregation. Three mechanisms are evaluated: Global Attention Mechanism (GAM), Residual CBAM (Res-CBAM), and Shuffle Attention (SA). The best-performing GAM variant achieves mAP@0.5 of 0.9917 and mAP@0.5:0.95 of 0.9506 on the test set, outperforming both the YOLOv11n and YOLOv8n baselines. At a computational cost of 7.6 GFLOPs, the model runs at 110.95 FPS on an NVIDIA RTX 4090 GPU and approximately 5 FPS on a Raspberry Pi 5.
📝 Abstract
Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at https://github.com/CY-Tsai24/YOLO-AMC.
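The two reported metrics differ only in how strictly a predicted box must overlap the ground truth: mAP@0.5 counts a detection as correct at an intersection-over-union (IoU) of at least 0.5, while mAP@0.5:0.95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The following is a minimal sketch of that thresholding step only, not the authors' evaluation code. The function names and the example boxes are hypothetical and chosen purely for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds used by mAP@0.5:0.95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Return the IoU thresholds at which `pred` would match `truth`."""
    score = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if score >= t]


# Hypothetical thin crack: 100 px long, 10 px wide (illustrative values only).
truth = (0, 0, 100, 10)
# Hypothetical prediction shifted by 2 px across the crack's width.
pred = (0, 2, 100, 12)

print(round(iou(pred, truth), 4))      # 0.6667
print(thresholds_passed(pred, truth))  # [0.5, 0.55, 0.6, 0.65]
```

In this toy case a 2 px offset on a 10 px wide box already fails six of the ten thresholds, which suggests why mAP@0.5:0.95 is the more demanding figure for thin structures such as cracks and why it separates the compared models more clearly than mAP@0.5.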