MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in fine-grained pest identification in agricultural fields, including large intra-class variance, scarce multi-modal data, and limited labeled samples, this paper proposes a multi-scale cross-modal fusion framework (MSFNet-CPD). Methodologically, the authors (1) introduce CTIP102 and STIP102, the first publicly available multi-modal agricultural pest benchmarks; (2) design an Image-Text Fusion (ITF) module for joint visual-semantic modeling and an Image-Text Converter (ITC) that reconstructs fine-grained details across scales, enabling deep vision-language alignment; and (3) incorporate super-resolution reconstruction and an Arbitrary Combination Image Enhancement (ACIE) strategy to improve robustness. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks, with strong generalization. All code and datasets are publicly released.

📝 Abstract
Accurate identification of agricultural pests is essential for crop protection but remains challenging due to the large intra-class variance and fine-grained differences among pest species. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features and lack effective multi-modal integration, leading to limited accuracy and poor interpretability. Moreover, the scarcity of high-quality multi-modal agricultural datasets further restricts progress in this field. To address these issues, we construct two novel multi-modal benchmarks, CTIP102 and STIP102, based on the widely used IP102 dataset, and introduce a Multi-scale Cross-Modal Fusion Network (MSFNet-CPD) for robust pest detection. Our approach enhances visual quality via a super-resolution reconstruction module, and feeds both the original and reconstructed images into the network to improve clarity and detection performance. To better exploit semantic cues, we propose an Image-Text Fusion (ITF) module for joint modeling of visual and textual features, and an Image-Text Converter (ITC) that reconstructs fine-grained details across multiple scales to handle challenging backgrounds. Furthermore, we introduce an Arbitrary Combination Image Enhancement (ACIE) strategy to generate a more complex and diverse pest detection dataset, MTIP102, improving the model's generalization to real-world scenarios. Extensive experiments demonstrate that MSFNet-CPD consistently outperforms state-of-the-art methods on multiple pest detection benchmarks. All code and datasets will be made publicly available at: https://github.com/Healer-ML/MSFNet-CPD.
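The abstract describes the ITF module only at a high level (joint modeling of visual and textual features). A minimal sketch of one common way such fusion is realized, single-head cross-attention where image patch features query text token features; the function names, shapes, and residual combination here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(img_feats, txt_feats):
    """Fuse image features with text features via single-head
    cross-attention: image patches (queries) attend to text tokens
    (keys/values). img_feats: (N, d), txt_feats: (M, d)."""
    d = img_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)   # (N, M) similarity
    attn = softmax(scores, axis=-1)                 # rows sum to 1
    attended = attn @ txt_feats                     # (N, d) text-conditioned
    return img_feats + attended                     # residual combination

rng = np.random.default_rng(0)
img = rng.standard_normal((49, 64))   # e.g. a 7x7 patch grid, 64-dim features
txt = rng.standard_normal((12, 64))   # e.g. 12 caption tokens, 64-dim features
fused = cross_modal_fusion(img, txt)
print(fused.shape)  # (49, 64)
```

The fused features keep the visual feature shape, so they can drop into a detection head in place of the image-only features.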
Problem

Research questions and friction points this paper is trying to address.

Improving pest detection accuracy despite large intra-class variance
Enhancing multi-modal integration for better interpretability and performance
Addressing scarcity of high-quality multi-modal agricultural datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale Cross-Modal Fusion Network for pest detection
Super-resolution reconstruction enhances image clarity
Image-Text Fusion module integrates visual and textual features
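The ACIE strategy (generating a more diverse dataset by arbitrarily combining image enhancements) is described only at a high level. A toy sketch of the idea, applying a randomly chosen subset of simple enhancements in random order; all enhancement functions and parameters here are hypothetical stand-ins:

```python
import random
import numpy as np

# A small pool of illustrative enhancement operations on float images in [0, 255]
def flip_h(img):
    return img[:, ::-1].copy()                                    # horizontal flip

def flip_v(img):
    return img[::-1, :].copy()                                    # vertical flip

def add_noise(img):
    return np.clip(img + np.random.normal(0, 10, img.shape), 0, 255)

def adjust_brightness(img):
    return np.clip(img * random.uniform(0.7, 1.3), 0, 255)

ENHANCEMENTS = [flip_h, flip_v, add_noise, adjust_brightness]

def acie_augment(img):
    """Apply an arbitrary combination of enhancements: pick a random
    subset (at least one) in random order. A rough stand-in for the
    paper's ACIE strategy, not its actual recipe."""
    k = random.randint(1, len(ENHANCEMENTS))
    for fn in random.sample(ENHANCEMENTS, k):
        img = fn(img)
    return img

img = np.full((32, 32, 3), 128.0)     # dummy gray image
aug = acie_augment(img)
print(aug.shape)  # (32, 32, 3)
```

Running such a pipeline over a base dataset like IP102 would yield a larger, more varied training set, matching the stated purpose of the MTIP102 dataset.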
Jiaqi Zhang
College of Computer and Information Science, Southwest University, Chongqing 400700, China
Zhuodong Liu
Qiyuan Lab
Kejian Yu
School of Computer Science and Technology, Donghua University, Shanghai 201620, China