🤖 AI Summary
Traditional object detection relies heavily on labor-intensive manual annotation, generalizes poorly, and adapts slowly to novel categories and dynamic environments. To address these challenges, this work proposes an end-to-end, fully automated detection pipeline. It introduces, for the first time, a unified framework that integrates CLIP-driven open-vocabulary localization, diffusion-model-enhanced feature representation, uncertainty-aware pseudo-label filtering, and an interactive human verification mechanism. This enables zero-shot category extension and closed-loop optimization with controllable annotation quality. Built on a fine-tuned YOLOv8 backbone, the method reaches 92% of the fully supervised state-of-the-art mAP on COCO and LVIS while using only 15% of the manual annotations required by conventional approaches. The pipeline significantly reduces annotation cost while substantially improving cross-domain generalization.
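
The summary does not say how the uncertainty-aware filtering or the human verification step is implemented. As a rough illustration only, the sketch below assumes that detector confidence stands in for uncertainty and that two thresholds split pseudo-labels into three groups: auto-accepted, sent to a human reviewer, and discarded. The names `Detection`, `route_pseudo_labels`, `accept_at`, and `reject_below`, as well as the threshold values, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """A single pseudo-label candidate (hypothetical structure)."""
    label: str
    score: float  # detector confidence in [0, 1], used here as an uncertainty proxy
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def route_pseudo_labels(
    detections: List[Detection],
    accept_at: float = 0.8,     # assumed threshold, not from the paper
    reject_below: float = 0.3,  # assumed threshold, not from the paper
) -> Tuple[List[Detection], List[Detection], List[Detection]]:
    """Split candidates into (accepted, needs_review, rejected)."""
    if not 0.0 <= reject_below <= accept_at <= 1.0:
        raise ValueError("expected 0 <= reject_below <= accept_at <= 1")

    accepted: List[Detection] = []
    needs_review: List[Detection] = []
    rejected: List[Detection] = []
    for det in detections:
        if det.score >= accept_at:
            accepted.append(det)      # confident enough to use as a training label
        elif det.score < reject_below:
            rejected.append(det)      # too uncertain to be worth a reviewer's time
        else:
            needs_review.append(det)  # ambiguous: queue for human verification
    return accepted, needs_review, rejected


candidates = [
    Detection("dog", 0.93, (10.0, 20.0, 110.0, 220.0)),
    Detection("skateboard", 0.55, (40.0, 200.0, 160.0, 240.0)),
    Detection("kite", 0.12, (300.0, 5.0, 340.0, 40.0)),
]
accepted, needs_review, rejected = route_pseudo_labels(candidates)
print(len(accepted), len(needs_review), len(rejected))
```

Under these assumptions, only the middle band reaches a human, which is one plausible way to keep manual effort at a small fraction of the full annotation load. Raising `reject_below` or lowering `accept_at` shrinks the review queue further, at the cost of noisier labels.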