🤖 AI Summary
Infrared video-based gas leak detection suffers from frequent false negatives and false positives due to the inherent blur, non-rigidity, and inter-frame sparsity of gas plumes. To address this, we propose the first vision-language joint segmentation framework specifically designed for gas leak detection. Our method aligns and fuses multimodal features from infrared video sequences and textual descriptions to enrich the semantic representation of leaking regions. A confidence-guided dynamic post-processing mechanism is introduced to suppress noise and background interference. The framework supports both fully supervised and few-shot learning settings. Extensive experiments across diverse real-world scenarios demonstrate that our approach significantly outperforms existing methods, achieving state-of-the-art detection accuracy, robustness, and generalization—particularly under label-scarce conditions, where it adapts exceptionally well.
📝 Abstract
Gas leaks pose serious threats to human health and contribute significantly to atmospheric pollution, drawing increasing public concern. However, the lack of effective detection methods hampers timely and accurate identification of gas leaks. While some vision-based techniques leverage infrared videos for leak detection, the blurry and non-rigid nature of gas clouds often limits their effectiveness. To address these challenges, we propose a novel framework called Joint Vision-Language Gas leak Segmentation (JVLGS), which integrates the complementary strengths of visual and textual modalities to enhance gas leak representation and segmentation. Recognizing that gas leaks are sporadic and many video frames may contain no leak at all, our method incorporates a post-processing step that reduces the false positives caused by noise and non-target objects, an issue affecting many existing approaches. Extensive experiments conducted across diverse scenarios show that JVLGS significantly outperforms state-of-the-art gas leak segmentation methods. We evaluate our model under both supervised and few-shot learning settings, and it consistently achieves strong performance in both, whereas competing methods tend to perform well in only one setting or poorly in both. Code is available at: https://github.com/GeekEagle/JVLGS
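To make the post-processing idea concrete, here is a minimal illustrative sketch of confidence-guided filtering on a per-frame probability map: gate out frames whose peak confidence is too low (many frames contain no leak at all), then drop tiny connected components as noise. This is an assumption-laden simplification, not the paper's exact algorithm; the function name, thresholds, and pure-Python representation are all hypothetical.

```python
from collections import deque

def confidence_guided_postprocess(prob_map, conf_thresh=0.5,
                                  min_area=4, frame_thresh=0.6):
    """Illustrative sketch only (not JVLGS's actual mechanism).

    prob_map: 2D list of per-pixel leak probabilities for one frame.
    Returns a binary mask (2D list of 0/1) after two filters:
      1. frame-level gate: if no pixel is confident enough, assume the
         frame is leak-free and return an empty mask;
      2. component filter: remove 4-connected blobs smaller than
         min_area, treating them as noise or non-target objects.
    """
    h, w = len(prob_map), len(prob_map[0])
    # 1. Frame-level confidence gate: leaks are sporadic, so many
    #    frames legitimately contain nothing to segment.
    if max(max(row) for row in prob_map) < frame_thresh:
        return [[0] * w for _ in range(h)]
    # Binarize at the pixel-level confidence threshold.
    mask = [[1 if prob_map[y][x] >= conf_thresh else 0
             for x in range(w)] for y in range(h)]
    # 2. Drop small 4-connected components via BFS flood fill.
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                                   (cy, cx + 1), (cy, cx - 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) < min_area:  # too small to be a plume
                    for cy, cx in comp:
                        mask[cy][cx] = 0
    return mask
```

In practice a real pipeline would operate on model logits with libraries such as OpenCV or SciPy for connected-component analysis; the pure-Python version above just makes the two filtering steps explicit.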