AI Summary
This work addresses three key challenges in visible-to-infrared (V2IR) image translation: weak semantic awareness, spectral diversity across infrared bands, and the scarcity of real-world annotated data. To this end, we propose the first diffusion-based framework integrating progressive learning with vision-language understanding. Methodologically, we design a multi-stage infrared-band-adaptive diffusion architecture to explicitly model spectral characteristics; introduce a CLIP-driven vision-language unified understanding module to enhance cross-modal semantic alignment; and construct IR-500K, the first large-scale real-world infrared dataset, comprising 500,000 images. Extensive experiments demonstrate significant improvements in translation fidelity and generalization across multiple benchmarks, achieving state-of-the-art performance. All code, pretrained models, and the IR-500K dataset are publicly released.
Abstract
The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis problem, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion-model architecture that uses multi-stage knowledge learning to transition from full-range infrared to the target wavelength band. To further improve V2IR translation, VLUM incorporates unified vision-language understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images spanning various scenes and objects under diverse environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves V2IR performance. Experiments validate DiffV2IR's ability to produce high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at https://github.com/LidongWang-26/DiffV2IR.
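As a rough illustration of the progressive-learning idea behind PLM, the sketch below encodes a three-stage curriculum that moves from full-spectrum infrared data toward a single target wavelength band. The stage names, boundaries, and trainable-weight choices here are illustrative assumptions for exposition, not the paper's exact training recipe.

```python
# Hypothetical sketch of a PLM-style progressive training schedule.
# Stage names and data/weight choices are assumptions, not DiffV2IR's actual recipe.
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str       # which slice of an IR-500K-like corpus the model sees
    trainable: str  # which weights are updated in this stage


SCHEDULE = [
    Stage("full-spectrum pretraining",
          "all infrared images, every band", "full diffusion backbone"),
    Stage("band-adaptive tuning",
          "images grouped by infrared band", "band-conditioning layers"),
    Stage("target-wavelength refinement",
          "target-band subset only", "full backbone at a low learning rate"),
]


def stage_for(progress: float) -> Stage:
    """Map normalized training progress in [0, 1] to the active curriculum stage."""
    idx = min(int(progress * len(SCHEDULE)), len(SCHEDULE) - 1)
    return SCHEDULE[idx]
```

In this toy form, early training sees the broadest data (full-spectrum), and later stages narrow both the data and, in the middle stage, the set of trainable parameters toward the target band, mirroring the full-range-to-target-wavelength transition the abstract describes.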