🤖 AI Summary
Existing vision-language models align thermal images with text poorly, for three reasons: captioned thermal data is scarce, large language models have a limited understanding of thermal phenomena, and scene-level and object-level features conflict when learned in a single embedding space. To address these challenges, this work introduces IR-Cap, the first thermal image–text dataset incorporating physical priors, and proposes T-CLIP, a dual-LoRA disentangled framework built on CLIP that models global scene context and local thermal semantics separately rather than forcing both into one embedding space. In experiments, T-CLIP consistently outperforms the baselines in cross-modal retrieval on three thermal imaging benchmarks, and an exploratory demonstration suggests it can also support text-guided thermal image generation.
📝 Abstract
Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a representational conflict in thermal imaging, where global scene context and object-level heat signatures interfere with each other when learned together in a single embedding space. To address these challenges, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset, which provides complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines in cross-modal retrieval across three thermal benchmarks, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.
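The decoupled dual-LoRA idea can be illustrated with a toy sketch. The abstract does not specify which CLIP layers are adapted, the ranks, or the training objective, so everything below is an assumption for illustration only: the class names `LoRAAdapter` and `DualLoRALinear`, the branch labels `"scene"` and `"object"`, and the values of `rank` and `alpha` are hypothetical and are not taken from T-CLIP. The sketch shows a single frozen weight matrix shared by two independent low-rank adapters, so that updating one branch leaves the other untouched.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class LoRAAdapter:
    """Low-rank update: delta(x) = (alpha / rank) * B @ (A @ x)."""

    def __init__(self, d_in, d_out, rank, alpha, rng):
        # Standard LoRA initialisation: A is small and random, B is zero,
        # so the adapter contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class DualLoRALinear:
    """One frozen linear layer with two separate adapters (hypothetical layout)."""

    def __init__(self, W, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pretrained weight, shared by both branches
        self.adapters = {
            "scene": LoRAAdapter(d_in, d_out, rank, alpha, rng),
            "object": LoRAAdapter(d_in, d_out, rank, alpha, rng),
        }

    def forward(self, x, branch):
        # Only the selected branch's adapter is added to the frozen output.
        base = matvec(self.W, x)
        update = self.adapters[branch].delta(x)
        return [b + u for b, u in zip(base, update)]


# Toy usage: a 2x3 "pretrained" weight and one input feature vector.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer = DualLoRALinear(W)
x = [1.0, 2.0, 3.0]
scene_out = layer.forward(x, "scene")
object_out = layer.forward(x, "object")
```

Because each branch owns its own `A` and `B` matrices, gradients from an object-level objective would never modify the scene-level adapter, which is one simple way to avoid the single-embedding-space conflict the abstract describes. In a real setting the same pattern would more likely be expressed with an existing LoRA library on top of a CLIP implementation rather than hand-written matrices.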