IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: The lack of high-quality evaluation benchmarks and effective adaptation methods hinders multimodal understanding of infrared (IR) images. Method: We introduce IF-Bench, the first dedicated benchmark for IR image understanding, comprising 499 IR images drawn from 23 infrared datasets and 680 visual question-answer pairs spanning ten dimensions of image understanding. We systematically evaluate over 40 open- and closed-source multimodal large language models (MLLMs), using cyclic evaluation, bilingual assessment, and hybrid judgment strategies to make the results more reliable. To address domain shift, we propose GenViP, a training-free generative visual prompting method that uses image editing models to translate IR images into semantically and spatially aligned RGB counterparts. Results: Experiments show that GenViP consistently improves IR understanding across a wide range of MLLMs, and the analysis reveals how model scale, architecture, and inference paradigm affect IR image comprehension. The benchmark and code are publicly released.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.
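The abstract names cyclic evaluation without defining it. The following is a minimal sketch of one common reading, assuming it means rotating the answer options of a multiple-choice question and counting the question as correct only if the model is right under every rotation. The function names, the `ask` callable, and the toy models are hypothetical and are not taken from the IF-Bench code.

```python
from typing import Callable, List, Sequence

def rotations(options: Sequence[str]) -> List[List[str]]:
    """Return every cyclic shift of the option list."""
    n = len(options)
    return [list(options[k:]) + list(options[:k]) for k in range(n)]

def cyclic_correct(
    question: str,
    options: Sequence[str],
    answer: str,
    ask: Callable[[str, List[str]], str],
) -> bool:
    """True only if `ask` returns the right option text under every rotation.

    `ask` is a hypothetical wrapper around a model call: it receives the
    question and the (shifted) options and returns the chosen option text.
    """
    return all(ask(question, shifted) == answer for shifted in rotations(options))

# Toy models (assumptions for illustration only).
def first_option_model(question: str, options: List[str]) -> str:
    # Position-biased: always picks whatever is listed first.
    return options[0]

def oracle_model(question: str, options: List[str]) -> str:
    # Always picks "person", regardless of option order.
    return "person"

opts = ["person", "car", "tree"]
biased_ok = cyclic_correct("What is the warm object?", opts, "person", first_option_model)
oracle_ok = cyclic_correct("What is the warm object?", opts, "person", oracle_model)
```

Under this reading, a model that merely favors one answer position passes a single ordering by luck but fails the cyclic check, which is why such a protocol can make reported accuracy more reliable.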
Problem

Research questions and friction points this paper is trying to address.

MLLMs' ability to understand infrared images has not been systematically evaluated
No high-quality benchmark of curated infrared images and questions exists for this task
Domain shift between infrared and RGB images needs to be reduced
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces IF-Bench benchmark for infrared image evaluation
Proposes training-free generative visual prompting method
Translates infrared images to aligned RGB counterparts
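The generative visual prompting idea listed above can be sketched as a two-step pipeline: translate the infrared image to RGB, then query the MLLM. This is a minimal sketch under stated assumptions, not the authors' implementation. The `translate_ir_to_rgb` and `ask_mllm` callables are hypothetical placeholders for an image editing model and an MLLM, and whether the original infrared image is passed alongside the translated one is an assumption exposed here as the `keep_infrared` flag.

```python
from typing import Callable, List

Image = bytes  # stand-in type; a real pipeline would use an image object

def genvip_answer(
    ir_image: Image,
    question: str,
    translate_ir_to_rgb: Callable[[Image], Image],
    ask_mllm: Callable[[List[Image], str], str],
    keep_infrared: bool = True,
) -> str:
    """Answer a question about an infrared image without any training.

    Step 1: generate an RGB counterpart of the infrared image.
    Step 2: give the MLLM the RGB image (optionally with the original)
            together with the question.
    """
    rgb_image = translate_ir_to_rgb(ir_image)
    images = [ir_image, rgb_image] if keep_infrared else [rgb_image]
    return ask_mllm(images, question)

# Stub components (assumptions for illustration only).
def fake_translate(ir: Image) -> Image:
    return b"rgb:" + ir

def fake_mllm(images: List[Image], question: str) -> str:
    return f"{len(images)} image(s) seen for: {question}"

both = genvip_answer(b"ir", "What is shown?", fake_translate, fake_mllm)
rgb_only = genvip_answer(b"ir", "What is shown?", fake_translate, fake_mllm, keep_infrared=False)
```

Because both components are passed in as callables, the same wrapper can sit in front of any MLLM, which matches the abstract's claim that the method is training-free and applies across models.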
Tao Zhang
MAIS, Institute of Automation; School of Artificial Intelligence, UCAS
Yuyang Hong
MAIS, Institute of Automation; School of Artificial Intelligence, UCAS
Yang Xia
Dalian University of Technology
computational mechanics, automotive engineering
Kun Ding
CASIA
CV, Multimodal
Zeyu Zhang
Research Center of Aerospace Information, Institute of Automation
Ying Wang
MAIS, Institute of Automation; Research Center of Aerospace Information, Institute of Automation
Shiming Xiang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Distance Metric Learning, Semi-supervised Learning, Manifold Learning, Regression, Feature Selection
Chunhong Pan
MAIS, Institute of Automation; Research Center of Aerospace Information, Institute of Automation