ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-light image enhancement (LLIE) suffers from severe noise, blurred details, and insufficient contrast. To address these challenges, this paper proposes the first large-scale multimodal Transformer framework specifically designed for LLIE, systematically incorporating nine auxiliary modalities, including depth, semantic segmentation, geometric structure, and color distribution. The core innovation is a cross-modal multi-head self-attention (CM-MSA) mechanism that enables fine-grained alignment and deep fusion of RGB features with heterogeneous multimodal representations. The authors further design a Cross-modal Transformer (CM-T) backbone alongside lightweight auxiliary subnetworks that jointly reconstruct and fuse multimodal features. Extensive experiments demonstrate state-of-the-art performance across multiple standard benchmarks, with significant improvements in PSNR and SSIM. The code and pretrained models are publicly released.

📝 Abstract
Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features (including deep feature embeddings, segmentation information, geometric cues, and color information) to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.
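The paper does not publish the exact CM-MSA formulation in this summary, but the idea of fusing RGB features with auxiliary-modality features inside multi-head attention can be sketched as follows. This is a minimal, hypothetical PyTorch sketch (class name, projection layout, and the query-from-RGB / key-value-from-auxiliary split are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn


class CrossModalMSA(nn.Module):
    """Hypothetical sketch of cross-modal multi-head self-attention.

    Queries are projected from RGB features while keys and values come
    from auxiliary-modality features, so the resulting attention maps
    mix information from both streams (a "hybrid" attention map).
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)        # queries from RGB features
        self.kv_proj = nn.Linear(dim, 2 * dim)   # keys/values from auxiliary features
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, rgb_feats: torch.Tensor, aux_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats: (B, N, C) RGB tokens; aux_feats: (B, M, C) fused auxiliary tokens
        B, N, C = rgb_feats.shape
        M = aux_feats.shape[1]
        q = self.q_proj(rgb_feats).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv_proj(aux_feats).view(B, M, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)          # each: (B, heads, M, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)               # (B, heads, N, M)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)
```

In practice the nine auxiliary modalities would first be encoded by their subnetworks and merged into `aux_feats` before attention; how ModalFormer performs that merge is not described here.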
Problem

Research questions and friction points this paper is trying to address.

Enhancing low-light images with noise and poor contrast
Integrating multiple visual modalities for better enhancement
Fusing RGB data with auxiliary features via transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Transformer for low-light enhancement
Cross-modal Multi-headed Self-Attention mechanism
Integrates nine auxiliary modalities to improve enhancement quality