🤖 AI Summary
Existing low-light video enhancement methods suffer significant performance degradation when auxiliary modalities, such as event streams or infrared images, are unavailable, which limits their applicability in real-world scenarios that require flexible inference. To address this challenge, this work proposes AMNet, a unified multimodal framework that achieves, for the first time, modality-agnostic low-light video enhancement. AMNet employs a spatial-spectral dual-gated translator to model cross-modal relationships between RGB and auxiliary inputs, generating implicit auxiliary representations even when no explicit auxiliary data is available. Coupled with large-scale pretraining on synthetic data, the framework enables robust inference under arbitrary modality combinations. Extensive experiments demonstrate that AMNet consistently outperforms state-of-the-art methods across various modality-missing settings, achieving leading performance without relying on specific auxiliary inputs.
📝 Abstract
Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume that the auxiliary modalities are available at inference time, which is often not feasible in real-world scenarios. To address this problem, we propose AMNet, a unified multimodal framework for LLVE that supports flexible, modality-agnostic inference when auxiliary modalities may be unavailable. To handle modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations that support robust enhancement. Additionally, to facilitate learning of this cross-modal correspondence, we conduct large-scale multimodal pretraining on an RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet can handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE when modalities are absent. Code and models are available on the project page.
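The modality-agnostic inference described in the abstract can be illustrated with a small sketch. This is not AMNet's implementation: the names `toy_gated_translator` and `resolve_auxiliary` are hypothetical, features are plain lists of floats, and the single sigmoid gate only stands in for the learned Spatial-Spectral Dual-Gated Translator, whose actual design is not specified here. The sketch shows only the inference-time idea, assuming that an explicit auxiliary input is used when present and that an implicit representation is translated from RGB when it is missing.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Feature = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def toy_gated_translator(rgb: Sequence[float], gate_bias: float = 0.0) -> Feature:
    """Hypothetical stand-in for a learned RGB-to-auxiliary translator.

    Each RGB feature value is scaled by a sigmoid gate computed from the
    value itself. A real translator would be a trained network.
    """
    return [v * sigmoid(v + gate_bias) for v in rgb]


def resolve_auxiliary(
    rgb: Sequence[float],
    provided: Dict[str, Optional[Sequence[float]]],
    translators: Dict[str, Callable[[Sequence[float]], Feature]],
) -> Dict[str, Feature]:
    """Return one feature per auxiliary modality.

    Explicit inputs are passed through unchanged; missing ones are replaced
    by an implicit representation translated from the RGB features.
    """
    resolved: Dict[str, Feature] = {}
    for name, translate in translators.items():
        explicit = provided.get(name)
        if explicit is not None:
            resolved[name] = list(explicit)    # modality available at inference
        else:
            resolved[name] = translate(rgb)    # modality missing: use RGB-derived proxy
    return resolved


if __name__ == "__main__":
    rgb_features = [0.0, 1.0, -1.0]
    translators = {"event": toy_gated_translator, "infrared": toy_gated_translator}
    # Only the event stream is supplied; infrared is absent at inference time.
    features = resolve_auxiliary(rgb_features, {"event": [0.5, 0.5, 0.5]}, translators)
    print(features)
```

In this toy setup, the downstream enhancement step always receives the same set of auxiliary features regardless of which inputs were supplied, which is the property that allows arbitrary modality combinations at inference time.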