DCER: Dual-Stage Compression and Energy-Based Reconstruction

📅 2026-02-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two key robustness challenges in multimodal fusion: noise interference and missing modalities. The authors propose the first unified framework integrating a two-stage compression mechanism with an energy-based model. In the first stage, intra-modal frequency-domain transformations—wavelet for audio and DCT for video—combined with cross-modal bottleneck tokens enable effective denoising and deep fusion. In the second stage, for missing modalities, learned energy functions reconstruct representations via gradient descent, while the associated energy values quantify predictive uncertainty. The method achieves state-of-the-art results across all metrics on CMU-MOSI, CMU-MOSEI, and CH-SIMS, maintains stable performance under high modality dropout rates, and demonstrates a strong correlation (ρ > 0.72) between its intrinsic uncertainty estimates and prediction errors.
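
As a rough illustration of the intra-modal frequency-domain step, the sketch below applies an orthonormal DCT along the time axis of a feature sequence and keeps only the lowest-frequency coefficients. The truncation rule, the `keep` parameter, and the function names are assumptions made for illustration. The summary does not say how DCER selects coefficients, and the wavelet branch used for audio is not shown.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; each row is one basis vector."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis

def dct_lowpass(features: np.ndarray, keep: int) -> np.ndarray:
    """Zero all but the `keep` lowest DCT coefficients along axis 0 (time).

    Hypothetical compression rule: plain low-pass truncation.
    """
    d = dct_matrix(features.shape[0])
    coeffs = d @ features      # forward transform
    coeffs[keep:] = 0.0        # discard high-frequency rows
    return d.T @ coeffs        # inverse transform (d is orthogonal)

# Toy usage: a smooth signal corrupted by Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(64)
clean = np.sin(2 * np.pi * t / 64)[:, None]
noisy = clean + 0.5 * rng.normal(size=clean.shape)
denoised = dct_lowpass(noisy, keep=8)
```

Because the basis is orthonormal, the inverse transform is just the transpose, and keeping every coefficient returns the input unchanged.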

📝 Abstract

Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification (ρ > 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, with a U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. The code will be available on GitHub.
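
The reconstruction step can be pictured with a toy example: gradient descent on an energy function over the missing representation, with the final energy returned as an uncertainty signal. Everything specific here is an assumption for illustration. The quadratic energy, the matrix `A` standing in for learned parameters, the regularizer `lam`, and the step-size rule are hypothetical; the abstract does not specify the form of DCER's energy function.

```python
import numpy as np

def energy(z, c, A, lam):
    """Hypothetical energy: how poorly candidate z explains the observed features c."""
    r = A @ z - c
    return 0.5 * r @ r + 0.5 * lam * z @ z

def energy_grad(z, c, A, lam):
    """Analytic gradient of the quadratic energy with respect to z."""
    return A.T @ (A @ z - c) + lam * z

def reconstruct(c, A, lam=0.1, steps=200):
    """Recover a missing representation by gradient descent on the energy.

    Returns the reconstruction and its final energy; in this toy setup a
    higher final energy is read as a less trustworthy reconstruction.
    """
    z = np.zeros(A.shape[1])                          # start from a neutral guess
    lr = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam)      # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        z = z - lr * energy_grad(z, c, A, lam)
    return z, energy(z, c, A, lam)

# Toy usage: A plays the role of a learned map from the missing modality
# to the observed one (an assumption for this sketch).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
c = rng.normal(size=6)
z_hat, final_energy = reconstruct(c, A)
```

With a step size of one over the gradient's Lipschitz constant, each iteration cannot increase this convex energy, so the returned value lies between the minimum and the energy of the initial guess. A learned, non-convex energy would not carry that guarantee.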

Problem

Research questions and friction points this paper is trying to address.

multimodal fusion

noisy inputs

missing modalities

robustness

representation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-stage compression

energy-based reconstruction

multimodal fusion