🤖 AI Summary
This work addresses the degradation of multimodal fusion and cross-modal interaction caused by missing modalities in multimodal perception. To preserve the structural integrity of the fusion process, the authors propose COMPASS, a framework that generates high-quality proxy tokens for missing modalities within a shared latent space, ensuring that the fusion module always receives an input of fixed structure. Guided by a fusion completeness principle, the method employs a source-to-target proxy generation mechanism enhanced with alignment constraints, shared-space regularization, and discriminative supervision to improve proxy fidelity. Extensive experiments across multiple datasets—including XRF55, MM-Fi, and OctoNet—under diverse modality-missing scenarios demonstrate that the proposed approach significantly outperforms existing methods, confirming the effectiveness and generalizability of the fusion completeness design.
📝 Abstract
Missing modalities remain a major challenge for multimodal sensing: most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion heads, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods in the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
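To make the fixed N-slot fusion interface concrete, the following is a minimal, illustrative PyTorch sketch (not the authors' implementation; all module names, shapes, and the linear generators/fusion head are assumptions). It shows the core idea from the abstract: a missing modality's slot is filled by averaging pairwise source-to-target proxy tokens generated from the observed modalities, so the fusion head always sees one token per modality slot.

```python
# Illustrative sketch of a modality-complete fusion interface with
# pairwise source->target proxy generators. Hypothetical architecture;
# the real COMPASS generators, losses, and fusion head are not shown.
import torch
import torch.nn as nn


class ProxyFusion(nn.Module):
    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        self.n = num_modalities
        # One generator per ordered (source, target) pair of distinct
        # modalities, mapping a source token to a target-specific proxy.
        self.generators = nn.ModuleDict({
            f"g{s}_{t}": nn.Linear(dim, dim)
            for s in range(self.n) for t in range(self.n) if s != t
        })
        # Placeholder fusion head: always consumes all N slots.
        self.fusion = nn.Linear(self.n * dim, dim)

    def forward(self, tokens: dict) -> torch.Tensor:
        """tokens maps an observed modality index -> (batch, dim) token."""
        observed = sorted(tokens)
        slots = []
        for t in range(self.n):
            if t in tokens:
                slots.append(tokens[t])
            else:
                # Aggregate the pairwise proxies for the missing target
                # modality into a single replacement token (mean here).
                proxies = [self.generators[f"g{s}_{t}"](tokens[s])
                           for s in observed]
                slots.append(torch.stack(proxies).mean(dim=0))
        # The fusion head receives a fixed N-slot input regardless of
        # which modalities were actually observed.
        return self.fusion(torch.cat(slots, dim=-1))


model = ProxyFusion(num_modalities=3, dim=16)
x = {0: torch.randn(2, 16), 2: torch.randn(2, 16)}  # modality 1 missing
out = model(x)
print(out.shape)  # torch.Size([2, 16])
```

Because the fusion head's input structure never changes, the same head handles complete and incomplete inputs; the paper's alignment, regularization, and discriminative losses (omitted above) are what make the proxies faithful stand-ins.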