D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference

๐Ÿ“… 2025-09-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing cross-modal transfer methods rely on paired multimodal data at both training and inference, limiting deployment in resource-constrained single-sensor scenarios. This paper proposes D-CAT, a framework that decouples the classification pipelines of the two modalities so that only a single sensor is needed at inference time. Combining self-attention for unimodal feature extraction with a novel cross-attention alignment loss, D-CAT aligns modality-specific feature spaces during training and then supports high-accuracy inference from a single sensor modality (e.g., IMU, video, or audio) alone, substantially increasing deployment flexibility. Experiments demonstrate up to a 10% improvement in F1-score under in-distribution settings; under out-of-distribution conditions, even weaker source modalities can improve target-modality performance, effectively reducing hardware redundancy.

๐Ÿ“ Abstract
Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically or technically feasible. To address this, we propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modalities during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of the sensors' feature spaces without coupling the classification pipelines of the two modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to 10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model is not overfitted on the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at https://github.com/Schindler-EPFL-Lab/D-CAT.
Problem

Research questions and friction points this paper is trying to address.

Existing cross-modal transfer methods require paired sensor data during inference
Full sensor suites are often infeasible in resource-constrained deployments
Uni-modal models underperform without cross-modal knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Cross-Attention Transfer framework
Self-attention module for feature extraction
Cross-attention alignment loss without joint sensors
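The cross-attention alignment loss can be sketched as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation: target-modality features act as queries against source-modality keys/values, and the cross-attended output is pulled toward the target features so the two feature spaces align without coupling the classification pipelines. The function name, the query/key arrangement, and the use of an MSE distance are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_alignment_loss(f_src, f_tgt):
    """Hypothetical sketch of a cross-attention alignment loss.

    f_src: (T, d) source-modality features (e.g., video)
    f_tgt: (T, d) target-modality features (e.g., IMU)
    Returns a scalar loss >= 0.
    """
    d = f_src.shape[-1]
    # Scaled dot-product attention: target features query the source features.
    attn = softmax(f_tgt @ f_src.T / np.sqrt(d))
    attended = attn @ f_src  # (T, d) source features re-expressed per target step
    # Penalize the distance between cross-attended source features and the
    # target features, encouraging the two feature spaces to align.
    return float(np.mean((attended - f_tgt) ** 2))
```

Because the loss depends only on feature tensors, it can be added to each modality's own training objective; at inference time the source branch is simply dropped and the target model runs alone.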
๐Ÿ”Ž Similar Papers
No similar papers found.