🤖 AI Summary
This work addresses the performance degradation in real-world multimodal federated learning caused by missing and heterogeneous modalities across clients. To tackle this challenge, the authors propose a task-agnostic, block-level federated learning framework that employs a modular neural network architecture. This design enables flexible participation from clients with arbitrary modality subsets through a block-wise parameter aggregation mechanism. Furthermore, it integrates modality-aware personalized training to preserve task-specific representations while sharing common modules across participants. Experimental results demonstrate that the proposed framework achieves an average performance gain of 18.7% in scenarios with incomplete modalities and up to 37.7% improvement when clients possess exclusive modalities, substantially enhancing the practicality and robustness of multimodal federated learning systems.
📝 Abstract
Multimodal federated learning (FL) is essential for real-world applications such as autonomous systems and healthcare, where data is distributed across heterogeneous clients with varying and often missing modalities. However, most existing FL approaches assume uniform modality availability, limiting their applicability in practice. We introduce BLOSSOM, a task-agnostic framework for multimodal FL designed to operate under shared and sparsely observed modality conditions. BLOSSOM supports clients with arbitrary modality subsets and enables flexible sharing of model components. To address client and task heterogeneity, we propose a block-wise aggregation strategy that selectively aggregates shared components while keeping task-specific blocks private, enabling partial personalization. We evaluate BLOSSOM on multiple diverse multimodal datasets and analyze the effects of missing modalities and personalization. Our results show that block-wise personalization significantly improves performance, particularly in settings with severe modality sparsity. In modality-incomplete scenarios, BLOSSOM achieves an average performance gain of 18.7% over full-model aggregation, while in modality-exclusive settings the gain increases to 37.7%, highlighting the importance of block-wise learning for practical multimodal FL systems.
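The core mechanism described above can be illustrated with a minimal sketch. This is not the BLOSSOM implementation; all names (`blockwise_aggregate`, block keys like `image_enc`) are hypothetical. It assumes each client model is a dict of named blocks, that shared blocks (e.g. modality encoders) are averaged only over the clients that possess them, and that task-specific blocks stay private:

```python
# Hypothetical sketch of block-wise aggregation with partial personalization.
# Assumptions (not from the paper): a model is a dict mapping block names to
# flattened weight lists; clients may hold arbitrary subsets of blocks.
from typing import Dict, List, Set

Params = Dict[str, List[float]]  # block name -> flattened weights


def blockwise_aggregate(client_models: List[Params],
                        shared_blocks: Set[str]) -> List[Params]:
    """Average each shared block over the clients that have it;
    leave all other (task-specific, private) blocks untouched."""
    averaged: Params = {}
    for block in shared_blocks:
        owners = [m[block] for m in client_models if block in m]
        if owners:
            # Element-wise mean across the owning clients only.
            averaged[block] = [sum(vals) / len(owners) for vals in zip(*owners)]
    # Each client keeps its private blocks and receives averages
    # only for the shared blocks it actually holds.
    return [{**m, **{b: v for b, v in averaged.items() if b in m}}
            for m in client_models]


clients = [
    {"image_enc": [1.0, 2.0], "head": [0.0]},               # image-only client
    {"image_enc": [3.0, 4.0], "text_enc": [5.0], "head": [9.0]},
    {"text_enc": [7.0], "head": [4.0]},                      # text-only client
]
updated = blockwise_aggregate(clients, {"image_enc", "text_enc"})
# image_enc averaged over clients 0 and 1 -> [2.0, 3.0];
# text_enc averaged over clients 1 and 2 -> [6.0];
# every "head" block remains personalized per client.
```

Note how modality-incomplete clients still participate: a client contributes to, and receives updates for, only the shared blocks matching its modalities, while its private head is never aggregated.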