🤖 AI Summary
To address the challenge of efficiently fine-tuning large multimodal Transformer models on resource-constrained edge devices, this paper proposes Multimodal Parallel Split Learning (MPSL). MPSL introduces a modality-agnostic unified encoder and lightweight client-side tokenizers, enabling flexible adaptation across multimodal tasks without requiring label sharing, client synchronization, or submodel management. By integrating split learning, asynchronous distributed optimization, and sublinear communication design, MPSL achieves comparable or superior performance to federated learning across seven benchmark datasets, reduces client-side computational overhead by 250×, and ensures communication cost scales sublinearly with model size. The core contributions lie in overcoming three fundamental limitations of conventional split learning—modality coupling, synchronization dependency, and communication bottlenecks—thereby unifying privacy preservation, ultra-low resource consumption, and strong generalization capability.
📝 Abstract
Multimodal transformers integrate diverse data types such as images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their large parameter counts limit deployment on resource-constrained edge devices. Split Learning (SL), which partitions a model at a designated cut layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computationally efficient, distributed fine-tuning of multimodal transformers that eliminates label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7 multimodal datasets demonstrates that MPSL matches or outperforms Federated Learning, reduces client-side computation by 250x, and achieves superior scalability in communication cost as model size grows. Through extensive analysis, we highlight task suitability, trade-offs, and scenarios where MPSL excels, inspiring further exploration.
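The split described above (a lightweight tokenizer on the client up to the cut layer, a heavy shared encoder on the server, and labels kept on the client) can be sketched as a single training step. This is a minimal illustrative sketch only: the class names, linear layers, dimensions, and MSE objective are assumptions for clarity, not the paper's actual MPSL implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class ClientTokenizer:
    """Lightweight client-side module: maps raw input to cut-layer tokens."""
    def __init__(self, d_in, d_tok):
        self.W = rng.normal(0.0, 0.1, (d_in, d_tok))
    def forward(self, x):
        self._x = x                      # cache input for the backward pass
        return x @ self.W                # "smashed" activations sent to the server
    def backward(self, g_tok, lr=0.1):
        self.W -= lr * (self._x.T @ g_tok)

class ServerEncoder:
    """Shared modality-agnostic encoder (here just one linear layer)."""
    def __init__(self, d_tok, d_out):
        self.V = rng.normal(0.0, 0.1, (d_tok, d_out))
    def forward(self, tok):
        self._tok = tok
        return tok @ self.V              # predictions returned to the client
    def backward(self, g_out, lr=0.1):
        g_tok = g_out @ self.V.T         # cut-layer gradient, sent back to client
        self.V -= lr * (self._tok.T @ g_out)
        return g_tok

def train_step(client, server, x, y):
    tok = client.forward(x)              # client -> server: activations only
    out = server.forward(tok)            # server -> client: predictions
    loss = float(np.mean((out - y) ** 2))
    g_out = 2.0 * (out - y) / y.size     # loss gradient computed ON the client,
                                         # so labels never leave the device
    g_tok = server.backward(g_out)       # server backprops to the cut layer
    client.backward(g_tok)               # client updates only its tokenizer
    return loss

# Toy data for one client; many clients would share the same ServerEncoder.
x = rng.normal(size=(16, 8))
y = rng.normal(size=(16, 4))
client, server = ClientTokenizer(8, 6), ServerEncoder(6, 4)
losses = [train_step(client, server, x, y) for _ in range(100)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Note the communication pattern this implies: per step, the client uploads cut-layer activations and downloads their gradients, both sized by the cut-layer width rather than the full model, which is why communication can scale sublinearly as the server-side encoder grows.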