🤖 AI Summary
To address low feature-fusion efficiency, poor interpretability, and model redundancy in multimodal sentiment analysis, this paper proposes PGF-Net, a Progressive Gated-Fusion framework. PGF-Net introduces intra-layer progressive fusion and adaptive gated arbitration mechanisms within deep Transformer layers to enable dynamic, synergistic modeling of textual, acoustic, and visual features. Furthermore, it integrates Cross-Attention with Low-Rank Adaptation (LoRA) and Post-Fusion Adapters into a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy, balancing representational capacity against parameter efficiency. Evaluated on the MOSI dataset, PGF-Net achieves state-of-the-art performance with only 3.09M trainable parameters, attaining an MAE of 0.691 and an F1-Score of 86.9%. The framework significantly enhances modeling stability, feature-level interpretability, and practical deployability.
📝 Abstract
We introduce PGF-Net (Progressive Gated-Fusion Network), a novel deep learning framework designed for efficient and interpretable multimodal sentiment analysis. Our framework incorporates three primary innovations. Firstly, we propose a Progressive Intra-Layer Fusion paradigm, where a Cross-Attention mechanism empowers the textual representation to dynamically query and integrate non-linguistic features from audio and visual streams within the deep layers of a Transformer encoder. This enables a deeper, context-dependent fusion process. Secondly, the model incorporates an Adaptive Gated Arbitration mechanism, which acts as a dynamic controller to balance the original linguistic information against the newly fused multimodal context, ensuring stable and meaningful integration while preventing noise from overwhelming the signal. Lastly, a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy is employed, synergistically combining global adaptation via LoRA with local refinement through Post-Fusion Adapters. This significantly reduces the number of trainable parameters, making the model lightweight and suitable for resource-limited scenarios. These innovations are integrated into a hierarchical encoder architecture, enabling PGF-Net to perform deep, dynamic, and interpretable multimodal sentiment analysis while maintaining exceptional parameter efficiency. Experimental results on the MOSI dataset demonstrate that our proposed PGF-Net achieves state-of-the-art performance, with a Mean Absolute Error (MAE) of 0.691 and an F1-Score of 86.9%. Notably, our model achieves these results with only 3.09M trainable parameters, showcasing a superior balance between performance and computational efficiency.
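The core mechanisms described above, text querying non-linguistic features via cross-attention, a learned gate arbitrating between the original and fused representations, and low-rank adaptation of frozen weights, can be sketched roughly as follows. This is an illustrative PyTorch sketch under stated assumptions, not the authors' implementation: the class names, the per-dimension sigmoid gate, and the LoRA initialization scheme are all assumptions for exposition.

```python
import torch
import torch.nn as nn


class GatedCrossModalFusion(nn.Module):
    """Sketch of intra-layer fusion: text queries audio/visual features
    via cross-attention, then a gate arbitrates between the original
    text representation and the fused multimodal context."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate sees both representations and emits per-dimension weights.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text: torch.Tensor, nonverbal: torch.Tensor) -> torch.Tensor:
        # text: (B, T_t, d); nonverbal: (B, T_n, d) audio/visual features.
        fused, _ = self.cross_attn(query=text, key=nonverbal, value=nonverbal)
        g = torch.sigmoid(self.gate(torch.cat([text, fused], dim=-1)))
        # g -> 1 keeps the linguistic signal; g -> 0 admits the fused context.
        return g * text + (1.0 - g) * fused


class LoRALinear(nn.Module):
    """Sketch of LoRA: freeze a base linear layer and learn a low-rank
    additive update, W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only A and B are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because `B` is zero-initialized, the LoRA branch starts as an identity over the frozen base layer, so fine-tuning begins from the pretrained behavior; only the small `A`/`B` matrices (and, in PGF-Net, the Post-Fusion Adapters) contribute trainable parameters.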