PGF-Net: A Progressive Gated-Fusion Framework for Efficient Multimodal Sentiment Analysis

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low feature fusion efficiency, poor interpretability, and model redundancy in multimodal sentiment analysis, this paper proposes a Progressive Gated Fusion (PGF) framework. PGF introduces intra-layer progressive fusion and adaptive gated arbitration mechanisms within deep Transformer layers to enable dynamic, synergistic modeling of textual, acoustic, and visual features. Furthermore, it integrates Cross-Attention with Low-Rank Adaptation (LoRA) and Post-Fusion Adapters into a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy, balancing representational capacity and parameter efficiency. Evaluated on the MOSI dataset, PGF achieves state-of-the-art performance with only 3.09M trainable parameters—attaining MAE = 0.691 and F1-Score = 86.9%. The framework significantly enhances modeling stability, feature-level interpretability, and practical deployability.

📝 Abstract
We introduce PGF-Net (Progressive Gated-Fusion Network), a novel deep learning framework designed for efficient and interpretable multimodal sentiment analysis. Our framework incorporates three primary innovations. Firstly, we propose a Progressive Intra-Layer Fusion paradigm, where a Cross-Attention mechanism empowers the textual representation to dynamically query and integrate non-linguistic features from audio and visual streams within the deep layers of a Transformer encoder. This enables a deeper, context-dependent fusion process. Secondly, the model incorporates an Adaptive Gated Arbitration mechanism, which acts as a dynamic controller to balance the original linguistic information against the newly fused multimodal context, ensuring stable and meaningful integration while preventing noise from overwhelming the signal. Lastly, a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy is employed, synergistically combining global adaptation via LoRA with local refinement through Post-Fusion Adapters. This significantly reduces trainable parameters, making the model lightweight and suitable for resource-limited scenarios. These innovations are integrated into a hierarchical encoder architecture, enabling PGF-Net to perform deep, dynamic, and interpretable multimodal sentiment analysis while maintaining exceptional parameter efficiency. Experimental results on the MOSI dataset demonstrate that our proposed PGF-Net achieves state-of-the-art performance, with a Mean Absolute Error (MAE) of 0.691 and an F1-Score of 86.9%. Notably, our model achieves these results with only 3.09M trainable parameters, showcasing a superior balance between performance and computational efficiency.
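To make the fusion-and-arbitration idea concrete, the following is a minimal NumPy sketch of the two mechanisms the abstract describes: text tokens query non-linguistic frames via cross-attention, and a sigmoid gate arbitrates between the original text representation and the fused context. This is an illustrative approximation, not the authors' implementation; all shapes, names, and the single-head attention form are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, nonverbal, d_k):
    # Text tokens act as queries; audio/visual frames supply keys and values,
    # so each text token pulls in the non-linguistic context it needs.
    scores = text @ nonverbal.T / np.sqrt(d_k)   # (T_text, T_frames)
    return softmax(scores, axis=-1) @ nonverbal  # (T_text, d)

def gated_fusion(text, fused, W_g, b_g):
    # Adaptive gate: per token and per dimension, decide how much fused
    # multimodal context to admit versus the original linguistic signal.
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([text, fused], axis=-1) @ W_g + b_g)))
    return g * fused + (1.0 - g) * text          # convex combination

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))    # 5 text tokens (hypothetical sizes)
audio = rng.standard_normal((12, d))  # 12 acoustic frames
W_g = rng.standard_normal((2 * d, d)) * 0.1
b_g = np.zeros(d)

fused = cross_attention(text, audio, d_k=d)
out = gated_fusion(text, fused, W_g, b_g)
print(out.shape)  # (5, 8)
```

Because the gate outputs lie in (0, 1), each output coordinate is a convex blend of the text and fused features, which is what keeps noisy non-linguistic input from overwhelming the linguistic signal.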
Problem

Research questions and friction points this paper is trying to address.

Efficient multimodal sentiment analysis with reduced parameters
Dynamic fusion of text, audio, and visual features
Interpretable cross-modal integration preventing information noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Intra-Layer Fusion with Cross-Attention
Adaptive Gated Arbitration for balanced integration
Hybrid PEFT strategy combining LoRA and Adapters
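The hybrid PEFT idea in the last bullet can be sketched as follows: LoRA adds a trainable low-rank update to a frozen weight (global adaptation), while a bottleneck adapter with a residual connection refines features after the fusion step (local refinement). A minimal NumPy illustration, assuming standard LoRA and adapter formulations; dimensions, rank, and initialisation are illustrative, not taken from the paper.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16, r=4):
    # Frozen pretrained weight W plus trainable low-rank update B @ A,
    # shrinking trainable parameters from d*d to r*(d + d).
    return x @ (W + (alpha / r) * (B @ A)).T

def post_fusion_adapter(h, W_down, W_up):
    # Bottleneck adapter after fusion: down-project, ReLU, up-project,
    # then add back the input via a residual connection.
    return h + np.maximum(h @ W_down, 0.0) @ W_up

rng = np.random.default_rng(0)
d, r, bottleneck = 16, 4, 4
x = rng.standard_normal((3, d))
W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable
B = np.zeros((d, r))                    # trainable, zero-initialised
W_down = rng.standard_normal((d, bottleneck)) * 0.1
W_up = rng.standard_normal((bottleneck, d)) * 0.1

h = lora_linear(x, W, A, B)
out = post_fusion_adapter(h, W_down, W_up)
print(out.shape)  # (3, 16)
```

With `B` zero-initialised, the LoRA branch contributes nothing at the start of training, so fine-tuning begins exactly from the frozen pretrained behaviour; only `A`, `B`, and the adapter weights are updated, which is how the parameter count stays in the low millions.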