🤖 AI Summary
To address high fine-tuning costs and negative transfer in cross-domain multimodal sequential recommendation, this paper proposes a lightweight and robust transfer learning framework. Methodologically, it introduces: (1) an algebraic constraint mechanism to enforce cross-domain semantic consistency; (2) a Cross-SSD temporal fusion module leveraging state space models (SSMs) to capture long-range dependencies; (3) dual-channel Fourier-adaptive filtering to suppress cross-modal noise propagation; and (4) shared projection with two-stage constrained optimization for cross-modal alignment. Evaluated on standard cross-domain benchmarks, the framework achieves up to a 31.78% improvement in NDCG@10 over existing state-of-the-art methods and converges about 10× faster on average during fine-tuning.
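To make item (3) more concrete, the following is a minimal sketch of frequency-domain filtering applied separately to two modality channels. It is not the paper's implementation: the function names `fourier_filter` and `dual_channel_filter`, the choice of text and image as the two channels, and the shape of the per-frequency weights are all assumptions made for illustration. In a trained model the weights would be learned; here they are passed in as plain arrays.

```python
import numpy as np

def fourier_filter(seq, freq_weights):
    """Filter a real-valued sequence along the time axis in the frequency domain.

    seq:          array of shape (T, d), one d-dimensional embedding per time step.
    freq_weights: array of shape (T // 2 + 1, d), one weight per frequency bin
                  and feature (hypothetical stand-in for learned parameters).
    """
    spectrum = np.fft.rfft(seq, axis=0)        # time domain -> frequency domain
    filtered = spectrum * freq_weights         # re-weight each frequency bin
    return np.fft.irfft(filtered, n=seq.shape[0], axis=0)  # back to time domain

def dual_channel_filter(text_seq, image_seq, w_text, w_image):
    """Apply an independent filter to each modality channel (assumed: text, image)."""
    return fourier_filter(text_seq, w_text), fourier_filter(image_seq, w_image)
```

With all-ones weights the filter leaves the sequence unchanged, and with weights that keep only the zero-frequency bin every time step collapses to the per-feature mean, which is the simplest way to see how re-weighting frequencies can damp fluctuating components.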
📝 Abstract
Sequential Recommendation (SR) systems model user preferences by analyzing interaction histories. Although transferable multi-modal SR architectures outperform traditional ID-based approaches, current methods incur substantial fine-tuning costs when adapting to new domains because of complex optimization requirements and negative transfer effects. This is a significant deployment bottleneck that prevents engineers from efficiently repurposing pre-trained models for new application scenarios with minimal tuning overhead. We propose MMM4Rec (Multi-Modal Mamba for Sequential Recommendation), a novel multi-modal SR framework that incorporates a dedicated algebraic constraint mechanism for efficient transfer learning. By combining the temporal decay properties of State Space Duality (SSD) with a time-aware modeling design, our model dynamically prioritizes key modality information, overcoming limitations of Transformer-based approaches. The framework implements a constrained two-stage process: (1) sequence-level cross-modal alignment via shared projection matrices, followed by (2) temporal fusion using our newly designed Cross-SSD module and dual-channel Fourier adaptive filtering. This architecture maintains semantic consistency while suppressing noise propagation. MMM4Rec achieves rapid fine-tuning convergence with a simple cross-entropy loss, significantly improving multi-modal recommendation accuracy while maintaining strong transferability. Extensive experiments demonstrate MMM4Rec's state-of-the-art performance, with up to a 31.78% improvement in NDCG@10 over existing models and 10 times faster convergence on average when transferring to large-scale downstream datasets.
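For readers unfamiliar with the reported metric, the sketch below computes NDCG@10 in the next-item setting commonly used for sequential recommendation, where each test case has exactly one ground-truth item. That single-target setting is an assumption here, since the abstract does not state the evaluation protocol, and the function names are illustrative only. With one relevant item the ideal DCG is 1, so the score reduces to 1 / log2(rank + 1) when the target appears in the top k and 0 otherwise.

```python
import math

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k for a single ground-truth item (assumed next-item evaluation)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0                       # target missed the top-k cut-off
    rank = top_k.index(target) + 1       # 1-based position of the target
    return 1.0 / math.log2(rank + 1)     # ideal DCG is 1 with one relevant item

def mean_ndcg_at_k(all_rankings, all_targets, k=10):
    """Average NDCG@k over test cases (one ranking and one target per case)."""
    scores = [ndcg_at_k(r, t, k) for r, t in zip(all_rankings, all_targets)]
    return sum(scores) / len(scores)
```

A target ranked first scores 1.0, a target ranked third scores 0.5, and a target outside the top 10 scores 0.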