🤖 AI Summary
Cross-domain sequential recommendation (CDSR) suffers from two challenges: underutilization of multimodal information and coarse-grained modeling of cross-domain interests. To address them, we propose a cognition-inspired multimodal hierarchical attention framework. First, we freeze the CLIP model to extract lightweight vision–text joint representations and enforce cross-modal semantic consistency via embedding alignment. Second, we design a hierarchical self-attention mechanism that jointly captures intra-sequence dynamic preferences and inter-domain structural associations. Finally, we fuse intra-domain and cross-domain sequence encodings to enable fine-grained interest transfer. This work is the first to introduce human-cognition-motivated hierarchical attention into CDSR, achieving both interpretability and efficiency. Extensive experiments on four e-commerce datasets demonstrate state-of-the-art performance, with up to a 12.6% improvement in Recall@10, validating the effectiveness of multimodal cognitive modeling for cross-domain interest transfer.
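The two-level attention described above (intra-sequence attention within each domain, then attention across domain summaries, then fusion) can be sketched with plain NumPy. This is a minimal illustration, not the paper's implementation: the pooling choice (mean), the concatenation-based fusion, and all function names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq, d_k):
    # Scaled dot-product self-attention over one sequence of shape (len, d).
    scores = seq @ seq.T / np.sqrt(d_k)
    return softmax(scores) @ seq

def hierarchical_fusion(domain_seqs):
    # Level 1: intra-sequence attention captures dynamic preferences
    # within each domain's interaction sequence.
    d = domain_seqs[0].shape[1]
    intra = [self_attention(s, d) for s in domain_seqs]
    # Pool each attended sequence into one summary vector per domain.
    summaries = np.stack([h.mean(axis=0) for h in intra])   # (n_domains, d)
    # Level 2: inter-domain attention models structural associations
    # between the domain summaries.
    inter = self_attention(summaries, d)                    # (n_domains, d)
    # Fuse intra-domain and cross-domain encodings (concatenation here;
    # the actual fusion operator in the paper may differ).
    return [np.concatenate([h.mean(axis=0), g]) for h, g in zip(intra, inter)]

rng = np.random.default_rng(0)
seqs = [rng.normal(size=(5, 16)), rng.normal(size=(7, 16))]  # two toy domains
fused = hierarchical_fusion(seqs)
```

Each fused vector carries both a within-domain preference summary and a cross-domain context component, which is the fine-grained transfer signal the framework feeds downstream.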
📝 Abstract
Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains, modeling cross-domain preferences through intra- and inter-sequence item relationships. Inspired by human cognitive processes, we propose Hierarchical Attention Fusion of Visual and Textual Representations (HAF-VT), a novel approach that integrates visual and textual data to enhance cognitive modeling. Using a frozen CLIP model, we generate image and text embeddings that enrich item representations with multimodal information. A hierarchical attention mechanism jointly learns single-domain and cross-domain preferences, mimicking human information integration. Evaluated on four e-commerce datasets, HAF-VT outperforms existing methods in capturing cross-domain user interests, bridging cognitive principles with computational models and highlighting the role of multimodal data in sequential decision-making.
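The cross-modal embedding alignment mentioned in both summary and abstract is in the spirit of CLIP's contrastive objective: matching image/text pairs of the same item should be closer than mismatched pairs. A minimal sketch of such a symmetric InfoNCE-style alignment loss, assuming precomputed embeddings stand in for frozen CLIP outputs (the temperature value and all names are illustrative assumptions):

```python
import numpy as np

def l2norm(x):
    # Normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric contrastive alignment: the diagonal of the similarity
    # matrix holds the matching (image, text) pairs for the same items.
    img, txt = l2norm(img_emb), l2norm(txt_emb)
    logits = img @ txt.T / temperature                     # (n, n)
    idx = np.arange(len(logits))
    # Image -> text direction: cross-entropy with the diagonal as targets.
    log_p_i = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_i = -log_p_i[idx, idx].mean()
    # Text -> image direction, same loss on the transposed logits.
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t = -log_p_t[idx, idx].mean()
    return (loss_i + loss_t) / 2
```

Minimizing this loss pulls each item's visual and textual representations together, which is one common way to enforce the cross-modal semantic consistency the method relies on.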