🤖 AI Summary
Cross-domain sequential recommendation (CDSR) suffers from two challenges: underutilization of multimodal information and coarse-grained modeling of cross-domain interests. To address them, we propose a cognition-inspired multimodal hierarchical attention framework. First, we freeze the CLIP model to extract lightweight vision–text joint representations and enforce cross-modal semantic consistency via embedding alignment. Second, we design a hierarchical self-attention mechanism that jointly captures intra-sequence dynamic preferences and inter-domain structural associations. Finally, we fuse intra-domain and cross-domain sequence encodings to enable fine-grained interest transfer. This work is the first to introduce human-cognition-motivated hierarchical attention into CDSR, achieving both interpretability and efficiency. Extensive experiments on four e-commerce datasets demonstrate state-of-the-art performance, with up to a 12.6% improvement in Recall@10, validating the effectiveness of multimodal cognitive modeling for cross-domain interest transfer.
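The two-level attention described above (intra-sequence attention within each domain, then attention across domain summaries, then fusion) can be sketched with plain NumPy. This is a minimal illustration, not the paper's implementation: the pooling choice (mean), the concatenation-based fusion, and all function names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq, d_k):
    # Scaled dot-product self-attention over one sequence of shape (len, d).
    scores = seq @ seq.T / np.sqrt(d_k)
    return softmax(scores) @ seq

def hierarchical_fusion(domain_seqs):
    # Level 1: intra-sequence attention captures dynamic preferences
    # within each domain's interaction sequence.
    d = domain_seqs[0].shape[1]
    intra = [self_attention(s, d) for s in domain_seqs]
    # Pool each attended sequence into one summary vector per domain.
    summaries = np.stack([h.mean(axis=0) for h in intra])   # (n_domains, d)
    # Level 2: inter-domain attention models structural associations
    # between the domain summaries.
    inter = self_attention(summaries, d)                    # (n_domains, d)
    # Fuse intra-domain and cross-domain encodings (concatenation here;
    # the actual fusion operator in the paper may differ).
    return [np.concatenate([h.mean(axis=0), g]) for h, g in zip(intra, inter)]

rng = np.random.default_rng(0)
seqs = [rng.normal(size=(5, 16)), rng.normal(size=(7, 16))]  # two toy domains
fused = hierarchical_fusion(seqs)
```

Each fused vector carries both a within-domain preference summary and a cross-domain context component, which is the fine-grained transfer signal the framework feeds downstream.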
📝 Abstract
Cross-Domain Sequential Recommendation (CDSR) predicts user behavior by leveraging historical interactions across multiple domains, modeling cross-domain preferences through intra- and inter-sequence item relationships. Inspired by human cognitive processes, we propose Hierarchical Attention Fusion of Visual and Textual Representations (HAF-VT), a novel approach that integrates visual and textual data to enhance cognitive modeling. Using a frozen CLIP model, we generate image and text embeddings that enrich item representations with multimodal information. A hierarchical attention mechanism jointly learns single-domain and cross-domain preferences, mimicking human information integration. Evaluated on four e-commerce datasets, HAF-VT outperforms existing methods in capturing cross-domain user interests, bridging cognitive principles with computational models and highlighting the role of multimodal data in sequential decision-making.
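The cross-modal embedding alignment mentioned in both summary and abstract is in the spirit of CLIP's contrastive objective: matching image/text pairs of the same item should be closer than mismatched pairs. A minimal sketch of such a symmetric InfoNCE-style alignment loss, assuming precomputed embeddings stand in for frozen CLIP outputs (the temperature value and all names are illustrative assumptions):

```python
import numpy as np

def l2norm(x):
    # Normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric contrastive alignment: the diagonal of the similarity
    # matrix holds the matching (image, text) pairs for the same items.
    img, txt = l2norm(img_emb), l2norm(txt_emb)
    logits = img @ txt.T / temperature                     # (n, n)
    idx = np.arange(len(logits))
    # Image -> text direction: cross-entropy with the diagonal as targets.
    log_p_i = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_i = -log_p_i[idx, idx].mean()
    # Text -> image direction, same loss on the transposed logits.
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t = -log_p_t[idx, idx].mean()
    return (loss_i + loss_t) / 2
```

Minimizing this loss pulls each item's visual and textual representations together, which is one common way to enforce the cross-modal semantic consistency the method relies on.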