Decoupled Multimodal Fusion for User Interest Modeling in Click-Through Rate Prediction

📅 2025-10-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing industrial CTR models typically use modality-centric modeling, processing ID-based and multimodal embeddings independently, and so fail to capture fine-grained interactions between ID-based behavioral signals and multimodal content semantics. To address this, the paper proposes Decoupled Multimodal Fusion (DMF). First, target-aware features are constructed to bridge the semantic gap between the ID embedding space and the multimodal representation space, and are used as side information for user interest modeling. Second, an inference-optimized attention mechanism decouples the computation of target-aware features from that of ID-based embeddings before the attention layer, alleviating the computational bottleneck that target-aware features introduce. DMF then combines the user interest representations learned under the modality-centric and modality-enriched strategies. Offline experiments on public and industrial datasets validate its effectiveness. Deployed in Lazada's product recommendation system, DMF achieves relative improvements of 5.30% in CTCVR and 7.43% in GMV with negligible computational overhead.

📝 Abstract

Modern industrial recommendation systems improve recommendation performance by integrating multimodal representations from pre-trained models into ID-based Click-Through Rate (CTR) prediction frameworks. However, existing approaches typically adopt modality-centric modeling strategies that process ID-based and multimodal embeddings independently, failing to capture fine-grained interactions between content semantics and behavioral signals. In this paper, we propose Decoupled Multimodal Fusion (DMF), which introduces a modality-enriched modeling strategy to enable fine-grained interactions between ID-based collaborative representations and multimodal representations for user interest modeling. Specifically, we construct target-aware features to bridge the semantic gap across different embedding spaces and leverage them as side information to enhance the effectiveness of user interest modeling. Furthermore, we design an inference-optimized attention mechanism that decouples the computation of target-aware features and ID-based embeddings before the attention layer, thereby alleviating the computational bottleneck introduced by incorporating target-aware features. To achieve comprehensive multimodal integration, DMF combines user interest representations learned under the modality-centric and modality-enriched modeling strategies. Offline experiments on public and industrial datasets demonstrate the effectiveness of DMF. Moreover, DMF has been deployed on the product recommendation system of the international e-commerce platform Lazada, achieving relative improvements of 5.30% in CTCVR and 7.43% in GMV with negligible computational overhead.
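The abstract does not spell out how target-aware features are built. As a rough illustration only, one plausible construction is the cosine similarity between the target item's pretrained multimodal embedding and each behavior item's multimodal embedding, bucketized so it can index a small learned embedding table. The function names, the use of cosine similarity, and the bucket count below are all assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def target_aware_similarity(target_mm, behavior_mm, eps=1e-12):
    """Cosine similarity between a target item's multimodal embedding (shape [d])
    and each behavior item's multimodal embedding (shape [n, d]).
    Hypothetical helper: the paper's actual feature may differ."""
    target_mm = np.asarray(target_mm, dtype=float)
    behavior_mm = np.asarray(behavior_mm, dtype=float)
    t = target_mm / (np.linalg.norm(target_mm) + eps)
    b = behavior_mm / (np.linalg.norm(behavior_mm, axis=1, keepdims=True) + eps)
    return b @ t  # shape [n], values in [-1, 1]

def bucketize(sim, num_buckets=8):
    """Map similarities in [-1, 1] to integer ids in [0, num_buckets - 1],
    so each id can look up a learned side-information embedding (assumed design)."""
    edges = np.linspace(-1.0, 1.0, num_buckets + 1)[1:-1]  # interior bucket edges
    return np.digitize(sim, edges)

# Toy example: one target item and three historical behavior items.
target = [1.0, 0.0]
behaviors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sims = target_aware_similarity(target, behaviors)
bucket_ids = bucketize(sims)
```

Because the feature depends on the (target, behavior) pair, it has to be recomputed for every candidate item, which is the cost the inference-optimized attention mechanism is meant to contain.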

Problem

Research questions and friction points this paper is trying to address.

Modeling fine-grained interactions between multimodal and ID-based representations

Bridging semantic gaps across different embedding spaces for CTR prediction

Optimizing computational efficiency when integrating target-aware multimodal features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Multimodal Fusion bridges ID and multimodal representations

Target-aware features bridge semantic gaps across embedding spaces and serve as side information for user interest modeling

Inference-optimized attention decouples the computation of target-aware features and ID-based embeddings before the attention layer
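To make the decoupling idea concrete, here is a minimal sketch of one way it could work, assuming a linear key projection. It is not the paper's actual attention mechanism. The sketch relies only on the identity `concat([e_id, f_ta]) @ W == e_id @ W_id + f_ta @ W_ta`: the ID-based part of each key can be projected once per user and reused across candidates, while only the small target-aware part is recomputed per candidate. All names (`coupled_scores`, `decoupled_scores`, `W_id`, `W_ta`) and dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def coupled_scores(query, id_emb, ta_feat, W):
    # Baseline: concatenate ID and target-aware features per behavior, then project.
    # The full projection must be redone for every candidate target item.
    keys = np.concatenate([id_emb, ta_feat], axis=1) @ W
    return keys @ query / np.sqrt(query.shape[0])

def decoupled_scores(query, id_keys, ta_feat, W_ta):
    # Decoupled: id_keys = id_emb @ W_id is precomputed once per user;
    # only the target-aware term depends on the candidate.
    keys = id_keys + ta_feat @ W_ta
    return keys @ query / np.sqrt(query.shape[0])

rng = np.random.default_rng(0)
n, d_id, d_ta, d_k = 5, 4, 2, 3            # toy sizes (assumed)
id_emb = rng.normal(size=(n, d_id))        # ID embeddings of n behaviors
ta_feat = rng.normal(size=(n, d_ta))       # target-aware features per behavior
W = rng.normal(size=(d_id + d_ta, d_k))    # key projection
query = rng.normal(size=d_k)               # target-item query vector

W_id, W_ta = W[:d_id], W[d_id:]            # split the projection by input block
id_keys = id_emb @ W_id                    # cacheable across candidates

attn_coupled = softmax(coupled_scores(query, id_emb, ta_feat, W))
attn_decoupled = softmax(decoupled_scores(query, id_keys, ta_feat, W_ta))
```

Under this linearity assumption the two paths give the same attention weights, so the saving comes purely from not repeating the ID-side projection for each candidate.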
