🤖 AI Summary
Traditional recommender systems rely on item IDs, which generalize poorly to large, dynamic item catalogs and long-tail scenarios. While existing semantic-ID methods leverage multimodal content to improve recommendations for novel or rare items, they suffer from two key challenges: (1) an imbalance between cross-modal collaboration and modality-specific representation, and (2) misalignment between semantic representations and user behavioral preferences. To address these, we propose MMQ, a unified multimodal quantization framework. MMQ employs a shared-specific multi-expert tokenizer to jointly model cross-modal synergy and modality uniqueness, enforces modality disentanglement via orthogonal regularization, and bridges the semantic-behavioral gap through a multimodal reconstruction loss and behavior-aware fine-tuning. The framework natively supports both generative retrieval and discriminative ranking. Extensive offline and online experiments demonstrate that MMQ significantly improves recommendation performance for both trending and long-tail items, with strong scalability, generalization, and task adaptability.
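To make the shared-specific design concrete, here is a minimal NumPy sketch of a multi-expert layer with an orthogonality penalty. All dimensions, expert shapes, and the regularizer's exact form are illustrative assumptions; the summary only states that shared and modality-specific experts are disentangled via orthogonal regularization, not how it is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the actual MMQ architecture is not specified here.
d_in, d_out = 8, 4

# One modality-shared expert plus one expert per modality (text, image).
W_shared = rng.normal(size=(d_in, d_out))
W_text = rng.normal(size=(d_in, d_out))
W_image = rng.normal(size=(d_in, d_out))

def orthogonal_penalty(W_a, W_b):
    """Squared Frobenius norm of W_a^T W_b: zero exactly when the two
    experts' column spaces are orthogonal, pushing shared and specific
    experts toward encoding non-overlapping information."""
    return float(np.sum((W_a.T @ W_b) ** 2))

def tokenizer_features(x_text, x_image):
    """Concatenate shared and modality-specific expert outputs."""
    shared = (x_text + x_image) @ W_shared   # cross-modal synergy
    spec_t = x_text @ W_text                 # text-only signal
    spec_i = x_image @ W_image               # image-only signal
    return np.concatenate([shared, spec_t, spec_i])

# Regularizer added to the training loss (weighting is an assumption):
reg = (orthogonal_penalty(W_shared, W_text)
       + orthogonal_penalty(W_shared, W_image))
```

In a trained tokenizer this penalty would be minimized jointly with the reconstruction and quantization objectives, so the shared expert cannot simply duplicate what a modality-specific expert already captures.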
📝 Abstract
Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to disentangle them so that both shared and modality-specific information are captured. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
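The abstract centers on semantic IDs, i.e., discrete codes assigned to items by quantizing their multimodal embeddings. The sketch below illustrates one common way such IDs are produced, residual quantization against a stack of codebooks; the number of levels, codebook size, and dimensions are assumptions for illustration, not MMQ's published configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; MMQ's actual codebook configuration is not given here.
n_levels, codebook_size, dim = 3, 16, 4
codebooks = rng.normal(size=(n_levels, codebook_size, dim))

def semantic_id(embedding):
    """Residual quantization: at each level pick the nearest code word,
    subtract it from the running residual, and quantize the remainder.
    The resulting tuple of code indices is the item's semantic ID."""
    residual = embedding.copy()
    ids = []
    for level in range(n_levels):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        ids.append(idx)
        residual = residual - codebooks[level][idx]
    return tuple(ids)

item_embedding = rng.normal(size=dim)  # e.g., fused text+image features
sid = semantic_id(item_embedding)
```

Because similar embeddings map to overlapping code tuples, new or long-tail items inherit codes shared with well-observed items, which is the knowledge-transfer property the abstract attributes to semantic IDs.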