MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional recommender systems rely on item IDs, which generalize poorly to large-scale, dynamic item catalogs and sparse long-tail scenarios. While existing semantic-ID methods leverage multimodal content to improve recommendations for novel or rare items, they face two key challenges: (1) balancing cross-modal collaboration against modality-specific representation, and (2) misalignment between semantic representations and user behavioral preferences. To address these, we propose MMQ, a unified multimodal quantization framework. MMQ employs a shared-specific multi-expert tokenizer to jointly model cross-modal synergy and modality uniqueness, enforces modality disentanglement via orthogonal regularization, and bridges the semantic-behavioral gap through a multimodal reconstruction loss and behavior-aware fine-tuning. The framework natively supports both generative retrieval and discriminative ranking. Extensive offline and online experiments demonstrate that MMQ significantly improves recommendation performance for both trending and long-tail items, providing a scalable, generalizable, and task-adaptable solution.
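To make the summary's architecture concrete, here is a minimal PyTorch sketch of a shared-specific multi-expert front end with orthogonal regularization. Everything here is an illustrative assumption rather than the paper's implementation: the class name `SharedSpecificExperts`, the linear experts, the 256-d width, and the squared-cosine penalty are placeholders for whatever MMQ actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpecificExperts(nn.Module):
    """Illustrative shared-specific multi-expert front end.

    Each modality passes through its own specific expert plus a shared
    expert; an orthogonality penalty pushes the specific and shared
    outputs apart so they encode complementary information.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_expert = nn.Linear(dim, dim)    # text-specific expert
        self.image_expert = nn.Linear(dim, dim)   # image-specific expert
        self.shared_expert = nn.Linear(dim, dim)  # modality-shared expert

    @staticmethod
    def _ortho(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Squared cosine similarity: 0 when the two views are orthogonal.
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        return (a * b).sum(dim=-1).pow(2).mean()

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor):
        t_spec = self.text_expert(text_emb)
        i_spec = self.image_expert(image_emb)
        t_shared = self.shared_expert(text_emb)
        i_shared = self.shared_expert(image_emb)
        # Disentangle: shared and specific views of the same item should differ.
        ortho_loss = self._ortho(t_spec, t_shared) + self._ortho(i_spec, i_shared)
        # Fuse everything into one vector for the downstream quantizer.
        fused = torch.cat([t_spec, i_spec, t_shared + i_shared], dim=-1)
        return fused, ortho_loss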

📝 Abstract
Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to capture comprehensive multimodal information. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
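The stage-2 objective described in the abstract, behavior-aware fine-tuning constrained by a multimodal reconstruction loss, might be combined roughly as follows. This is a hedged sketch: the decoder shapes, the MSE reconstruction, and the `alpha` weighting are assumptions, and the paper's exact losses may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehaviorAwareFinetune(nn.Module):
    """Illustrative stage-2 objective: a downstream recommendation loss
    adapts semantic-ID embeddings to user behavior, while per-modality
    decoders reconstruct the original content embeddings so modality
    information is not washed out during fine-tuning."""

    def __init__(self, sid_dim: int = 768, mod_dim: int = 256, alpha: float = 0.1):
        super().__init__()
        self.text_decoder = nn.Linear(sid_dim, mod_dim)   # reconstruct text view
        self.image_decoder = nn.Linear(sid_dim, mod_dim)  # reconstruct image view
        self.alpha = alpha  # assumed weight on the reconstruction term

    def forward(self, sid_emb, text_emb, image_emb, rec_loss):
        recon = (F.mse_loss(self.text_decoder(sid_emb), text_emb)
                 + F.mse_loss(self.image_decoder(sid_emb), image_emb))
        # rec_loss comes from the retrieval/ranking task (e.g. cross-entropy).
        return rec_loss + self.alpha * recon
```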
Problem

Research questions and friction points this paper is trying to address.

Balancing cross-modal synergy with modality-specific uniqueness in semantic IDs
Bridging semantic-behavioral gap between item representations and user preferences
Improving recommendation scalability and generalization for dynamic item corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal tokenizer with shared-specific experts (see the quantization sketch after this list)
Behavior-aware fine-tuning with reconstruction loss
Two-stage framework for semantic-behavioral alignment
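As referenced above, here is a hypothetical sketch of how a quantization-based tokenizer turns a fused multimodal embedding into a discrete semantic ID. Residual quantization is one common choice for semantic-ID generation; MMQ's mixture-of-quantization may differ in detail, so treat `residual_quantize` and the codebook layout as assumptions.

```python
import torch

def residual_quantize(z: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Quantize embeddings z of shape (batch, d) with a list of (K, d)
    codebooks: each level picks the nearest codeword for the current
    residual, yielding one discrete token per level."""
    ids, residual = [], z
    for cb in codebooks:
        dist = torch.cdist(residual, cb)   # (batch, K) pairwise distances
        idx = dist.argmin(dim=-1)          # nearest codeword per item
        ids.append(idx)
        residual = residual - cb[idx]      # quantize what remains
    return torch.stack(ids, dim=-1)        # (batch, levels) semantic ID
```

A multi-level code keeps the ID space compact (L codebooks of K entries can address K^L items with only L*K codewords) while shared prefixes group semantically similar items, which is what makes generative retrieval over semantic IDs feasible.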
🔎 Similar Papers
No similar papers found.
Yi Xu
Alibaba Group
Moyu Zhang
Beijing University of Posts and Telecommunications, Alibaba Group
Knowledge Tracing, Information Retrieval, Recommender System
Chenxuan Li
Peking University
Zhihao Liao
Beijing University of Aeronautics and Astronautics
Haibo Xing
Alibaba Group
Hao Deng
Engineer
Recommendation System
Jinxin Hu
Alibaba
Yu Zhang
Alibaba Group
Xiaoyi Zeng
Alibaba Group
Jing Zhang
Wuhan University, School of Computer Science