🤖 AI Summary
Traditional recommender systems rely on item IDs, which generalize poorly to large, dynamic item catalogs and long-tail scenarios. While existing semantic-ID methods leverage multimodal content to improve recommendations for novel or rare items, they suffer from two key challenges: (1) an imbalance between cross-modal collaboration and modality-specific representation, and (2) misalignment between semantic representations and user behavioral preferences. To address these, we propose MMQ, a unified multimodal quantization framework. MMQ employs a shared-specific multi-expert tokenizer to jointly model cross-modal synergy and modality uniqueness, enforces modality disentanglement via orthogonal regularization, and bridges the semantic-behavioral gap through a multimodal reconstruction loss and behavior-aware fine-tuning. The framework natively supports both generative retrieval and discriminative ranking. Extensive offline and online experiments demonstrate that MMQ significantly improves recommendation performance for both trending and long-tail items, with strong scalability, generalization, and task adaptability.
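To make the shared-specific design concrete, here is a minimal NumPy sketch of a multi-expert layer with an orthogonality penalty. All dimensions, expert shapes, and the regularizer's exact form are illustrative assumptions; the summary only states that shared and modality-specific experts are disentangled via orthogonal regularization, not how it is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the actual MMQ architecture is not specified here.
d_in, d_out = 8, 4

# One modality-shared expert plus one expert per modality (text, image).
W_shared = rng.normal(size=(d_in, d_out))
W_text = rng.normal(size=(d_in, d_out))
W_image = rng.normal(size=(d_in, d_out))

def orthogonal_penalty(W_a, W_b):
    """Squared Frobenius norm of W_a^T W_b: zero exactly when the two
    experts' column spaces are orthogonal, pushing shared and specific
    experts toward encoding non-overlapping information."""
    return float(np.sum((W_a.T @ W_b) ** 2))

def tokenizer_features(x_text, x_image):
    """Concatenate shared and modality-specific expert outputs."""
    shared = (x_text + x_image) @ W_shared   # cross-modal synergy
    spec_t = x_text @ W_text                 # text-only signal
    spec_i = x_image @ W_image               # image-only signal
    return np.concatenate([shared, spec_t, spec_i])

# Regularizer added to the training loss (weighting is an assumption):
reg = (orthogonal_penalty(W_shared, W_text)
       + orthogonal_penalty(W_shared, W_image))
```

In a trained tokenizer this penalty would be minimized jointly with the reconstruction and quantization objectives, so the shared expert cannot simply duplicate what a modality-specific expert already captures.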
📝 Abstract
Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to disentangle them so that both shared and modality-specific information are captured. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
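The abstract centers on semantic IDs, i.e., discrete codes assigned to items by quantizing their multimodal embeddings. The sketch below illustrates one common way such IDs are produced, residual quantization against a stack of codebooks; the number of levels, codebook size, and dimensions are assumptions for illustration, not MMQ's published configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; MMQ's actual codebook configuration is not given here.
n_levels, codebook_size, dim = 3, 16, 4
codebooks = rng.normal(size=(n_levels, codebook_size, dim))

def semantic_id(embedding):
    """Residual quantization: at each level pick the nearest code word,
    subtract it from the running residual, and quantize the remainder.
    The resulting tuple of code indices is the item's semantic ID."""
    residual = embedding.copy()
    ids = []
    for level in range(n_levels):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        ids.append(idx)
        residual = residual - codebooks[level][idx]
    return tuple(ids)

item_embedding = rng.normal(size=dim)  # e.g., fused text+image features
sid = semantic_id(item_embedding)
```

Because similar embeddings map to overlapping code tuples, new or long-tail items inherit codes shared with well-observed items, which is the knowledge-transfer property the abstract attributes to semantic IDs.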