MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in e-commerce multimodal product understanding—modality imbalance, weak cross-modal alignment, and poor noise robustness—this paper proposes a dynamic modality balancing mechanism, a dual-level alignment framework, and a vision-language collaborative enhancement strategy. Specifically, it introduces a modality-driven Mixture-of-Experts (MoE) architecture for representation-level dynamic weight allocation; designs fine-grained and coarse-grained dual-level alignment to capture intrinsic structural consistency between visual and textual modalities; and integrates dynamic sample filtering with multimodal large language model (MLLM)-guided vision-language mutual enhancement to improve robustness against noisy data. Evaluated on the proprietary benchmark MBE2.0 and multiple public datasets, the method achieves state-of-the-art zero-shot transfer performance. Visualization analyses further confirm significant improvements in cross-modal semantic consistency and alignment accuracy.
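The modality-driven gating idea can be sketched as follows: a sample's expert weights are conditioned on its modality composition (image-only, text-only, or joint). This is an illustrative toy, not the paper's implementation; the three-expert layout, the one-hot modality prior, and `gate_w` are all assumptions.

```python
import numpy as np

def modality_gate(has_image: bool, has_text: bool, features: np.ndarray,
                  gate_w: np.ndarray) -> np.ndarray:
    """Toy modality-driven gate: produce softmax weights over experts,
    biased by the sample's modality composition.

    `gate_w` has shape (n_experts, feat_dim); experts are assumed to be
    ordered [image-only, text-only, joint] — a hypothetical convention.
    """
    logits = gate_w @ features                     # per-expert affinity from features
    # one-hot prior nudging the expert that matches the modality mix
    prior = np.array([float(has_image and not has_text),
                      float(has_text and not has_image),
                      float(has_image and has_text)])
    logits = logits + prior
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                         # normalized gating weights

# a text-only sample leans toward the text expert (index 1)
rng = np.random.default_rng(0)
w = modality_gate(False, True, rng.normal(size=8), np.zeros((3, 8)))
```

With a zero gate matrix the routing reduces to the modality prior alone, which makes the bias toward the matching expert easy to see.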

📝 Abstract
The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.
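The Dynamic Sample Filtering component can be illustrated with a minimal sketch: score each image-text pair by cosine similarity and drop pairs that fall below a batch-adaptive threshold. The quantile rule and every name here are hypothetical stand-ins; the paper's actual filtering criterion may differ.

```python
import numpy as np

def dynamic_filter(img: np.ndarray, txt: np.ndarray, quantile: float = 0.2):
    """Keep image-text pairs whose cosine similarity clears a
    batch-adaptive threshold (a quantile of the batch's scores).

    Illustrative sketch only — the quantile threshold is an assumption,
    not the paper's stated rule.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sims = np.sum(img * txt, axis=1)          # per-pair alignment score
    thresh = np.quantile(sims, quantile)      # threshold adapts to each batch
    keep = sims >= thresh
    return keep, thresh

rng = np.random.default_rng(0)
txt = rng.normal(size=(10, 8))
img = txt + 0.1 * rng.normal(size=(10, 8))    # mostly well-aligned pairs
img[0] = rng.normal(size=8)                   # one noisy, mismatched pair
keep, thresh = dynamic_filter(img, txt)
```

Because the threshold is a batch quantile rather than a fixed constant, the filter tightens or relaxes with the overall quality of each batch.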
Problem

Research questions and friction points this paper is trying to address.

Addresses modality imbalance in multimodal e-commerce product understanding models
Improves utilization of intrinsic visual-textual alignment within product data
Strengthens robustness to noise in e-commerce multimodal training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-driven MoE adaptively processes input samples
Dual-level Alignment leverages semantic alignment properties
MLLM-based Image-text Co-augmentation with Dynamic Sample Filtering
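The coarse-grained half of the Dual-level Alignment can be sketched as a standard symmetric InfoNCE loss over global image and text embeddings; the fine-grained (within-product) half and the paper's exact formulation are not reproduced here, and the temperature value is an assumption.

```python
import numpy as np

def info_nce(img: np.ndarray, txt: np.ndarray, tau: float = 0.07) -> float:
    """Coarse-grained alignment: symmetric InfoNCE over L2-normalized
    global image/text embeddings (batch x dim). Matched pairs sit on
    the diagonal of the similarity matrix."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T / tau                       # pairwise similarities
    def xent(s):
        s = s - s.max(axis=1, keepdims=True)      # stabilized log-softmax
        logp = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))            # NLL of the matched pair
    return 0.5 * (xent(sim) + xent(sim.T))        # image->text + text->image

rng = np.random.default_rng(0)
e = rng.normal(size=(4, 16))
loss_aligned = info_nce(e, e)                     # identical embeddings
loss_random = info_nce(e, rng.normal(size=(4, 16)))
```

Perfectly aligned embeddings drive the loss toward zero, while unrelated pairs keep it high, which is the contrastive pressure the coarse level supplies.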
Z
Zhanheng Nie
Alibaba Group, Hangzhou, China
C
Chenghan Fu
Alibaba Group, Hangzhou, China
D
Daoze Zhang
Alibaba Group, Hangzhou, China
J
Junxian Wu
Peking Union Medical College Hospital; Southeast University
W
Wanxian Guan
Alibaba Group, Hangzhou, China
P
Pengjie Wang
Alibaba Group, Hangzhou, China
J
Jian Xu
Alibaba Group, Hangzhou, China
B
Bo Zheng
Alibaba Group, Hangzhou, China