Valley2: Exploring Multimodal Models with Scalable Vision-Language Design

📅 2025-01-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low efficiency and poor generalization of multimodal understanding in e-commerce and short-video scenarios, this paper proposes a parameter-efficient, scalable vision-language large model architecture. The method introduces three key innovations: (1) a lightweight, high-efficiency visual encoder; (2) a dynamic cross-modal alignment mechanism enabling fine-grained image-text interaction; and (3) a domain-adaptive training strategy tailored to real-world applications. On an e-commerce multimodal benchmark, the model achieves a state-of-the-art score of 79.66, compared with 72.76 for open-source models of similar size. On the OpenCompass leaderboard, it ranks second among models with fewer than 10 billion parameters, with an average score of 67.4. The source code and pre-trained weights are publicly released at https://github.com/bytedance/Valley.

📝 Abstract
Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.

Problem

Research questions and friction points this paper is trying to address.

Multimodal Modeling
Efficiency Improvement
E-commerce and Short Video Applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Valley2
Multimodal Large Model
E-commerce and Short Video Applications
Ziheng Wu
ByteDance
Computer Vision
Zhenghao Chen
ByteDance
Ruipu Luo
ByteDance
Natural Language Processing
Can Zhang
ByteDance
Yuan Gao
ByteDance
Zhentao He
ByteDance
Xian Wang
ByteDance
Haoran Lin
ByteDance
Minghui Qiu
Alibaba Group
Deep Learning, Transfer Learning, Chatbots, NLP, Artificial Intelligence