jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing CLIP models suffer from monolingual (English-only) training, limited textual understanding due to unimodal text encoding, and inadequate modeling of rich visual documents. To address these limitations, we propose a multilingual multimodal unified embedding model, introducing a novel multi-stage, multi-task contrastive learning paradigm that jointly optimizes text pairs/triplets and image–text pairs. Our model integrates a text encoder supporting 29 languages, incorporates rich visual document image augmentation, and enforces fine-grained cross-modal alignment. Furthermore, it supports configurable embedding dimensions to accommodate diverse granularity requirements. Extensive experiments demonstrate state-of-the-art performance across zero-shot pure-text retrieval, semantic textual similarity, and cross-modal retrieval tasks—surpassing prior CLIP variants in both English and multilingual settings. The code and pretrained models are publicly available on Hugging Face.
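For intuition, the joint objective can be sketched as a pair of symmetric InfoNCE losses, one over image-text pairs and one over text pairs, summed into a single training step. This is a minimal sketch of the general technique, assuming a simple sum and a fixed temperature; it is not the paper's exact loss, and triplet handling is omitted.

```python
# Sketch of a multi-task contrastive step: symmetric InfoNCE over image-text pairs
# plus an analogous term over text pairs. The temperature and the equal weighting of
# the two terms are assumptions, not the paper's reported hyperparameters.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE: row i of `anchors` is the positive match for row i of `positives`."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(anchors.size(0), device=anchors.device)  # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_task_loss(caption_emb, image_emb, query_emb, passage_emb):
    """One joint step: cross-modal alignment plus text-only alignment."""
    return info_nce(caption_emb, image_emb) + info_nce(query_emb, passage_emb)
```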

📝 Abstract
Contrastive Language-Image Pretraining (CLIP) has been widely used for cross-modal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for cross-modal vision-language tasks and underperform on text-only tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit an insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets, and image-text pairs via a multi-task, multi-stage contrastive learning paradigm in order to support both text-only and cross-modal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and cross-modal retrieval tasks in both English and multilingual settings. jina-clip-v2 also offers flexible embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.
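Because the checkpoint is public, a minimal usage sketch is possible. It assumes the Hugging Face model code (loaded with trust_remote_code) exposes encode_text / encode_image helpers and a truncate_dim argument, following the conventions of Jina's embedding model cards; the file name bicycle.jpg is a placeholder.

```python
# Hedged usage sketch: load jina-clip-v2 from Hugging Face and embed multilingual text
# and an image into the shared space, optionally truncating the output dimension.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = ["A photo of a red bicycle", "Ein Foto eines roten Fahrrads"]  # English + German query
text_emb = model.encode_text(texts, truncate_dim=512)                  # assumed helper and argument
image_emb = model.encode_image(["bicycle.jpg"], truncate_dim=512)      # assumed helper and argument

# Cosine similarity between the first text and the image embedding
sim = np.dot(text_emb[0], image_emb[0]) / (np.linalg.norm(text_emb[0]) * np.linalg.norm(image_emb[0]))
print(f"cross-modal similarity: {sim:.3f}")
```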
Problem

Research questions and friction points this paper is trying to address.

Existing CLIP models underperform on text-only retrieval and semantic similarity tasks
Training data is predominantly English, limiting multilingual understanding
Prior CLIP variants model visually rich document images poorly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual text encoder supporting 29 languages
Multi-stage, multi-task contrastive learning on text pairs, triplets, and image-text pairs
Flexible embedding dimensionality for varied granularity (see the sketch after this list)
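Below is a minimal sketch of how flexible dimensionality is typically consumed downstream: keep the leading components of a Matryoshka-style embedding and re-normalize before cosine comparison. The Matryoshka convention and the 1024-to-256 reduction are assumptions for illustration, not documented defaults.

```python
# Sketch: shrink an embedding to its leading `dim` components and restore unit norm,
# so cosine similarity remains meaningful at the reduced size.
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(1024).astype(np.float32)  # stand-in for a full-size embedding
small = truncate_embedding(full, 256)            # compact vector for cheaper indexing
```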
👥 Authors
Andreas Koukounas
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Georgios Mastrapas
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Bo Wang
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Mohammad Kalim Akram
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Sedigheh Eslami
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Michael Günther
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Isabelle Mohr
Machine Learning Engineer, Jina AI
NLP, computer vision, computational linguistics
Saba Sturua
ML Research Engineer
Natural Language Processing, Machine Learning
Scott Martens
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Nan Wang
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany
Han Xiao
Jina AI GmbH, Prinzessinnenstr. 19-20, 10969 Berlin, Germany