🤖 AI Summary
Existing CLIP models suffer from monolingual (English-only) training, limited textual understanding due to unimodal text encoding, and inadequate modeling of rich visual documents. To address these limitations, we propose a multilingual multimodal unified embedding model, introducing a novel multi-stage, multi-task contrastive learning paradigm that jointly optimizes text pairs/triplets and image–text pairs. Our model integrates a text encoder supporting 29 languages, incorporates rich visual document image augmentation, and enforces fine-grained cross-modal alignment. Furthermore, it supports configurable embedding dimensions to accommodate diverse granularity requirements. Extensive experiments demonstrate state-of-the-art performance across zero-shot pure-text retrieval, semantic textual similarity, and cross-modal retrieval tasks—surpassing prior CLIP variants in both English and multilingual settings. The code and pretrained models are publicly available on Hugging Face.
📝 Abstract
Contrastive Language-Image Pretraining (CLIP) has been widely used for cross-modal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for cross-modal vision-language tasks and underperform on text-only tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets, and image–text pairs via a multi-task, multi-stage contrastive learning paradigm designed to support both text-only and cross-modal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and cross-modal retrieval tasks in both English and multilingual settings. jina-clip-v2 also offers flexible embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.
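The flexible embedding dimensionality mentioned above can be sketched as follows. This is a hedged illustration, not the model's actual API: embeddings trained for configurable dimensions (e.g., Matryoshka-style) are typically consumed by keeping the first `d` components of the full vector and re-normalizing. The function name `truncate_embedding` and the 1024-dimension stand-in vector are assumptions for illustration only.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and L2-normalize the result.

    Illustrative sketch of dimension-configurable embeddings; the real
    jina-clip-v2 interface may expose this differently.
    """
    truncated = vec[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Usage with a random stand-in for a full-size embedding (illustration only).
rng = np.random.default_rng(0)
full_embedding = rng.normal(size=1024)
compact = truncate_embedding(full_embedding, 256)
print(compact.shape)  # (256,)
```

Re-normalizing after truncation keeps cosine similarity well-behaved, so the shorter vectors remain directly comparable at a lower storage and compute cost.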