🤖 AI Summary
Achieving low-cost, high-quality multilingual text-to-image (T2I) generation remains challenging due to the scarcity of high-quality multilingual image-text pairs and the prohibitive cost of training large diffusion models from scratch.
Method: We propose MuLan, a lightweight multilingual adapter (20M parameters) that reuses a frozen English diffusion model (Stable Diffusion) and an open-source multilingual text encoder—without requiring high-quality multilingual supervision.
Contribution/Results: This work is the first to empirically validate the effectiveness of noise-robust, web-scale pre-trained multilingual text encoders for T2I. MuLan bridges modality gaps across 110+ languages, achieving a CLIP similarity of 37.61 (vs. 38.61 for English) while reducing training costs by over 90%. The method is fully compatible with mainstream controllable-generation extensions, including LoRA, LCM, ControlNet, and IP-Adapter, combining strong performance, computational efficiency, and ease of deployment.
📝 Abstract
In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and a frozen image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency: using readily accessible English data and off-the-shelf multilingual text encoders minimizes training cost; (2) High performance: it achieves comparable generation quality in over 110 languages, with CLIP similarity scores nearly matching those in English (38.61 for English vs. 37.61 for other languages); and (3) Broad applicability: it integrates seamlessly with community tools such as LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
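To make the architecture concrete, here is a minimal sketch of how such a language adapter could sit between a frozen multilingual text encoder and a frozen English diffusion model. All names, dimensions, and the two-layer MLP design are illustrative assumptions, not the paper's exact architecture; the point is that only the small adapter (well under 20M parameters) would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

MULTI_DIM = 768  # assumed hidden size of the frozen multilingual text encoder
CLIP_DIM = 768   # text-conditioning size of Stable Diffusion v1.x (CLIP ViT-L/14)
HIDDEN = 2048    # assumed adapter width, chosen to stay well under 20M parameters


class LanguageAdapter:
    """Hypothetical two-layer MLP adapter: maps multilingual token embeddings
    into the embedding space the frozen English diffusion model expects.
    This is the only component whose parameters would be updated in training."""

    def __init__(self):
        self.w1 = rng.normal(0.0, 0.02, (MULTI_DIM, HIDDEN))
        self.b1 = np.zeros(HIDDEN)
        self.w2 = rng.normal(0.0, 0.02, (HIDDEN, CLIP_DIM))
        self.b2 = np.zeros(CLIP_DIM)

    def __call__(self, tokens):
        # tokens: (seq_len, MULTI_DIM) output of the frozen multilingual encoder
        h = np.maximum(tokens @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return h @ self.w2 + self.b2                     # (seq_len, CLIP_DIM)

    def num_params(self):
        return sum(p.size for p in (self.w1, self.b1, self.w2, self.b2))


adapter = LanguageAdapter()
multilingual_emb = rng.normal(size=(77, MULTI_DIM))  # stand-in encoder output
conditioning = adapter(multilingual_emb)             # fed to the frozen U-Net
print(conditioning.shape, adapter.num_params())
```

Because both the encoder and the diffusion model stay frozen, training reduces to fitting this small projection on (mostly English) image-text data, which is what makes the approach cheap relative to retraining a diffusion model from scratch.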