DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models

📅 2024-08-08
🏛️ arXiv.org
📈 Citations: 10
Influential: 0
🤖 AI Summary
To address the inefficiency and poor scalability of training multimodal large language models (MLLMs), caused by model heterogeneity (divergent cross-modal parameter counts and computational demands) and data heterogeneity (uneven sequence lengths and sampling frequencies across modalities), this paper proposes DistTrain, a disaggregated training framework. Its core contribution is combining disaggregated model orchestration, which adapts resource allocation to heterogeneous per-modality compute loads, with disaggregated data reordering, which dynamically balances cross-modal data distributions. DistTrain further overlaps GPU communication with computation and keeps its system components lightweight. Evaluated on a large-scale production cluster, it achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B MLLM on 1172 GPUs and delivers up to 2.2× higher throughput than Megatron-LM. The results demonstrate DistTrain's effectiveness, strong scalability, and engineering efficiency.

📝 Abstract
Multimodal large language models (LLMs) have demonstrated significant potential in a wide range of AI applications. Yet, training multimodal LLMs suffers from low efficiency and scalability, due to the inherent model heterogeneity and data heterogeneity across different modalities. We present DistTrain, an efficient and adaptive framework to reform the training of multimodal large language models on large-scale clusters. The core of DistTrain is the disaggregated training technique that exploits the characteristics of multimodal LLM training to achieve high efficiency and scalability. Specifically, it leverages disaggregated model orchestration and disaggregated data reordering to address model and data heterogeneity respectively. We also tailor system optimization for multimodal LLM training to overlap GPU communication and computation. We evaluate DistTrain across different sizes of multimodal LLMs on a large-scale production cluster with thousands of GPUs. The experimental results show that DistTrain achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM on 1172 GPUs and outperforms Megatron-LM by up to 2.2× on throughput. The ablation study shows the main techniques of DistTrain are both effective and lightweight.
Problem

Research questions and friction points this paper is trying to address.

Addressing model heterogeneity in multimodal LLM training
Resolving data heterogeneity challenges in multimodal systems
Eliminating resource contention between preprocessing and training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated model orchestration for adaptive resource allocation
Disaggregated data preprocessing to eliminate resource contention
Efficient data reordering to mitigate stragglers in training
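The straggler problem the last bullet targets arises because multimodal samples have very uneven sequence lengths, so some workers finish a step long before others. The sketch below illustrates the general idea with a standard greedy longest-first bin-packing heuristic that rebalances samples across workers by total sequence length; it is a minimal illustration under assumed names (`reorder_balanced`, `seq_lens`), not DistTrain's actual reordering algorithm.

```python
import heapq

def reorder_balanced(seq_lens, num_workers):
    """Assign sample indices to workers so per-worker total sequence
    length is roughly equal (greedy longest-processing-time heuristic).

    Illustrative sketch only; DistTrain's disaggregated data reordering
    is more involved (it also accounts for cross-modal sampling rates).
    """
    # Min-heap of (current_load, worker_id, assigned_indices).
    # worker_id breaks ties so the lists are never compared.
    heap = [(0, w, []) for w in range(num_workers)]
    heapq.heapify(heap)
    # Place the longest samples first, each on the least-loaded worker.
    for idx in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        load, w, assigned = heapq.heappop(heap)
        assigned.append(idx)
        heapq.heappush(heap, (load + seq_lens[idx], w, assigned))
    # Return assignments ordered by worker id.
    return [assigned for _, _, assigned in sorted(heap, key=lambda t: t[1])]

if __name__ == "__main__":
    lens = [9, 7, 6, 5, 4, 3, 2, 1]
    buckets = reorder_balanced(lens, 2)
    loads = [sum(lens[i] for i in b) for b in buckets]
    print(buckets, loads)  # loads differ by at most 1 here
```

Without such reordering, a naive round-robin split of the same samples can leave one worker with far more tokens than another, and the whole synchronous step waits on the slowest worker.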