Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

📅 2024-10-10

🏛️ arXiv.org

📈 Citations: 6

✨ Influential: 0

🤖 AI Summary

Problem: Existing data selection methods for large language model (LLM) pretraining pursue conflicting objectives and interoperate poorly, which makes them hard to combine. Method: This paper proposes a multi-agent collaborative data selection framework that models mainstream data selection strategies as heterogeneous agents and uses a dynamic agent scheduling mechanism, driven by online training feedback, to fuse their evaluations in real time and adaptively adjust their weights. Contribution/Results: To our knowledge, this is the first framework enabling end-to-end differentiable and scalable integration of heterogeneous selection strategies, overcoming the trade-offs between performance and generalization imposed by single-method approaches. Evaluated across multiple LLM benchmarks, the framework achieves an average performance gain of up to 10.5% over state-of-the-art methods, accelerates convergence, and improves data utilization efficiency under fixed computational budgets.

📝 Abstract

Efficient data selection is crucial to accelerate the pretraining of large language models (LLMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LLM pretraining. To tackle this problem, we propose a novel multi-agent collaborative data selection mechanism. In this framework, each data selection method serves as an independent agent, and an agent console is designed to dynamically integrate the information from all agents throughout the LLM training process. We conduct extensive empirical studies to evaluate our multi-agent framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LLM training, and achieves an average performance gain of up to 10.5% across multiple language model benchmarks compared to state-of-the-art methods.

Problem

Research questions and friction points this paper is trying to address.

Addresses conflicts among data selection methods for LLM pretraining

Proposes multi-agent collaboration to optimize data selection

Improves data efficiency and accelerates LLM pretraining convergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent collaborative data selection mechanism

Dynamic integration of information from all agents throughout training

Agent console dynamically adjusts each agent's influence
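
The console idea above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `AgentConsole` class, the two toy agents, the learning rate `lr`, and the exponentiated weight update are all assumptions chosen for illustration, since the summary does not specify how the console combines or reweights agents.

```python
import math
from typing import Callable, Dict, List

# Hypothetical agent type: maps a document to a score in [0, 1].
Agent = Callable[[str], float]


def length_agent(doc: str) -> float:
    """Toy agent (assumption): prefers documents with at least 10 words."""
    return min(len(doc.split()) / 10.0, 1.0)


def diversity_agent(doc: str) -> float:
    """Toy agent (assumption): fraction of distinct words in the document."""
    words = doc.split()
    return len(set(words)) / len(words) if words else 0.0


class AgentConsole:
    """Illustrative console: weighted fusion of agent scores plus reweighting."""

    def __init__(self, agents: Dict[str, Agent], lr: float = 0.5) -> None:
        self.agents = agents
        self.lr = lr
        # Start with uniform weights over all agents.
        self.weights = {name: 1.0 / len(agents) for name in agents}

    def score(self, doc: str) -> float:
        # Fuse agent opinions as a weighted sum.
        return sum(self.weights[n] * agent(doc) for n, agent in self.agents.items())

    def select(self, docs: List[str], k: int) -> List[str]:
        # Keep the k documents with the highest fused score.
        return sorted(docs, key=self.score, reverse=True)[:k]

    def update(self, feedback: Dict[str, float]) -> None:
        # Assumed rule: multiply each weight by exp(lr * feedback), then renormalize.
        # Positive feedback raises an agent's influence, negative feedback lowers it.
        raw = {n: w * math.exp(self.lr * feedback.get(n, 0.0))
               for n, w in self.weights.items()}
        total = sum(raw.values())
        self.weights = {n: v / total for n, v in raw.items()}


docs = [
    "a a a a a a a a a a",
    "the cat sat",
    "one two three four five six seven eight nine ten",
]
console = AgentConsole({"length": length_agent, "diversity": diversity_agent})
first_pick = console.select(docs, k=2)

# Pretend training feedback favored the length agent (made-up values).
console.update({"length": 1.0, "diversity": -1.0})
second_pick = console.select(docs, k=2)
```

In this sketch the feedback values are invented. In a real pipeline they would have to come from some signal measured during training, and the choice of that signal is exactly the part the summary leaves unspecified.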
