AI Summary
This work addresses the challenge of limited labelled data in low-resource multilingual settings, which severely hampers the performance of small classification models. The authors propose a paradigm that uses large multilingual language models as "teachers" to generate high-quality synthetic data through instruction tuning and in-context learning, enabling cross-lingual knowledge distillation into lightweight student models. Experimental results show that student models trained on only a small amount of such synthesised data consistently outperform the original large language model across 11 languages and 4 text classification tasks, with particularly pronounced gains in low-resource languages. These findings support employing large language models as data generators rather than direct classifiers in resource-constrained multilingual scenarios.
Abstract
Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, making them promising tools in both high- and low-resource languages. One particularly valuable use case is generating synthetic samples that can be used to train smaller models in low-resource scenarios where human-labelled data is scarce. In this work, we investigate whether these synthetic data generation capabilities can serve as a form of distillation, producing smaller models that perform on par with or even better than massive LLMs across languages and tasks. To this end, we use a state-of-the-art multilingual LLM to generate synthetic datasets covering 11 languages and 4 classification tasks. These datasets are then used to train smaller models via fine-tuning or instruction tuning, or as synthetic in-context examples for compact LLMs. Our experiments show that even small amounts of synthetic data enable smaller models to outperform the large generator itself, particularly in low-resource languages. Overall, the results suggest that LLMs are best utilised as generators (teachers) rather than classifiers, producing data that empowers smaller and more efficient multilingual models.
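The teacher-to-student pipeline described in the abstract can be sketched in a few steps: prompt a large LLM to synthesise labelled examples, then train a lightweight classifier on only that synthetic data. The sketch below is illustrative and makes several assumptions not in the paper: `teacher_generate` is a stand-in for a real multilingual LLM call (in practice an inference API with a task- and language-specific prompt), the canned templates replace actual generated text, and the student is a toy naive-Bayes-style bag-of-words classifier rather than a fine-tuned small model.

```python
import random
from collections import Counter, defaultdict

def teacher_generate(task, language, label, n):
    """Placeholder for prompting a large multilingual LLM to produce
    n synthetic labelled examples for the given task, language, and label.
    Returns canned English text so the sketch runs self-contained."""
    templates = {
        ("sentiment", "positive"): ["great product", "loved it", "works well"],
        ("sentiment", "negative"): ["terrible quality", "hated it", "broke fast"],
    }
    pool = templates[(task, label)]
    return [(random.choice(pool), label) for _ in range(n)]

# 1. Generate a small synthetic dataset per target language.
random.seed(0)
synthetic = []
for lang in ["sw", "te"]:  # illustrative low-resource language codes
    for label in ["positive", "negative"]:
        synthetic += teacher_generate("sentiment", lang, label, 20)

# 2. "Distil" into a small student: count word/label co-occurrences
#    over the synthetic data only (no human-labelled data involved).
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in synthetic:
    label_counts[label] += 1
    word_counts[label].update(text.split())

def student_predict(text):
    """Score each label with add-one-smoothed word likelihoods."""
    def score(label):
        total = sum(word_counts[label].values())
        s = float(label_counts[label])
        for w in text.split():
            s *= (word_counts[label][w] + 1) / (total + 1)
        return s
    return max(label_counts, key=score)

print(student_predict("loved this great product"))  # → positive
```

The design choice mirrors the paper's claim: the expensive model is called only at dataset-construction time, so inference cost is paid by the compact student, which can then be deployed per language.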