🤖 AI Summary
This work addresses the weak multilingual ability of small draft models, which makes speculative decoding much less effective when generating non-English text. The authors compare three strategies for improving efficiency: finetuning the draft model on task-specific (translation) data, finetuning it on unlabeled monolingual corpora, and training a lightweight n-gram draft model on the same corpora. They evaluate these strategies on translation and story generation across 11 languages. Task-specific distillation can significantly improve efficiency on the task it was trained for, but it generalizes poorly to the held-out story generation task. In contrast, the n-gram model, despite its lower acceptance rate, delivers consistent and large speedups across all evaluated languages because drafting with it is much faster, which underscores its practical utility for multilingual speculative decoding.
📝 Abstract
Speculative decoding has become a crucial component of large language model (LLM) inference, enabling faster generation by drafting multiple tokens and verifying them in parallel. However, small draft models tend to suffer from disproportionately poor multilingual capabilities. Thus, when generating text in a non-English language, speculative decoding is far less effective.
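As a rough illustration of the draft-and-verify loop described above, here is a minimal Python sketch. It assumes greedy decoding, and it stands in toy next-token functions (`toy_target`, `toy_draft`) for real models; these names, the `gamma` parameter, and the sequential verification loop are illustrative assumptions, not the setup used in this work. A real system would verify all drafted positions in a single batched forward pass of the target model.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token context to the next token id.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], gamma: int) -> List[int]:
    """Run one draft-and-verify step and return the newly accepted tokens."""
    # 1) Draft gamma tokens autoregressively with the cheap model.
    drafted: List[int] = []
    for _ in range(gamma):
        drafted.append(draft(context + drafted))

    # 2) Verify: compare each drafted token with the target's greedy choice.
    #    (Sequential here for clarity; parallel in a real implementation.)
    accepted: List[int] = []
    for tok in drafted:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft was accepted, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: Sequence[int],
             max_new_tokens: int, gamma: int = 4) -> List[int]:
    """Generate with repeated speculative steps, truncated to max_new_tokens."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, gamma))
    return out[: len(prompt) + max_new_tokens]


# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees except right after token 3.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new_tokens=12))
```

Under greedy decoding the output is identical to what the target model would produce alone; the draft model only changes how many target passes are needed. The more often the draft disagrees with the target, as a weak multilingual draft model would, the fewer tokens each step accepts.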
We compare three strategies to improve speculative decoding efficiency for eleven languages: finetuning the draft model on task-specific data (translation); finetuning the draft model on unlabeled monolingual corpora; and training simple n-gram draft models on the same monolingual corpora. We evaluate efficiency on translation (from English into the target language) and the held-out task of story generation. We find that while task-specific distillation can significantly improve efficiency, distilled models generalize poorly to a new task. Meanwhile, n-gram draft models, despite lower acceptance rates, consistently provide large speed-ups due to much faster draft generation.
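To make the trade-off between acceptance rate and drafting cost concrete, the sketch below pairs a tiny bigram drafter with a commonly used simplified cost model for speculative decoding. Several things here are assumptions rather than details from this work: the n-gram order (bigram), the `BigramDrafter` class and its `-1` fallback id, the assumption that each drafted token is accepted independently with probability `alpha`, and the numbers in the usage example, which are purely illustrative and not reported results.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence


class BigramDrafter:
    """Hypothetical n-gram draft model (n = 2) trained on token-id sequences."""

    def __init__(self) -> None:
        # counts[prev][nxt] = how often token nxt followed token prev.
        self.counts: Dict[int, Counter] = defaultdict(Counter)

    def fit(self, sequences: Iterable[Sequence[int]]) -> "BigramDrafter":
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1
        return self

    def __call__(self, context: Sequence[int]) -> int:
        # Drafting is a table lookup, so its cost is close to zero
        # compared with a neural forward pass.
        followers = self.counts.get(context[-1])
        if not followers:
            return -1  # assumed placeholder id for "no prediction"
        return followers.most_common(1)[0][0]


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-time speedup under a simplified cost model.

    alpha:      probability that a drafted token is accepted (assumed i.i.d.)
    gamma:      number of tokens drafted per verification step
    cost_ratio: time of one draft step divided by time of one target step
    """
    if alpha >= 1.0:
        tokens_per_step = gamma + 1.0
    else:
        # Expected tokens produced per target pass, bonus token included.
        tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Each step costs gamma draft calls plus one target call.
    return tokens_per_step / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    drafter = BigramDrafter().fit([[1, 2, 3], [1, 2, 4], [1, 2, 3]])
    print(drafter([1]), drafter([0, 2]))

    # Illustrative values only: a neural drafter with higher acceptance but
    # noticeable cost versus a near-free drafter with lower acceptance.
    print(expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.2))
    print(expected_speedup(alpha=0.5, gamma=4, cost_ratio=0.0))
```

Under this model, a drafter with a lower acceptance rate can still match or exceed a more accurate one when its per-token cost is negligible, which is consistent with the qualitative finding stated above for n-gram draft models.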