🤖 AI Summary
To address low data quality, inefficient tokenization, and challenges in processing PB-scale corpora for Indian multilingual large language models (LLMs), this work establishes a high-quality, Indic-language-focused multilingual data infrastructure—spanning Common Crawl, Wikipedia, news, and books—and introduces a customized preprocessing pipeline featuring domain-aware cleaning, cross-source deduplication, and language-specific adaptation. We propose a novel, multi-stage tokenizer training framework that jointly optimizes subword statistics and linguistic features, enabling the first dedicated tokenizer training strategy for Indic languages. The resulting Indic tokenizer consistently outperforms Tiktoken on 3B- and 7B-scale models: it achieves a 23% average token compression gain and an 18% inference speedup on Indic-language tasks, significantly mitigating computational overhead caused by excessively long token sequences in Indic languages.
📝 Abstract
We present a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data, optimizing tokenization for our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.