Krutrim LLM: A Novel Tokenization Strategy for Multilingual Indic Languages with Petabyte-Scale Data Processing

📅 2024-07-17

📈 Citations: 1

✨ Influential: 0

career value

153K/year

🤖 AI Summary

To address low data quality, inefficient tokenization, and challenges in processing PB-scale corpora for Indian multilingual large language models (LLMs), this work establishes a high-quality, Indic-language-focused multilingual data infrastructure—spanning Common Crawl, Wikipedia, news, and books—and introduces a customized preprocessing pipeline featuring domain-aware cleaning, cross-source deduplication, and language-specific adaptation. We propose a novel, multi-stage tokenizer training framework that jointly optimizes subword statistics and linguistic features, enabling the first dedicated tokenizer training strategy for Indic languages. The resulting Indic tokenizer consistently outperforms Tiktoken on 3B- and 7B-scale models: it achieves a 23% average token compression gain and an 18% inference speedup on Indic-language tasks, significantly mitigating computational overhead caused by excessively long token sequences in Indic languages.

Technology Category

Application Category

📝 Abstract

We present a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data, optimizing tokenization for our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.

Problem

Research questions and friction points this paper is trying to address.

Develops multilingual Indic LLM with optimized tokenization

Processes petabyte-scale data for diverse linguistic representation

Introduces custom tokenizer outperforming OpenAI Tiktoken

Innovation

Methods, ideas, or system contributions that make the work stand out.

Custom preprocessing pipeline for Indic languages

Deduplication of Common Crawl data

Novel multilingual tokenizer training strategy

🔎 Similar Papers

No similar papers found.