Krutrim LLM: A Novel Tokenization Strategy for Multilingual Indic Languages with Petabyte-Scale Data Processing

📅 2024-07-17
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
To address low data quality, inefficient tokenization, and challenges in processing PB-scale corpora for Indian multilingual large language models (LLMs), this work establishes a high-quality, Indic-language-focused multilingual data infrastructure—spanning Common Crawl, Wikipedia, news, and books—and introduces a customized preprocessing pipeline featuring domain-aware cleaning, cross-source deduplication, and language-specific adaptation. We propose a novel, multi-stage tokenizer training framework that jointly optimizes subword statistics and linguistic features, enabling the first dedicated tokenizer training strategy for Indic languages. The resulting Indic tokenizer consistently outperforms Tiktoken on 3B- and 7B-scale models: it achieves a 23% average token compression gain and an 18% inference speedup on Indic-language tasks, significantly mitigating computational overhead caused by excessively long token sequences in Indic languages.
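The summary's headline comparison rests on the token-to-word ratio (sometimes called tokenizer "fertility"): the average number of tokens a tokenizer emits per word. Below is a minimal sketch of that metric, assuming generic `encode` callables; the two toy tokenizers here are stand-ins, not the paper's Indic tokenizer or Tiktoken.

```python
# Token-to-word ratio ("fertility"): lower is better, since fewer tokens per
# word means shorter sequences and less compute at inference time.

def token_to_word_ratio(encode, text: str) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    return len(encode(text)) / len(words)

# Toy stand-in tokenizers: character-level splitting (poor compression) and
# whitespace splitting (ideal 1.0 ratio). Real subword tokenizers fall between.
char_encode = lambda t: [c for c in t if not c.isspace()]
word_encode = lambda t: t.split()

sample = "नमस्ते दुनिया hello world"
print(token_to_word_ratio(char_encode, sample))  # > 1.0
print(token_to_word_ratio(word_encode, sample))  # exactly 1.0
```

A tokenizer with poor Indic coverage behaves much like the character-level stand-in on Devanagari text, which is exactly the overhead the paper's 23% compression gain targets.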

📝 Abstract
We present a novel approach to data preparation for developing multilingual Indic large language models. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data and optimizing tokenization of our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating that our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.
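The abstract reports that roughly 70% of crawled web pages are redundant. A minimal sketch of exact deduplication via normalization plus content hashing is shown below; the paper does not specify its actual pipeline (which at web scale would likely add near-duplicate detection), so this illustrates only the basic idea.

```python
# Exact deduplication sketch: hash a normalized form of each document and
# keep only the first document per hash.
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially reformatted copies collide.
    return " ".join(text.lower().split())

def deduplicate(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

pages = ["Breaking  news today", "breaking news today", "Other article"]
print(deduplicate(pages))  # ['Breaking  news today', 'Other article']
```

Hashing normalized content keeps memory per document constant (one digest in a set), which is what makes a single pass over petabyte-scale shards tractable.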
Problem

Research questions and friction points this paper is trying to address.

Develops a multilingual Indic LLM with optimized tokenization
Processes petabyte-scale data for diverse linguistic representation
Introduces a custom tokenizer that outperforms OpenAI Tiktoken
Innovation

Methods, ideas, or system contributions that make the work stand out.

Custom preprocessing pipeline for Indic languages
Deduplication of Common Crawl data
Novel multilingual tokenizer training strategy
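One core ingredient of subword tokenizer training is the byte-pair-encoding (BPE) merge loop, which repeatedly merges the most frequent adjacent symbol pair in the corpus. The toy sketch below shows only that frequency-driven core; the paper's multi-stage strategy additionally folds in linguistic features, which are not modeled here.

```python
# Toy BPE training loop: find the most frequent adjacent symbol pair and
# merge it everywhere, repeating for a fixed number of merges.
from collections import Counter

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(corpus, pair):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in corpus.items()}

# Words as space-separated symbols with their corpus frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
for _ in range(3):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)
```

Each merge becomes a vocabulary entry; for Indic scripts, operating on Unicode code points (or bytes) lets frequent akshara-level clusters emerge as single tokens instead of being fragmented.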
Rahul Kumar
Shubham Kakde
Divyansh Rajput
Daud Ibrahim
Rishabh Nahata
Pidathala Sowjanya
Deepak Kumarr
Gautam Bhargava
Chandra Khatri
Ola Krutrim AI
Artificial Intelligence · Multi-Modal AI · Conversational AI · Deep Learning · Machine Learning