AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead of large language models (LLMs) in domain-specific applications—caused by autoregressive decoding—the paper proposes a lightweight, vocabulary-level domain adaptation method. It replaces generic tokens with domain-specific n-gram tokens, substantially shortening input and output sequence lengths. This work introduces the first end-to-end vocabulary reconstruction paradigm that requires no architectural modifications or tokenizer alterations, ensuring compatibility with arbitrary LLMs and tokenization schemes. The approach employs exponentially weighted embedding initialization and efficient fine-tuning on a single GPU. Experiments across three vertical domains using two 7B-parameter models demonstrate over 25% reduction in token count, significant inference latency improvement, and preservation of downstream task performance and generation quality.
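The n-gram token replacement described above can be sketched in a few lines. This is an illustrative toy (function names, the greedy longest-match strategy, and the toy vocabulary are assumptions for exposition, not the paper's actual implementation): frequent domain n-grams are mapped to single new tokens, so the sequence the model processes gets shorter.

```python
# Hypothetical sketch of domain n-gram token replacement. The greedy
# longest-match loop and the names here are illustrative assumptions,
# not the paper's actual algorithm or API.

def merge_ngrams(tokens, ngram_vocab, max_n=3):
    """Greedily replace runs of tokens that form a known domain n-gram."""
    merged = []
    i = 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):  # prefer the longest match
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in ngram_vocab:
                merged.append(ngram_vocab[candidate])  # one new token
                i += n
                break
        else:
            merged.append(tokens[i])  # no n-gram matched this position
            i += 1
    return merged

# Toy domain vocabulary: two frequent bigrams get dedicated tokens.
ngram_vocab = {("neural", "network"): "neural_network",
               ("gradient", "descent"): "gradient_descent"}

tokens = ["train", "the", "neural", "network", "with", "gradient", "descent"]
print(merge_ngrams(tokens, ngram_vocab))
# 7 tokens shrink to 5
```

In this toy run the 7-token input becomes 5 tokens, which is the kind of sequence-length saving the paper reports at scale (over 25% fewer tokens).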

📝 Abstract
Large Language Models (LLMs) have shown impressive versatility as general-purpose models. However, their broad applicability comes at a high computational cost, particularly in auto-regressive decoding, where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM computational costs in domain-specific settings
Adapting vocabulary to enhance efficiency in low-resource domains
Minimizing token usage without sacrificing model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight vocabulary adaptation for efficiency
Domain-specific n-gram-based token replacement
Single GPU lightweight fine-tuning phase