🤖 AI Summary
To address the high computational overhead of large language models (LLMs) in domain-specific applications, driven largely by autoregressive decoding, the paper proposes AdaptiVocab, a lightweight vocabulary-level domain adaptation method. It replaces generic tokens with domain-specific n-gram tokens, substantially shortening input and output sequences. The work presents an end-to-end vocabulary adaptation paradigm that requires no architectural changes and can be applied to any tokenizer, making it compatible with arbitrary LLMs and tokenization schemes. New n-gram token embeddings are initialized with an exponentially weighted combination of existing embeddings, followed by a lightweight fine-tuning phase that runs on a single GPU. Experiments across three vertical domains using two 7B-parameter models demonstrate a token-count reduction of over 25% and a corresponding improvement in inference latency, while preserving downstream task performance and generation quality.
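The efficiency gain comes from mapping frequent domain n-grams to single tokens. The sketch below illustrates this idea with a hypothetical `shorten_with_ngram_tokens` helper and toy token ids; it is a minimal greedy longest-match pass, not the paper's exact replacement algorithm.

```python
def shorten_with_ngram_tokens(token_ids, ngram_to_new_id):
    """Greedily replace known domain n-grams (tuples of token ids) with
    single new token ids. The longest match wins at each position."""
    out, i = [], 0
    max_n = max(len(ngram) for ngram in ngram_to_new_id)
    while i < len(token_ids):
        for n in range(max_n, 1, -1):  # try the longest n-gram first
            candidate = tuple(token_ids[i:i + n])
            if candidate in ngram_to_new_id:
                out.append(ngram_to_new_id[candidate])
                i += n
                break
        else:
            out.append(token_ids[i])  # no n-gram starts here; keep token
            i += 1
    return out

# Toy example: tokens 3 and 7 frequently co-occur in the domain, so the
# pair is mapped to a single new token id 100.
print(shorten_with_ngram_tokens([3, 7, 5, 3, 7], {(3, 7): 100}))
# -> [100, 5, 100]  (5 tokens shortened to 3)
```

Shorter sequences reduce work on both sides: fewer input tokens to process and fewer autoregressive decoding steps to generate.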
📝 Abstract
Large Language Models (LLMs) have shown impressive versatility as general-purpose models. However, their broad applicability comes at the cost of high computational overhead, particularly in autoregressive decoding, where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.
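To make the embedding-initialization step concrete, here is a minimal NumPy sketch of an exponentially weighted combination over a new token's constituent embeddings. The decay factor, the weighting direction (earlier vs. later constituents), and the normalization are illustrative assumptions; the abstract states only that the combination is exponentially weighted.

```python
import numpy as np

def init_ngram_embedding(embedding_matrix, constituent_ids, decay=0.5):
    """Initialize the embedding of a new n-gram token as an exponentially
    weighted, normalized combination of its constituent tokens' embeddings.

    `decay` and the weighting direction are assumptions for illustration;
    only the exponential weighting itself comes from the abstract.
    """
    weights = np.array([decay ** i for i in range(len(constituent_ids))])
    weights /= weights.sum()                  # normalize to sum to 1
    vecs = embedding_matrix[constituent_ids]  # (n, d) constituent rows
    return weights @ vecs                     # (d,) new embedding vector

# Toy usage: a vocabulary of 10 tokens with 4-dim embeddings; the new
# n-gram token merges tokens 2 and 5.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))
print(init_ngram_embedding(emb, [2, 5]))
```

Initializing near a sensible point in embedding space, rather than randomly, is what lets the subsequent fine-tuning phase stay lightweight enough for a single GPU.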