🤖 AI Summary
To address the limited representational capacity of token embedding layers and the computational cost of scaling language models by adding transformer layers, this paper proposes SCONE: a Scalable, Contextualized, Offloaded N-gram Embedding framework. SCONE augments the embedding layer with contextualized representations of frequent n-grams, enabling two independent scaling axes: the size of the cached n-gram embedding table and the size of the model used to learn those embeddings. Because the embeddings are precomputed and offloaded to off-accelerator (CPU) memory at inference, SCONE increases representational power without increasing inference-time computation. Experiments show that SCONE outperforms a 1.9B-parameter baseline across diverse benchmark corpora while using only half the inference-time FLOPS, improving efficiency and accuracy simultaneously.
📝 Abstract
We propose SCONE ($\textbf{S}$calable, $\textbf{C}$ontextualized, $\textbf{O}$ffloaded, $\textbf{N}$-gram $\textbf{E}$mbedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent $n$-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached $n$-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
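The inference-time mechanism described above (a fixed token embedding table plus a precomputed, off-accelerator cache of frequent $n$-gram embeddings combined per position) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the additive combination, and the reserved "no match" slot are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class SconeEmbeddingSketch:
    """Hypothetical sketch of a SCONE-style embedding lookup."""

    def __init__(self, vocab_size: int, num_ngrams: int, dim: int):
        # Standard token embedding table (would live on the accelerator).
        self.token_emb = rng.standard_normal((vocab_size, dim))
        # Precomputed n-gram embedding cache; at inference this table is
        # held in off-accelerator (e.g. CPU) memory and fetched by index,
        # so its size does not affect inference-time FLOPS.
        self.ngram_cache = rng.standard_normal((num_ngrams, dim))
        # Assumed convention: slot 0 means "no matching frequent n-gram".
        self.ngram_cache[0] = 0.0

    def lookup(self, token_ids: np.ndarray, ngram_ids: np.ndarray) -> np.ndarray:
        # Each position receives its token embedding plus the cached
        # contextualized embedding of the frequent n-gram ending there.
        return self.token_emb[token_ids] + self.ngram_cache[ngram_ids]

# Usage: a batch of 1 sequence with 3 tokens; positions 2 and 3 match
# cached n-grams (indices 4 and 7), position 1 has no match (index 0).
emb = SconeEmbeddingSketch(vocab_size=100, num_ngrams=50, dim=16)
out = emb.lookup(np.array([[1, 2, 3]]), np.array([[0, 4, 7]]))
```

Both scaling strategies are visible here: growing `num_ngrams` enlarges only the offloaded cache, and a larger embedder model would only change how `ngram_cache` is populated offline; neither changes the per-token work at inference.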