Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing large language models, which rely on discrete token embeddings and struggle to model multi-token patterns explicitly. Conventional n-gram memory approaches keep an independent hash table for each n-gram order, which causes hash collisions and prevents structural sharing among nested n-grams. To overcome these issues, the paper proposes TN-gram, the first method to incorporate Canonical Polyadic (CP) tensor decomposition into n-gram embedding modeling. By sharing token-position latent factors and employing order-absorption vectors, TN-gram enables representation sharing across n-grams of varying orders. This approach substantially reduces parameter count while integrating seamlessly into the Transformer architecture. Experimental results show that TN-gram matches or surpasses state-of-the-art Engram-based methods on multiple language modeling benchmarks with significantly lower parameter overhead.

📝 Abstract

Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on separate hash tables for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing their underlying latent structure. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.
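To make the idea of shared CP factors concrete, here is a minimal Python sketch. It is not the paper's implementation: the class name `CPNgramMemory`, the factor shapes, the random initialization, and in particular the way the order vectors enter (as an elementwise rescaling of the rank components) are all assumptions chosen for illustration. The sketch only shows how one factor matrix per token position can serve every n-gram order, so that a bigram and the trigram that extends it reuse the same latent factors.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class CPNgramMemory:
    """Hypothetical CP-form n-gram memory (illustrative assumption, not the paper's code)."""

    def __init__(self, vocab_size, max_order, rank, dim, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.max_order = max_order
        self.rank = rank
        self.dim = dim
        # One (vocab_size x rank) factor per token position, shared by all orders.
        self.token_factors = [
            make_matrix(vocab_size, rank, rng) for _ in range(max_order)
        ]
        # One rank-sized vector per n-gram order (assumed role: rescale the
        # rank components so that a single set of factors serves every order).
        self.order_vectors = [
            [1.0 + rng.gauss(0.0, 0.1) for _ in range(rank)]
            for _ in range(max_order)
        ]
        # Maps the rank components to the embedding dimension.
        self.output_factor = make_matrix(rank, dim, rng)

    def latent(self, ngram):
        """Rank-sized coefficients: order vector times the product of position factors."""
        n = len(ngram)
        if not 1 <= n <= self.max_order:
            raise ValueError("n-gram order out of range")
        coeffs = list(self.order_vectors[n - 1])
        for position, token in enumerate(ngram):
            row = self.token_factors[position][token]
            coeffs = [c * f for c, f in zip(coeffs, row)]
        return coeffs

    def embed(self, ngram):
        """Embedding of an n-gram as a sum over rank-one components."""
        coeffs = self.latent(ngram)
        return [
            sum(coeffs[r] * self.output_factor[r][j] for r in range(self.rank))
            for j in range(self.dim)
        ]

    def num_parameters(self):
        """Parameter count grows linearly in vocab_size, with no hash table."""
        return (
            self.max_order * self.vocab_size * self.rank
            + self.max_order * self.rank
            + self.rank * self.dim
        )


memory = CPNgramMemory(vocab_size=50, max_order=3, rank=8, dim=4)
bigram_embedding = memory.embed((1, 2))
trigram_embedding = memory.embed((1, 2, 3))
```

In this sketch every n-gram maps to its own combination of factor rows, so there is no hash function and no bucket for two n-grams to collide in. The nested n-grams `(1, 2)` and `(1, 2, 3)` differ only in the order vector and the extra position-3 factor. Whether TN-gram combines its factors in exactly this way is not stated in the abstract.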

Problem

Research questions and friction points this paper is trying to address.

n-gram embeddings

hash collisions

latent sharing

language models

multi-token patterns

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensorized Engram

n-gram embeddings

Canonical Polyadic decomposition