Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

📅 2026-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing large language models: they rely on discrete token embeddings and struggle to model multi-token patterns explicitly. Conventional n-gram memory approaches keep an independent hash table for each n-gram order, which causes hash collisions and prevents structural sharing among nested n-grams. To overcome these issues, the paper proposes TN-gram, the first method to incorporate Canonical Polyadic (CP) tensor decomposition into n-gram embedding modeling. By sharing token-position latent factors and employing order-absorbing vectors, TN-gram shares representations across n-grams of different orders. This substantially reduces the parameter count while integrating seamlessly into the Transformer architecture. Experimental results show that TN-gram matches or surpasses state-of-the-art Engram-based methods on multiple language modeling benchmarks with significantly lower parameter overhead.
📝 Abstract
Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on a separate hash table for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing underlying latent structure. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.
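To make the idea of a CP-form n-gram embedding more concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not give the exact parameterization, so everything here is an assumption for illustration: the names `make_factors`, `ngram_embedding`, and `parameter_count`, the choice to index positions from the most recent token backwards, the rule that an n-gram of order n multiplies one order vector into the rank-wise product of its token-position rows, and the final projection matrix.

```python
import random


def make_factors(vocab_size, max_order, rank, dim, seed=0):
    """Randomly initialise the (hypothetical) shared CP factors."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    return {
        # One vocab_size x rank factor per position, shared by every n-gram order.
        "position": [rand(vocab_size, rank) for _ in range(max_order)],
        # One rank-sized vector per order (assumed role of the order-absorbing vectors).
        "order": rand(max_order, rank),
        # rank x dim matrix mapping CP coefficients to the embedding space.
        "output": rand(rank, dim),
    }


def ngram_embedding(factors, tokens):
    """Embed an n-gram as a sum of rank-one terms in CP form."""
    n = len(tokens)
    rank = len(factors["output"])
    dim = len(factors["output"][0])
    # Start from the vector for this order, then multiply in each token's row.
    coeff = list(factors["order"][n - 1])
    # Assumption: position 0 is the most recent token, so a bigram and the
    # trigram that extends it to the left reuse the same rows for shared tokens.
    for k, tok in enumerate(reversed(tokens)):
        row = factors["position"][k][tok]
        coeff = [c * a for c, a in zip(coeff, row)]
    return [
        sum(coeff[r] * factors["output"][r][j] for r in range(rank))
        for j in range(dim)
    ]


def parameter_count(vocab_size, max_order, rank, dim):
    """Number of scalars stored by the factors built in make_factors."""
    return max_order * vocab_size * rank + max_order * rank + rank * dim


factors = make_factors(vocab_size=50, max_order=3, rank=4, dim=8)
bigram = ngram_embedding(factors, [7, 3])
trigram = ngram_embedding(factors, [11, 7, 3])
print(len(bigram), len(trigram), parameter_count(50, 3, 4, 8))
```

In this sketch the bigram `[7, 3]` and the trigram `[11, 7, 3]` read the same factor rows for tokens 3 and 7, which is one way nested n-grams could share latent structure without any hashing. The factor count grows linearly in the vocabulary size and maximum order, whereas a dense trigram table over the same toy vocabulary would need `50**3 * 8` entries.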
Problem

Research questions and friction points this paper is trying to address.

n-gram embeddings
hash collisions
latent sharing
language models
multi-token patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensorized Engram
n-gram embeddings
Canonical Polyadic decomposition
shared latent factors
language models