LLMs Need Encoders for Semantic IDs Too

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current generative recommendation systems built on large language models (LLMs) add semantic ID (SID) tokens to the vocabulary as ordinary tokens, ignoring their hierarchical structure and the fact that each level's meaning depends on its prefix. The authors propose PrefixMem, a lightweight SID encoder that treats SIDs as a non-language modality, analogous to images or audio codec tokens, and produces prefix-conditioned representations at SID token positions through prefix n-gram memory tables. Like a vision encoder, PrefixMem can be pre-trained independently and then attached to any LLM for joint training. On large-scale Pinterest data across multiple LLM families and at matched training compute, PrefixMem improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative. The gains concentrate on hard examples where greedy decoding fails, with relative accuracy improvements of up to 77%.

📝 Abstract

Multimodal LLMs use dedicated encoders to bridge non-language modalities (vision encoders for images, depth models for audio codec tokens) because raw token embeddings alone cannot capture modality-specific structure. We argue that Semantic IDs (SIDs), the hierarchical codes used in generative recommendation, constitute another such modality: a SID level token's meaning depends on its prefix context, yet current systems simply add SID tokens to the vocabulary and rely on training to learn these context-dependent meanings from scratch. We propose PrefixMem, a lightweight SID encoder based on prefix n-gram memory tables that provides the LLM with structured, prefix-conditioned representations at SID token positions. Like vision encoders in multimodal LLMs, PrefixMem can be pre-trained independently and then attached to any LLM for joint training. We evaluate on large-scale data from Pinterest across multiple LLM families and show that PrefixMem improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at matched training compute. The encoder's benefit concentrates on hard examples where greedy decoding fails, with up to 77% relative accuracy gains, confirming that SID tokens benefit from a dedicated encoder just as other non-language modalities do.
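
To make the idea of a prefix n-gram memory table more concrete, here is a minimal Python sketch. The abstract does not specify the table layout, the hashing scheme, or how different n-gram orders are combined, so everything below is an assumption made for illustration: the class name `PrefixNgramMemory`, the parameters `table_size`, `dim`, and `max_order`, the polynomial hash, and the choice to sum the looked-up rows are all hypothetical, not the authors' implementation. The sketch only demonstrates the core property described above: the same code at a given SID level receives a different representation under a different prefix.

```python
import random


class PrefixNgramMemory:
    """Toy prefix n-gram memory (hypothetical design, not the paper's code)."""

    def __init__(self, table_size=4096, dim=8, max_order=3, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        self.dim = dim
        self.max_order = max_order
        # One shared table of embedding rows. In a real system these rows
        # would be learned; here they are random placeholders.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(table_size)
        ]

    def _slot(self, level, ngram):
        # Assumed hashing scheme: a simple polynomial hash over the level
        # index and the codes in the n-gram, reduced to a table slot.
        h = level + 1
        for code in ngram:
            h = (h * 1_000_003 + code + 1) % self.table_size
        return h

    def encode(self, sid):
        """Return one prefix-conditioned vector per SID level."""
        out = []
        for level in range(len(sid)):
            vec = [0.0] * self.dim
            # Sum the rows for every n-gram (order 1..max_order) that ends
            # at this level, so longer prefixes refine the representation.
            for order in range(1, min(self.max_order, level + 1) + 1):
                ngram = sid[level - order + 1 : level + 1]
                row = self.table[self._slot(level, ngram)]
                vec = [a + b for a, b in zip(vec, row)]
            out.append(vec)
        return out


memory = PrefixNgramMemory()
a = memory.encode((12, 7, 3))   # hypothetical 3-level SID
b = memory.encode((40, 7, 3))   # same last two codes, different first level
print(a[2] == b[2])             # the deepest-level vectors are compared
```

In a full model, vectors like these would presumably be projected to the LLM's hidden size and combined with the ordinary token embeddings at SID positions, but that integration step is not described in the abstract and is left out here.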

Problem

Research questions and friction points this paper is trying to address.

Semantic IDs

context-dependent meaning

hierarchical codes

multimodal LLMs

token embedding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic IDs

PrefixMem

prefix-conditioned representations