LLMs Need Encoders for Semantic IDs Too

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Current large language models (LLMs) treat semantic IDs (SIDs) as ordinary vocabulary tokens, ignoring their hierarchical structure and the fact that the meaning of each level depends on the prefix that precedes it. The authors propose PrefixMem, a lightweight, structure-aware encoder designed specifically for SIDs. PrefixMem treats SIDs as a non-linguistic input modality, analogous to images in multimodal LLMs, and produces prefix-conditioned representations through prefix n-gram memory tables. The module can be pre-trained on its own and then attached to any LLM for joint training. On large-scale Pinterest data and across multiple LLM families, at matched training compute, PrefixMem improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative. The gains concentrate on hard examples where greedy decoding fails, where accuracy improves by up to 77% relative.
πŸ“ Abstract
Multimodal LLMs use dedicated encoders to bridge non-language modalities (vision encoders for images, depth models for audio codec tokens) because raw token embeddings alone cannot capture modality-specific structure. We argue that Semantic IDs (SIDs), the hierarchical codes used in generative recommendation, constitute another such modality: a SID level token's meaning depends on its prefix context, yet current systems simply add SID tokens to the vocabulary and rely on training to learn these context-dependent meanings from scratch. We propose PrefixMem, a lightweight SID encoder based on prefix n-gram memory tables that provides the LLM with structured, prefix-conditioned representations at SID token positions. Like vision encoders in multimodal LLMs, PrefixMem can be pre-trained independently and then attached to any LLM for joint training. We evaluate on large-scale data from Pinterest across multiple LLM families and show that PrefixMem improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at matched training compute. The encoder's benefit concentrates on hard examples where greedy decoding fails, with up to 77% relative accuracy gains, confirming that SID tokens benefit from a dedicated encoder just as other non-language modalities do.
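To make the idea of a prefix n-gram memory table concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class name `PrefixNgramMemory`, the lazily filled dictionary, the pseudo-random vectors standing in for learned parameters, and the choice to sum the n-gram rows are all assumptions made for illustration. It only shows how the same level token can receive a different representation depending on its prefix.

```python
import random
from typing import Dict, List, Sequence, Tuple

Key = Tuple[int, Tuple[int, ...]]  # (SID level, n-gram of codes ending at that level)


class PrefixNgramMemory:
    """Hypothetical prefix n-gram memory table for Semantic IDs (illustrative only)."""

    def __init__(self, dim: int = 4, max_n: int = 3) -> None:
        self.dim = dim
        self.max_n = max_n
        # Rows are created on first use; a trained encoder would learn them instead.
        self.table: Dict[Key, List[float]] = {}

    def _row(self, level: int, ngram: Tuple[int, ...]) -> List[float]:
        key = (level, ngram)
        if key not in self.table:
            # Seed from the key so the stand-in vector is deterministic per entry.
            rng = random.Random(f"{level}|{ngram}")
            self.table[key] = [rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self.table[key]

    def encode(self, sid: Sequence[int]) -> List[List[float]]:
        """Return one vector per SID level, conditioned on that level's prefix."""
        out: List[List[float]] = []
        for level in range(len(sid)):
            vec = [0.0] * self.dim
            for n in range(1, self.max_n + 1):
                start = level - n + 1
                if start < 0:
                    break  # not enough prefix for an n-gram of this order
                ngram = tuple(sid[start:level + 1])
                row = self._row(level, ngram)
                vec = [v + r for v, r in zip(vec, row)]
            out.append(vec)
        return out


memory = PrefixNgramMemory()
a = memory.encode((3, 7, 5))
b = memory.encode((9, 7, 5))
# Code 7 at level 1 appears in both SIDs, but its vector differs
# because the level-0 prefix (3 vs. 9) differs.
print(a[1] != b[1])
```

In this sketch, a flat vocabulary embedding would correspond to using only the unigram row, which is identical for both SIDs at level 1. How the resulting vectors are combined with the LLM's own token embeddings at SID positions is not specified in the abstract, so no fusion step is shown here.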
Problem

Research questions and friction points this paper is trying to address.

Semantic IDs
context-dependent meaning
hierarchical codes
multimodal LLMs
token embedding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic IDs
PrefixMem
prefix-conditioned representations
multimodal LLMs
n-gram memory