🤖 AI Summary
This work addresses the limitations of conventional text encoders in generative recommendation systems, where fragmented tokenization disrupts semantic coherence in item descriptions and misaligns textual embeddings with the geometric structure of visual embeddings, thereby degrading multimodal fusion. To overcome this, the authors propose rendering item text as images and encoding them with a vision-based OCR model to construct Semantic IDs grounded in visual signals. This is the first systematic exploration of treating text as a visual modality for semantic representation, yielding more consistent and stable embeddings in both unimodal and multimodal generative recommendation settings. Experiments across four datasets and two backbone architectures show that OCR-based text representations match or surpass those of standard text encoders, remain robust even under extreme resolution compression, and improve cross-modal alignment stability and deployment efficiency.
📝 Abstract
Semantic ID learning is a key interface in Generative Recommendation (GR) models, mapping items to discrete identifiers grounded in side information, most commonly via a pretrained text encoder. However, such encoders are primarily optimized for well-formed natural language. In real-world recommendation data, item descriptions are often symbolic and attribute-centric, containing numerals, units, and abbreviations. Their tokenizers can break these signals into fragmented tokens, weakening semantic coherence and distorting relationships among attributes. Worse still, when moving to multimodal GR, relying on standard text encoders introduces an additional obstacle: text and image embeddings often exhibit mismatched geometric structures, making cross-modal fusion less effective and less stable. In this paper, we revisit representation design for Semantic ID learning by treating text as a visual signal. We conduct a systematic empirical study of OCR-based text representations, obtained by rendering item descriptions into images and encoding them with vision-based OCR models. Experiments across four datasets and two generative backbones show that OCR-text consistently matches or surpasses standard text embeddings for Semantic ID learning in both unimodal and multimodal settings. Furthermore, OCR-based Semantic IDs remain effective under extreme spatial-resolution compression, indicating strong robustness and efficiency in practical deployments.
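To make the render-then-encode pipeline concrete, below is a minimal sketch of the idea: item text is drawn onto an image, embedded with the encoder of a pretrained OCR model, and discretized into multi-level Semantic IDs. The specific model (TrOCR) and the residual k-means quantizer are illustrative assumptions standing in for the paper's vision-based OCR encoders and RQ-VAE-style quantizers, not the authors' exact setup.

```python
# Sketch of the render -> OCR-encode -> quantize pipeline.
# Model choice (TrOCR) and residual k-means are illustrative
# assumptions, not the paper's exact components.
import numpy as np
import torch
from PIL import Image, ImageDraw
from sklearn.cluster import KMeans
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed").eval()

def render_text(text: str, size=(384, 384)) -> Image.Image:
    """Draw an item description onto a blank canvas, treating text as pixels."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black")
    return img

@torch.no_grad()
def ocr_embed(texts: list[str]) -> np.ndarray:
    """Mean-pool the OCR vision encoder's patch features into one vector per item."""
    images = [render_text(t) for t in texts]
    pixels = processor(images=images, return_tensors="pt").pixel_values
    feats = model.encoder(pixel_values=pixels).last_hidden_state  # (B, patches, dim)
    return feats.mean(dim=1).numpy()

def residual_semantic_ids(embs: np.ndarray, levels: int = 3, codebook_size: int = 2) -> np.ndarray:
    """Assign multi-level Semantic IDs via residual k-means, a simplified
    stand-in for RQ-VAE-style quantization; the tiny codebook keeps the demo runnable."""
    residual, ids = embs.copy(), []
    for _ in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=10).fit(residual)
        ids.append(km.labels_)
        residual = residual - km.cluster_centers_[km.labels_]
    return np.stack(ids, axis=1)  # (num_items, levels), one code per level

# Symbolic, attribute-centric descriptions of the kind the abstract highlights.
items = ["USB-C cable 2m 60W", "AA batteries 8-pack 1.5V", "HDMI 2.1 cable 8K 48Gbps"]
print(residual_semantic_ids(ocr_embed(items)))
```

Note that because the text is consumed as pixels rather than subword tokens, numerals and units like "60W" or "1.5V" stay visually intact instead of being split by a tokenizer, which is the intuition behind the paper's approach.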