🤖 AI Summary
This work addresses the limitations of conventional text encoders in generative recommendation systems, where fragmented tokenization disrupts semantic coherence in item descriptions and misaligns textual embeddings with the geometric structure of visual embeddings, thereby degrading multimodal fusion. To overcome this, the authors propose rendering item text as images and encoding them with a vision-based OCR model to construct Semantic IDs grounded in visual signals. This is the first systematic exploration of treating text as a visual modality for semantic representation, yielding more consistent and stable embeddings in both unimodal and multimodal generative recommendation settings. Experiments across four datasets and two backbone architectures show that OCR-based text representations match or surpass those of standard text encoders, remain robust even under extreme resolution compression, and improve cross-modal alignment stability and deployment efficiency.
📝 Abstract
Semantic ID learning is a key interface in Generative Recommendation (GR) models, mapping items to discrete identifiers grounded in side information, most commonly via a pretrained text encoder. However, such encoders are primarily optimized for well-formed natural language. In real-world recommendation data, item descriptions are often symbolic and attribute-centric, containing numerals, units, and abbreviations. Their tokenizers can break these signals into fragmented tokens, weakening semantic coherence and distorting relationships among attributes. Worse still, when moving to multimodal GR, relying on standard text encoders introduces an additional obstacle: text and image embeddings often exhibit mismatched geometric structures, making cross-modal fusion less effective and less stable. In this paper, we revisit representation design for Semantic ID learning by treating text as a visual signal. We conduct a systematic empirical study of OCR-based text representations, obtained by rendering item descriptions into images and encoding them with vision-based OCR models. Experiments across four datasets and two generative backbones show that OCR-text consistently matches or surpasses standard text embeddings for Semantic ID learning in both unimodal and multimodal settings. Furthermore, OCR-based Semantic IDs remain effective under extreme spatial-resolution compression, indicating strong robustness and efficiency in practical deployments.
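To make the render-then-encode pipeline concrete, below is a minimal sketch of the idea: item text is drawn onto an image, embedded with the encoder of a pretrained OCR model, and discretized into multi-level Semantic IDs. The specific model (TrOCR) and the residual k-means quantizer are illustrative assumptions standing in for the paper's vision-based OCR encoders and RQ-VAE-style quantizers, not the authors' exact setup.

```python
# Sketch of the render -> OCR-encode -> quantize pipeline.
# Model choice (TrOCR) and residual k-means are illustrative
# assumptions, not the paper's exact components.
import numpy as np
import torch
from PIL import Image, ImageDraw
from sklearn.cluster import KMeans
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed").eval()

def render_text(text: str, size=(384, 384)) -> Image.Image:
    """Draw an item description onto a blank canvas, treating text as pixels."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black")
    return img

@torch.no_grad()
def ocr_embed(texts: list[str]) -> np.ndarray:
    """Mean-pool the OCR vision encoder's patch features into one vector per item."""
    images = [render_text(t) for t in texts]
    pixels = processor(images=images, return_tensors="pt").pixel_values
    feats = model.encoder(pixel_values=pixels).last_hidden_state  # (B, patches, dim)
    return feats.mean(dim=1).numpy()

def residual_semantic_ids(embs: np.ndarray, levels: int = 3, codebook_size: int = 2) -> np.ndarray:
    """Assign multi-level Semantic IDs via residual k-means, a simplified
    stand-in for RQ-VAE-style quantization; the tiny codebook keeps the demo runnable."""
    residual, ids = embs.copy(), []
    for _ in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=10).fit(residual)
        ids.append(km.labels_)
        residual = residual - km.cluster_centers_[km.labels_]
    return np.stack(ids, axis=1)  # (num_items, levels), one code per level

# Symbolic, attribute-centric descriptions of the kind the abstract highlights.
items = ["USB-C cable 2m 60W", "AA batteries 8-pack 1.5V", "HDMI 2.1 cable 8K 48Gbps"]
print(residual_semantic_ids(ocr_embed(items)))
```

Note that because the text is consumed as pixels rather than subword tokens, numerals and units like "60W" or "1.5V" stay visually intact instead of being split by a tokenizer, which is the intuition behind the paper's approach.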