Integrating Vision-Centric Text Understanding for Conversational Recommender Systems

📅 2026-01-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the limitations of traditional conversational recommender systems in accurately modeling user preferences from multi-turn heterogeneous text, where input-length constraints, stylistic inconsistency, and textual noise degrade understanding. To overcome these challenges, the authors propose STARCRS, which for the first time pairs a vision-centric screen-reading pathway with a large language model–driven pathway that reasons over a small set of key textual content, forming a dual-path architecture. A knowledge-anchored fusion mechanism combines visual token encoding, contrastive alignment, cross-attention, and adaptive gating for efficient multimodal context understanding. Experiments on two mainstream benchmarks show significant improvements in both recommendation accuracy and dialogue generation quality, confirming the method's effectiveness and robustness.
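To make the fusion step concrete, below is a minimal PyTorch sketch of how contrastive alignment, cross-attention, and adaptive gating could be combined in a dual-path block. The module name `KnowledgeAnchoredFusion`, the dimensions, and the InfoNCE-style loss are illustrative assumptions based on the summary above, not the authors' released implementation.

```python
# Minimal sketch of a knowledge-anchored dual-path fusion block.
# All names, dimensions, and the loss form are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAnchoredFusion(nn.Module):
    def __init__(self, dim: int = 768, n_heads: int = 8, temperature: float = 0.07):
        super().__init__()
        # Cross-attention: textual (LLM) tokens query the visual screen tokens.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Adaptive gate decides, per token, how much visual evidence to keep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.temperature = temperature

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor):
        # text_tokens:   (B, T, dim) from the LLM textual pathway
        # visual_tokens: (B, V, dim) from the screen-reading visual encoder
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        g = self.gate(torch.cat([text_tokens, attended], dim=-1))
        return g * attended + (1.0 - g) * text_tokens

    def contrastive_loss(self, text_pooled: torch.Tensor, visual_pooled: torch.Tensor):
        # InfoNCE-style alignment: the i-th dialogue's pooled text and
        # visual summaries form the positive pair within the batch.
        t = F.normalize(text_pooled, dim=-1)
        v = F.normalize(visual_pooled, dim=-1)
        logits = t @ v.t() / self.temperature
        targets = torch.arange(t.size(0), device=t.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Usage with random features, just to show shapes:
fusion = KnowledgeAnchoredFusion()
text = torch.randn(2, 16, 768)    # 16 key textual tokens per dialogue
screen = torch.randn(2, 50, 768)  # 50 visual tokens from the screen encoder
fused = fusion(text, screen)      # (2, 16, 768)
```

The gate interpolates per token between the LLM representation and the visually attended one, so noisy screen content can be down-weighted adaptively rather than discarded outright.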

📝 Abstract
Conversational Recommender Systems (CRSs) have attracted growing attention for their ability to deliver personalized recommendations through natural language interactions. To more accurately infer user preferences from multi-turn conversations, recent works increasingly expand conversational context (e.g., by incorporating diverse entity information or retrieving related dialogues). While such context enrichment can assist preference modeling, it also introduces longer and more heterogeneous inputs, leading to practical issues such as input length constraints, text style inconsistency, and irrelevant textual noise, thereby raising the demand for stronger language understanding ability. In this paper, we propose STARCRS, a Screen-Text-AwaRe Conversational Recommender System that integrates two complementary text understanding modes: (1) a screen-reading pathway that encodes auxiliary textual information as visual tokens, mimicking skim reading on a screen, and (2) an LLM-based textual pathway that focuses on a limited set of critical content for fine-grained reasoning. We design a knowledge-anchored fusion framework that combines contrastive alignment, cross-attention interaction, and adaptive gating to integrate the two modes for improved preference modeling and response generation. Extensive experiments on two widely used benchmarks demonstrate that STARCRS consistently improves both recommendation accuracy and generated response quality.
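The abstract describes the screen-reading pathway only at a high level. One plausible realization is sketched below, assuming the auxiliary context is rasterized onto a fixed-size canvas and encoded with an off-the-shelf CLIP vision backbone; the rendering layout, the model choice, and the `render_text_as_screen` helper are all hypothetical, not the paper's exact pipeline.

```python
# Minimal sketch of a screen-reading pathway: render auxiliary text onto a
# synthetic "screen" image, then encode it into visual tokens with a
# standard vision encoder. Layout and model choice are assumptions.
import textwrap

import torch
from PIL import Image, ImageDraw
from transformers import CLIPImageProcessor, CLIPVisionModel

def render_text_as_screen(text: str, size: int = 224) -> Image.Image:
    """Draw wrapped text on a white canvas, mimicking on-screen reading."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    y = 4
    for line in textwrap.wrap(text, width=40)[:18]:  # crop overflow like a screen
        draw.text((4, y), line, fill="black")
        y += 12
    return img

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

aux_context = "Retrieved context: the user praised slow-burn sci-fi films and disliked horror."
pixels = processor(images=render_text_as_screen(aux_context), return_tensors="pt")
with torch.no_grad():
    visual_tokens = encoder(**pixels).last_hidden_state  # (1, 50, 768) patch tokens
```

Cropping overflowing lines mirrors the skim-reading idea: lengthy auxiliary context survives only as coarse visual tokens rather than consuming the LLM's limited input budget.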
Problem

Research questions and friction points this paper is trying to address.

Conversational Recommender Systems
Text Understanding
Input Heterogeneity
Language Understanding
Preference Modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conversational Recommender Systems
Vision-Centric Text Understanding
Multimodal Fusion
Large Language Models
Preference Modeling
👥 Authors

Wei Yuan
The University of Queensland
Natural Language Processing · Recommendation · Urban Computing · Edge Intelligence

Shutong Qiao
The University of Queensland

Tong Chen
The University of Queensland

Quoc Viet Hung Nguyen
Griffith University

Zi-Liang Huang
The University of Queensland

Hongzhi Yin
Professor and ARC Future Fellow, University of Queensland
Recommender System · Graph Learning · Spatial-temporal Prediction · Edge Intelligence · LLM