Integrating Vision-Centric Text Understanding for Conversational Recommender Systems

📅 2026-01-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the limitations of traditional conversational recommender systems in accurately modeling user preferences from multi-turn heterogeneous text, where input-length constraints, stylistic inconsistency, and textual noise degrade understanding. To overcome these challenges, the authors propose STARCRS, which for the first time pairs a vision-centric screen-reading pathway with a large language model–driven pathway that reasons over a small set of key textual content, forming a dual-path architecture. A knowledge-anchored fusion mechanism combines visual token encoding, contrastive alignment, cross-attention, and adaptive gating for efficient multimodal context understanding. Experiments on two mainstream benchmarks show significant improvements in both recommendation accuracy and dialogue generation quality, confirming the method's effectiveness and robustness.
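To make the fusion step concrete, below is a minimal PyTorch sketch of how contrastive alignment, cross-attention, and adaptive gating could be combined in a dual-path block. The module name `KnowledgeAnchoredFusion`, the dimensions, and the InfoNCE-style loss are illustrative assumptions based on the summary above, not the authors' released implementation.

```python
# Minimal sketch of a knowledge-anchored dual-path fusion block.
# All names, dimensions, and the loss form are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAnchoredFusion(nn.Module):
    def __init__(self, dim: int = 768, n_heads: int = 8, temperature: float = 0.07):
        super().__init__()
        # Cross-attention: textual (LLM) tokens query the visual screen tokens.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Adaptive gate decides, per token, how much visual evidence to keep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.temperature = temperature

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor):
        # text_tokens:   (B, T, dim) from the LLM textual pathway
        # visual_tokens: (B, V, dim) from the screen-reading visual encoder
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        g = self.gate(torch.cat([text_tokens, attended], dim=-1))
        return g * attended + (1.0 - g) * text_tokens

    def contrastive_loss(self, text_pooled: torch.Tensor, visual_pooled: torch.Tensor):
        # InfoNCE-style alignment: the i-th dialogue's pooled text and
        # visual summaries form the positive pair within the batch.
        t = F.normalize(text_pooled, dim=-1)
        v = F.normalize(visual_pooled, dim=-1)
        logits = t @ v.t() / self.temperature
        targets = torch.arange(t.size(0), device=t.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

# Usage with random features, just to show shapes:
fusion = KnowledgeAnchoredFusion()
text = torch.randn(2, 16, 768)    # 16 key textual tokens per dialogue
screen = torch.randn(2, 50, 768)  # 50 visual tokens from the screen encoder
fused = fusion(text, screen)      # (2, 16, 768)
```

The gate interpolates per token between the LLM representation and the visually attended one, so noisy screen content can be down-weighted adaptively rather than discarded outright.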

📝 Abstract
Conversational Recommender Systems (CRSs) have attracted growing attention for their ability to deliver personalized recommendations through natural language interactions. To more accurately infer user preferences from multi-turn conversations, recent works increasingly expand conversational context (e.g., by incorporating diverse entity information or retrieving related dialogues). While such context enrichment can assist preference modeling, it also introduces longer and more heterogeneous inputs, leading to practical issues such as input length constraints, text style inconsistency, and irrelevant textual noise, thereby raising the demand for stronger language understanding ability. In this paper, we propose STARCRS, a Screen-Text-AwaRe Conversational Recommender System that integrates two complementary text understanding modes: (1) a screen-reading pathway that encodes auxiliary textual information as visual tokens, mimicking skim reading on a screen, and (2) an LLM-based textual pathway that focuses on a limited set of critical content for fine-grained reasoning. We design a knowledge-anchored fusion framework that combines contrastive alignment, cross-attention interaction, and adaptive gating to integrate the two modes for improved preference modeling and response generation. Extensive experiments on two widely used benchmarks demonstrate that STARCRS consistently improves both recommendation accuracy and generated response quality.
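The abstract describes the screen-reading pathway only at a high level. One plausible realization is sketched below, assuming the auxiliary context is rasterized onto a fixed-size canvas and encoded with an off-the-shelf CLIP vision backbone; the rendering layout, the model choice, and the `render_text_as_screen` helper are all hypothetical, not the paper's exact pipeline.

```python
# Minimal sketch of a screen-reading pathway: render auxiliary text onto a
# synthetic "screen" image, then encode it into visual tokens with a
# standard vision encoder. Layout and model choice are assumptions.
import textwrap

import torch
from PIL import Image, ImageDraw
from transformers import CLIPImageProcessor, CLIPVisionModel

def render_text_as_screen(text: str, size: int = 224) -> Image.Image:
    """Draw wrapped text on a white canvas, mimicking on-screen reading."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    y = 4
    for line in textwrap.wrap(text, width=40)[:18]:  # crop overflow like a screen
        draw.text((4, y), line, fill="black")
        y += 12
    return img

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

aux_context = "Retrieved context: the user praised slow-burn sci-fi films and disliked horror."
pixels = processor(images=render_text_as_screen(aux_context), return_tensors="pt")
with torch.no_grad():
    visual_tokens = encoder(**pixels).last_hidden_state  # (1, 50, 768) patch tokens
```

Cropping overflowing lines mirrors the skim-reading idea: lengthy auxiliary context survives only as coarse visual tokens rather than consuming the LLM's limited input budget.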
Problem

Research questions and friction points this paper is trying to address.

Conversational Recommender Systems
Text Understanding
Input Heterogeneity
Language Understanding
Preference Modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conversational Recommender Systems
Vision-Centric Text Understanding
Multimodal Fusion
Large Language Models
Preference Modeling
👥 Authors

Wei Yuan
The University of Queensland
Natural Language Processing · Recommendation · Urban Computing · Edge Intelligence

Shutong Qiao
The University of Queensland

Tong Chen
The University of Queensland

Quoc Viet Hung Nguyen
Griffith University

Zi-Liang Huang
The University of Queensland

Hongzhi Yin
Professor and ARC Future Fellow, University of Queensland
Recommender System · Graph Learning · Spatial-temporal Prediction · Edge Intelligence · LLM