Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic

📅 2025-10-08
🤖 AI Summary
This study investigates topic-level patterns between user prompts and LLM responses in the LMSYS-Chat-1M dataset and their correlation with human model preferences. Method: We pioneer the application of BERTopic to multilingual LLM comparative evaluation data, integrating dialogue cleaning, multilingual preprocessing, and topic distribution visualization to construct a model–topic preference matrix. Contribution/Results: We identify 29 semantically coherent topics and discover consistent user preference advantages for specific LLMs across domains such as technology, programming, and ethics—revealing a topic-dependent distribution of model strengths. This work establishes an interpretable, topic-level analytical framework for LLM capability assessment and enables domain-aware model selection and targeted fine-tuning, thereby advancing personalized LLM deployment.

📝 Abstract
This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label indicating which output the user judged better. The main objective is to uncover thematic patterns in these conversations and to examine their relation to user preferences, in particular whether certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed to handle multilingual variation, balance dialogue turns, and clean noisy or redacted data. BERTopic extracted 29 coherent topics, including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment, using visualizations such as inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.
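As a rough illustration of the dialogue cleaning the abstract describes, here is a minimal sketch in Python. The redaction placeholder pattern (`NAME_1`-style tokens) and the `role`/`content` conversation schema are assumptions for illustration, not the paper's exact implementation:

```python
import re

def clean_turn(text: str) -> str:
    """Clean one dialogue turn: drop redaction placeholders, collapse noise.

    LMSYS-Chat-1M redacts PII with placeholder tokens; the exact pattern
    below (NAME_1, EMAIL_2, ...) is an illustrative assumption.
    """
    text = re.sub(r"\b(?:NAME|EMAIL|PHONE)_\d+\b", "", text)  # hypothetical redaction tokens
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace and newlines
    return text.strip()

def first_user_prompt(conversation: list[dict]) -> str:
    """Extract and clean the first user turn from a list of {'role', 'content'} dicts."""
    for turn in conversation:
        if turn.get("role") == "user":
            return clean_turn(turn.get("content", ""))
    return ""

# Toy conversation in the assumed schema; the cleaned prompts would be the
# documents fed to BERTopic for topic extraction.
conversations = [
    [{"role": "user", "content": "Hi NAME_1,  explain   transformers"},
     {"role": "assistant", "content": "Transformers are..."}],
]
docs = [first_user_prompt(c) for c in conversations]
```

The cleaned `docs` list is what a topic model would consume; keeping only the first user turn is one simple way to balance dialogue turns across conversations of different lengths.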
Problem

Research questions and friction points this paper is trying to address.

Uncovering thematic patterns in multilingual LLM conversations
Examining relationships between topics and user preferences
Identifying model-topic alignment trends for LLM optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applied BERTopic for topic modeling
Used multilingual preprocessing pipeline
Analyzed topic-preference relationships with visualizations
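The model–topic preference matrix mentioned above can be sketched as a simple win-rate aggregation. The record fields (`topic`, `model_a`, `model_b`, `winner`) are an illustrative schema for topic-labeled battles, not the dataset's exact one:

```python
from collections import defaultdict

def preference_matrix(records):
    """Aggregate per-(model, topic) battle outcomes into win rates.

    Each record is assumed to look like
    {"topic": "programming", "model_a": "vicuna-13b",
     "model_b": "alpaca-13b", "winner": "model_a"} -- ties are skipped.
    """
    wins = defaultdict(int)     # (model, topic) -> battles won
    battles = defaultdict(int)  # (model, topic) -> battles fought
    for r in records:
        topic = r["topic"]
        for side in ("model_a", "model_b"):
            battles[(r[side], topic)] += 1
        if r["winner"] in ("model_a", "model_b"):
            wins[(r[r["winner"]], topic)] += 1
    return {key: wins[key] / n for key, n in battles.items()}

# Toy battles: in a real analysis the topic label would come from BERTopic.
records = [
    {"topic": "programming", "model_a": "vicuna-13b", "model_b": "alpaca-13b", "winner": "model_a"},
    {"topic": "programming", "model_a": "vicuna-13b", "model_b": "koala-13b", "winner": "model_a"},
    {"topic": "ethics", "model_a": "vicuna-13b", "model_b": "alpaca-13b", "winner": "model_b"},
]
matrix = preference_matrix(records)
# matrix[("vicuna-13b", "programming")] -> 1.0
```

A matrix like this, with models as rows and topics as columns, is what a model-versus-topic visualization would render as a heatmap to reveal topic-dependent model strengths.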
Abhay Bhandarkar
Ramaiah Institute of Technology, Bengaluru, India
Gaurav Mishra
Senior Scientist at Merck
Computer Vision · Face Recognition · Kinship Verification · Explainable AI
Khushi Juchani
Ecofy Finance Private Limited, Bengaluru, India
Harsh Singhal
Ramaiah Institute of Technology, Bengaluru, India