Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic

📅 2025-10-08
🤖 AI Summary
This study investigates topic-level patterns between user prompts and LLM responses in the LMSYS-Chat-1M dataset and their correlation with human model preferences. Method: We pioneer the application of BERTopic to multilingual LLM comparative evaluation data, integrating dialogue cleaning, multilingual preprocessing, and topic distribution visualization to construct a model–topic preference matrix. Contribution/Results: We identify 29 semantically coherent topics and discover consistent user preference advantages for specific LLMs across domains such as technology, programming, and ethics—revealing a topic-dependent distribution of model strengths. This work establishes an interpretable, topic-level analytical framework for LLM capability assessment and enables domain-aware model selection and targeted fine-tuning, thereby advancing personalized LLM deployment.

📝 Abstract
This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label indicating which output the user judged better. The main objective is to uncover thematic patterns in these conversations and to examine their relation to user preferences, in particular whether certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed to handle multilingual variation, balance dialogue turns, and clean noisy or redacted data. BERTopic extracted 29 coherent topics, including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment, using visualizations such as inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.
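As a rough illustration of the dialogue cleaning the abstract describes, here is a minimal sketch in Python. The redaction placeholder pattern (`NAME_1`-style tokens) and the `role`/`content` conversation schema are assumptions for illustration, not the paper's exact implementation:

```python
import re

def clean_turn(text: str) -> str:
    """Clean one dialogue turn: drop redaction placeholders, collapse noise.

    LMSYS-Chat-1M redacts PII with placeholder tokens; the exact pattern
    below (NAME_1, EMAIL_2, ...) is an illustrative assumption.
    """
    text = re.sub(r"\b(?:NAME|EMAIL|PHONE)_\d+\b", "", text)  # hypothetical redaction tokens
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace and newlines
    return text.strip()

def first_user_prompt(conversation: list[dict]) -> str:
    """Extract and clean the first user turn from a list of {'role', 'content'} dicts."""
    for turn in conversation:
        if turn.get("role") == "user":
            return clean_turn(turn.get("content", ""))
    return ""

# Toy conversation in the assumed schema; the cleaned prompts would be the
# documents fed to BERTopic for topic extraction.
conversations = [
    [{"role": "user", "content": "Hi NAME_1,  explain   transformers"},
     {"role": "assistant", "content": "Transformers are..."}],
]
docs = [first_user_prompt(c) for c in conversations]
```

The cleaned `docs` list is what a topic model would consume; keeping only the first user turn is one simple way to balance dialogue turns across conversations of different lengths.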
Problem

Research questions and friction points this paper is trying to address.

Uncovering thematic patterns in multilingual LLM conversations
Examining relationships between topics and user preferences
Identifying model-topic alignment trends for LLM optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applied BERTopic for topic modeling
Used multilingual preprocessing pipeline
Analyzed topic-preference relationships with visualizations
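The model–topic preference matrix mentioned above can be sketched as a simple win-rate aggregation. The record fields (`topic`, `model_a`, `model_b`, `winner`) are an illustrative schema for topic-labeled battles, not the dataset's exact one:

```python
from collections import defaultdict

def preference_matrix(records):
    """Aggregate per-(model, topic) battle outcomes into win rates.

    Each record is assumed to look like
    {"topic": "programming", "model_a": "vicuna-13b",
     "model_b": "alpaca-13b", "winner": "model_a"} -- ties are skipped.
    """
    wins = defaultdict(int)     # (model, topic) -> battles won
    battles = defaultdict(int)  # (model, topic) -> battles fought
    for r in records:
        topic = r["topic"]
        for side in ("model_a", "model_b"):
            battles[(r[side], topic)] += 1
        if r["winner"] in ("model_a", "model_b"):
            wins[(r[r["winner"]], topic)] += 1
    return {key: wins[key] / n for key, n in battles.items()}

# Toy battles: in a real analysis the topic label would come from BERTopic.
records = [
    {"topic": "programming", "model_a": "vicuna-13b", "model_b": "alpaca-13b", "winner": "model_a"},
    {"topic": "programming", "model_a": "vicuna-13b", "model_b": "koala-13b", "winner": "model_a"},
    {"topic": "ethics", "model_a": "vicuna-13b", "model_b": "alpaca-13b", "winner": "model_b"},
]
matrix = preference_matrix(records)
# matrix[("vicuna-13b", "programming")] -> 1.0
```

A matrix like this, with models as rows and topics as columns, is what a model-versus-topic visualization would render as a heatmap to reveal topic-dependent model strengths.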
Abhay Bhandarkar
Ramaiah Institute of Technology, Bengaluru, India
Gaurav Mishra
Senior Scientist at Merck
Computer Vision · Face Recognition · Kinship Verification · Explainable AI
Khushi Juchani
Ecofy Finance Private Limited, Bengaluru, India
Harsh Singhal
Ramaiah Institute of Technology, Bengaluru, India