ShareChat: A Dataset of Chatbot Conversations in the Wild

📅 2025-12-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing public datasets treat large language models (LLMs) as generic text generators, neglecting how platform-specific interfaces and functionalities shape authentic human–AI interaction. To address this gap, we introduce the first large-scale, cross-platform real-world chat dataset, covering five major platforms (including ChatGPT and Claude) with 143K dialogues and 660K turns. The dataset preserves native interaction artifacts: reasoning traces, clickable URLs, execution outputs, code snippets, and multilingual context across 101 languages. We propose a Native Interaction Feature Preservation framework that integrates URL-driven automated crawling, multilingual detection, dialogue completeness validation, and timestamp normalization. Leveraging this dataset, we conduct three empirical analyses: (1) user intent satisfaction assessment, (2) content citation behavior modeling, and (3) temporal usage pattern tracking, collectively enhancing the authenticity, granularity, and reproducibility of human–model interaction research.
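The timestamp-normalization step of the framework can be pictured with a small sketch. The paper does not specify the platforms' raw formats, so the format list and function name below are illustrative assumptions: the idea is simply that timestamps scraped from differently formatted share pages get mapped to one canonical UTC ISO 8601 form.

```python
from datetime import datetime, timezone

# Hypothetical set of raw formats one might encounter when crawling
# shared-conversation pages; the actual formats are assumptions.
KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%fZ",   # ISO with milliseconds and Z suffix
    "%Y-%m-%d %H:%M:%S",       # plain date-time
    "%d %b %Y %H:%M",          # e.g. "03 Oct 2025 14:22"
]

def normalize_timestamp(raw: str) -> str:
    """Parse a raw timestamp string and return canonical UTC ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            dt = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
            return dt.isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp format: {raw!r}")

print(normalize_timestamp("2025-10-03 14:22:05"))
# → 2025-10-03T14:22:05+00:00
```

A real pipeline would also need per-platform timezone handling, which this sketch omits.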

📝 Abstract
While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.
Problem

Research questions and friction points this paper is trying to address.

Addresses the lack of interface context in existing chatbot datasets
Presents a cross-platform corpus preserving native platform affordances
Enables analysis of user intent, citations, and evolving usage patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collects cross-platform chatbot conversations preserving native interface elements
Includes reasoning traces, source links, and code artifacts from shared URLs
Enables analysis of user intent satisfaction and citation behaviors
Yueru Yan
Indiana University, Bloomington, USA
Tuc Nguyen
Indiana University, Bloomington, USA
Bo Su
Indiana University, Bloomington, USA
Melissa Lieffers
Indiana University, Bloomington, USA
Thai Le
Assistant Professor in Computer Science, Indiana University
Machine learning & AI