🤖 AI Summary
Existing public datasets treat large language models (LLMs) as generic text generators, neglecting how platform-specific interfaces and functionalities shape authentic human–AI interaction. To address this gap, we introduce the first large-scale, cross-platform, real-world chat dataset, covering five major platforms (including ChatGPT and Claude) with 143K dialogues and 660K turns. The dataset preserves native interaction artifacts: reasoning traces, clickable URLs, execution outputs, code snippets, and multilingual context spanning 101 languages. We propose a Native Interaction Feature Preservation framework that integrates URL-driven automated crawling, multilingual detection, dialogue completeness validation, and timestamp normalization. Leveraging this dataset, we conduct three empirical analyses: (1) user intent satisfaction assessment, (2) content citation behavior modeling, and (3) temporal usage pattern tracking. Together, these analyses enhance the authenticity, granularity, and reproducibility of human–model interaction research.
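Two of the framework's preprocessing steps, dialogue completeness validation and timestamp normalization, can be illustrated with a minimal sketch. The record format, function names, and the accepted timestamp formats below are illustrative assumptions, not the paper's actual implementation:

```python
from datetime import datetime, timezone


def normalize_timestamp(ts: str) -> str:
    """Normalize a platform timestamp to UTC ISO 8601.

    The candidate formats are hypothetical examples of what different
    platforms might emit; a real pipeline would enumerate each source's
    actual format.
    """
    for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S", "%b %d, %Y"):
        try:
            dt = datetime.strptime(ts, fmt)
        except ValueError:
            continue
        # Assume naive timestamps are already UTC.
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp: {ts!r}")


def is_complete(dialogue: list[dict]) -> bool:
    """Completeness check (assumed criterion): turns strictly alternate,
    start with the user, and end with an assistant reply."""
    if not dialogue:
        return False
    roles = [turn["role"] for turn in dialogue]
    alternates = all(a != b for a, b in zip(roles, roles[1:]))
    return roles[0] == "user" and roles[-1] == "assistant" and alternates
```

For example, `normalize_timestamp("2023-04-12 08:30:00")` yields `"2023-04-12T08:30:00+00:00"`, and a dialogue that ends on an unanswered user turn fails `is_complete`.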
📝 Abstract
While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user–LLM interactions in the wild.