🤖 AI Summary
This study identifies a structural imbalance in data sharing across disciplines: high reuse rates in STEM fields contrast sharply with slow adoption in the humanities and social sciences, while persistent undercitation of datasets impedes evidence-based policy and infrastructure development. Leveraging full-text PubMed articles, the authors construct the first multidisciplinary dataset that supports simultaneous identification of data mentions and classification of data-related intents, integrating natural language processing, full-text pattern recognition, cross-disciplinary bibliometrics, and time-series modeling. Key findings include: (1) a marked acceleration in data publication after 2012; (2) the highest data publishing activity in business/management and creative arts, yet the highest reuse in the biological and agricultural sciences; and (3) consistently low dataset citation rates across most disciplines, alongside persistent obstacles in discoverability and compatibility for reuse. These empirically grounded insights can inform data governance frameworks and support formal recognition of datasets as independent scholarly outputs.
📝 Abstract
Data sharing is fundamental to scientific progress, enhancing transparency, reproducibility, and innovation across disciplines. Despite its growing significance, the variability of data-sharing practices across research fields remains insufficiently understood, limiting the development of effective policies and infrastructure. This study investigates the evolving landscape of data-sharing practices, specifically focusing on the intentions behind data release, reuse, and referencing. Leveraging the PubMed open dataset, we developed a model to identify mentions of datasets in the full text of publications. Our analysis reveals that data release is the most prevalent sharing mode, particularly in fields such as Commerce, Management, and the Creative Arts. In contrast, STEM fields, especially the Biological and Agricultural Sciences, show significantly higher rates of data reuse, while the humanities and social sciences have been slower to adopt these practices. Notably, dataset referencing remains low across most disciplines, suggesting that datasets are not yet fully recognized as research outputs. A temporal analysis highlights an acceleration in data releases after 2012, yet obstacles such as data discoverability and compatibility for reuse persist. Our findings can inform institutional and policy-level efforts to improve data-sharing practices, enhance dataset accessibility, and promote broader adoption of open science principles across research domains.