Research Data in Scientific Publications: A Cross-Field Analysis

📅 2025-02-03
🤖 AI Summary
This study examines how data-sharing practices differ across research fields: STEM fields show high rates of data reuse, the humanities and social sciences adopt these practices more slowly, and datasets remain undercited across most disciplines, which hampers evidence-based policy and infrastructure development. Using full-text articles from the PubMed open dataset, the authors develop a model that identifies dataset mentions and classifies the intent behind each one (release, reuse, or referencing). Key findings include: (1) an acceleration in data releases after 2012; (2) data release is most prevalent in Commerce, Management, and the Creative Arts, while reuse is highest in the Biological and Agricultural Sciences; and (3) consistently low dataset referencing rates, alongside persistent obstacles in discoverability and compatibility for reuse. These findings can inform data-sharing policy and support the recognition of datasets as research outputs.

📝 Abstract
Data sharing is fundamental to scientific progress, enhancing transparency, reproducibility, and innovation across disciplines. Despite its growing significance, the variability of data-sharing practices across research fields remains insufficiently understood, limiting the development of effective policies and infrastructure. This study investigates the evolving landscape of data-sharing practices, specifically focusing on the intentions behind data release, reuse, and referencing. Leveraging the PubMed open dataset, we developed a model to identify mentions of datasets in the full text of publications. Our analysis reveals that data release is the most prevalent sharing mode, particularly in fields such as Commerce, Management, and the Creative Arts. In contrast, STEM fields, especially the Biological and Agricultural Sciences, show significantly higher rates of data reuse, while the humanities and social sciences are slower to adopt these practices. Notably, dataset referencing remains low across most disciplines, suggesting that datasets are not yet fully recognized as research outputs. A temporal analysis highlights an acceleration in data releases after 2012, yet obstacles such as data discoverability and compatibility for reuse persist. Our findings can inform institutional and policy-level efforts to improve data-sharing practices, enhance dataset accessibility, and promote broader adoption of open science principles across research domains.
Problem

Research questions and friction points this paper is trying to address.

Data Sharing
Interdisciplinary Differences
Data Infrastructure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Sharing Patterns
Interdisciplinary Analysis
Open Science Policy Implications
Puyu Yang
Institute for Logic, Language and Computation (ILLC), University of Amsterdam, 1098XH, Amsterdam, The Netherlands
Giovanni Colavizza
University of Copenhagen and University of Bologna
Digital Humanities, Data Science, Artificial Intelligence