Implications of construction decisions in keyword-based networks: an empirical assessment

📅 2025-02-28
🤖 AI Summary
This study identifies a critical reproducibility crisis in keyword co-occurrence network analysis, stemming from the high sensitivity of analytical outcomes to preprocessing decisions—including keyword extraction methods, co-occurrence definitions (e.g., document- vs. sentence-level), and filtering thresholds—exacerbated by widespread lack of transparency in network construction protocols. Through systematic multi-scenario experiments on bibliometric and social media datasets, we quantitatively assess how alternative preprocessing pathways perturb key network topology metrics: degree centrality, clustering coefficient, and modularity exhibit fluctuations exceeding 200% under varying configurations. To address this, we introduce the “data provenance transparency” framework, advocating standardized, granular documentation of all network construction decisions. Our findings deliver both a theoretical caution against methodological opacity in complex network science and concrete, actionable guidelines for enhancing analytical credibility and interpretability.

📝 Abstract
The large amounts of data continuously generated online offer opportunities to identify and analyse trends in various aspects of society. For instance, data from online social media are frequently used as a means of analysing informal interactions, opinions, and feelings of groups of people. Additionally, bibliometric data can be used to investigate more formal trends that occur in scientific research. A popular approach to analysing such complex semi-structured data is the construction of complex networks based on keywords or concept extraction. However, such keyword-based complex network data are often shared in a preprocessed form, with little information about the underlying process used to construct them. Indeed, key decisions are normally made at an early stage in the construction of complex networks from raw data, and can have a significant impact on subsequent analysis and interpretation. In this paper, we highlight the sensitivity of results to data preprocessing decisions by looking at two different case studies that employ networks constructed from underlying semi-structured data. The experiments conducted show high sensitivity to data preprocessing for many commonly adopted metrics. These results demonstrate the need for transparent reporting of data lineage and preprocessing decisions.
Problem

Research questions and friction points this paper is trying to address.

Impact of preprocessing decisions on keyword-based network analysis.
Sensitivity of network metrics to data preprocessing choices.
Need for transparency in reporting data preprocessing steps.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keyword-based complex network construction
Sensitivity analysis of data preprocessing
Transparent reporting of data lineage
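
To make the sensitivity concrete, the following is a minimal sketch of a document-level keyword co-occurrence network built under different preprocessing settings. It is not the authors' pipeline: the toy corpus `DOCS`, the functions `build_network`, `node_count`, and `density`, and the two options shown (case normalisation and a minimum edge-weight threshold) are hypothetical stand-ins for the kinds of construction decisions the paper discusses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is a list of keywords.
DOCS = [
    ["Complex Networks", "provenance", "bibliometrics"],
    ["complex networks", "Provenance", "social media"],
    ["complex networks", "bibliometrics"],
    ["social media", "sentiment"],
]


def build_network(docs, lowercase=True, min_weight=1):
    """Build document-level co-occurrence edges as {(a, b): weight}.

    lowercase  -- merge keywords that differ only by case (assumed option)
    min_weight -- drop edges seen in fewer documents than this (assumed option)
    """
    weights = Counter()
    for doc in docs:
        keywords = {k.lower() if lowercase else k for k in doc}
        # Sorting makes each unordered pair map to one canonical tuple.
        for a, b in combinations(sorted(keywords), 2):
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_weight}


def node_count(edges):
    """Number of keywords that appear in at least one retained edge."""
    return len({node for edge in edges for node in edge})


def density(edges):
    """Fraction of possible keyword pairs that are connected."""
    n = node_count(edges)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))


# Compare the same raw data under different construction decisions.
for lowercase in (True, False):
    for min_weight in (1, 2):
        net = build_network(DOCS, lowercase=lowercase, min_weight=min_weight)
        print(
            f"lowercase={lowercase!s:5} min_weight={min_weight} "
            f"nodes={node_count(net)} edges={len(net)} "
            f"density={density(net):.3f}"
        )
```

Even on four documents, the node count, edge count, and density differ across the four configurations, which is why a shared network is hard to interpret unless settings of this kind are reported alongside it.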
Authors

James Nevin
Informatics Institute, University of Amsterdam, Amsterdam, Netherlands
S. F. Pileggi
Faculty of Engineering and IT, University of Technology Sydney, Sydney, Australia
Michael Lees
Associate Professor, University of Amsterdam
Modeling and Simulation, Agent-based modeling, Complex Systems, Distributed Simulation
Paul Groth
Professor, INDE Lab, University of Amsterdam
provenance, information integration, web data, knowledge graphs, data engineering