LEAD: LLM-enhanced Engine for Author Disambiguation

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses author name disambiguation (AND) across two heterogeneous academic data sources, CercaUniversità and Scopus. We propose LEAD, a hybrid framework that combines semantic features extracted by large language models (LLMs) with structural signals from co-authorship and citation networks, namely Label Spreading and Bibliographic Coupling. Evaluated on a gold standard of 606 ambiguous cases, LEAD achieves an F1-score of 96.7% and an accuracy of 95.7%, the best result among the five compared methods, at lower computational cost than full LLM models. The key contribution is a selective hybrid strategy that integrates LLM-driven semantic understanding with graph-structured evidence, offering a robust and scalable approach to cross-database author identification and more reliable bibliometric assessment.

📝 Abstract
Author Name Disambiguation (AND) is a long-standing challenge in bibliometrics and scientometrics, as name ambiguity undermines the accuracy of bibliographic databases and the reliability of research evaluation. This study addresses the problem of cross-source disambiguation by linking academic career records from CercaUniversità, the official registry of Italian academics, with author profiles in Scopus. We introduce LEAD (LLM-enhanced Engine for Author Disambiguation), a novel hybrid framework that combines semantic features extracted through Large Language Models (LLMs) with structural evidence derived from co-authorship and citation networks. Using a gold standard of 606 ambiguous cases, we compare five methods: (i) Label Spreading on co-authorship networks; (ii) Bibliographic Coupling on citation networks; (iii) a standalone LLM-based approach; (iv) an LLM-enriched configuration; and (v) the proposed hybrid pipeline. LEAD achieves the best performance (F1 = 96.7%, accuracy = 95.7%) with lower computational cost than full LLM models. Bibliographic Coupling emerges as the fastest and strongest single-source method. These findings demonstrate that integrating semantic and structural signals within a selective hybrid strategy offers a robust and scalable solution to cross-database author identification. Beyond the Italian case, this work highlights the potential of hybrid LLM-based methods to improve data quality and reliability in scientometric analyses.
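
To make the Bibliographic Coupling idea concrete, here is a minimal sketch of the general technique: two publication sets are coupled when they cite the same references, and the candidate profile sharing the most references with an academic's known works is preferred. This is an illustration only, not the paper's implementation; the function names, the tie-breaking rule, and the toy identifiers are all assumptions.

```python
from typing import Dict, Optional, Set


def coupling_strength(refs_a: Set[str], refs_b: Set[str]) -> int:
    """Count the cited references shared by two reference sets."""
    return len(refs_a & refs_b)


def best_candidate(
    target_refs: Set[str],
    candidates: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the candidate profile id with the most shared references.

    Returns None when no candidate shares any reference with the target.
    Ties keep the first candidate seen (an arbitrary, assumed rule).
    """
    best_id, best_score = None, 0
    for profile_id, refs in candidates.items():
        score = coupling_strength(target_refs, refs)
        if score > best_score:
            best_id, best_score = profile_id, score
    return best_id


# Toy data with hypothetical identifiers (not taken from the paper).
known = {"ref1", "ref2", "ref3"}
candidates = {
    "profile_A": {"ref2", "ref3", "ref9"},
    "profile_B": {"ref7"},
}
print(best_candidate(known, candidates))  # profile_A
```

Because the score is a plain set intersection, it needs no model inference, which is consistent with the abstract's remark that Bibliographic Coupling is the fastest single-source method.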
Problem

Research questions and friction points this paper is trying to address.

Resolving author name ambiguity across bibliographic databases and academic registries
Linking academic career records with author profiles for accurate identification
Developing hybrid methods combining semantic and structural evidence for disambiguation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines semantic features from LLMs with structural evidence
Integrates co-authorship and citation networks for disambiguation
Uses hybrid framework for cross-source author identification
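
One way a selective hybrid strategy could keep costs below a full LLM pass is sketched below: accept the structural winner when it leads clearly, and consult the LLM only for close calls. This page does not describe LEAD's actual routing rule, so the `margin` threshold, the `llm_judge` callable, and the score dictionary are hypothetical stand-ins for illustration.

```python
from typing import Callable, Dict, Optional


def selective_match(
    structural_scores: Dict[str, float],
    llm_judge: Callable[[str], float],
    margin: float = 0.2,
) -> Optional[str]:
    """Choose a candidate profile, calling the LLM only for ambiguous cases.

    structural_scores: candidate id -> graph-based score (assumed precomputed).
    llm_judge: hypothetical callable returning a semantic score per candidate.
    margin: assumed minimum lead of the top score over the runner-up.
    """
    if not structural_scores:
        return None
    ranked = sorted(structural_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_id, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score - runner_up >= margin:
        return top_id  # structural evidence is decisive; no LLM call made
    # Close call: rescore every candidate with the (stubbed) LLM judge.
    return max(structural_scores, key=llm_judge)


# Stub standing in for a real LLM call; it records which ids were queried.
calls = []


def stub_judge(profile_id: str) -> float:
    calls.append(profile_id)
    return 1.0 if profile_id == "b" else 0.0


clear = selective_match({"a": 0.9, "b": 0.1}, stub_judge)
calls_after_clear = len(calls)  # stays 0: the LLM was skipped
close = selective_match({"a": 0.5, "b": 0.45}, stub_judge)
```

In this sketch the expensive judge runs only on the second, ambiguous case, which is the kind of saving a selective design aims for.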
G. Tuccari
Institute of Cognitive Sciences and Technologies (ISTC), National Research Council of Italy (CNR), Italy.
Lorenzo Giammei
Research Institute on Sustainable Economic Growth (IRCrES), National Research Council of Italy (CNR), Italy.
Andrea Giovanni Nuzzolese
Senior Researcher CNR-ISTC
Web Science, Semantic Web, Linked Data, Ontology Design, Knowledge Extraction
M. Mongiovì
Institute of Cognitive Sciences and Technologies (ISTC), National Research Council of Italy (CNR), Italy; Department of Mathematics and Computer Science, University of Catania, Italy.
Antonio Zinilli
Research Institute on Sustainable Economic Growth (IRCrES), National Research Council of Italy (CNR), Italy.
Francesco Poggi
Institute of Cognitive Sciences and Technologies (ISTC), National Research Council of Italy (CNR), Italy.