🤖 AI Summary
This paper addresses author name disambiguation (AND) across two heterogeneous academic databases, CercaUniversità and Scopus. It proposes LEAD, a lightweight hybrid framework that combines semantic features extracted by large language models (LLMs) with structural signals from co-authorship and citation networks, including bibliographic coupling, co-occurrence analysis, and label spreading. Evaluated on a gold standard of 606 real-world ambiguous cases, LEAD achieves an F1-score of 96.7% and an accuracy of 95.7%, the best result among the five methods compared, at lower computational cost than full LLM models. Its key contribution is a selective hybrid strategy that integrates LLM-driven semantic understanding with graph-structured evidence, sustaining high performance while reducing computational overhead. The results suggest that hybrid LLM-based methods can help integrate multi-source scholarly data and support more reliable bibliometric assessment.
📝 Abstract
Author Name Disambiguation (AND) is a long-standing challenge in bibliometrics and scientometrics, as name ambiguity undermines the accuracy of bibliographic databases and the reliability of research evaluation. This study addresses the problem of cross-source disambiguation by linking academic career records from CercaUniversità, the official registry of Italian academics, with author profiles in Scopus. We introduce LEAD (LLM-enhanced Engine for Author Disambiguation), a novel hybrid framework that combines semantic features extracted through Large Language Models (LLMs) with structural evidence derived from co-authorship and citation networks. Using a gold standard of 606 ambiguous cases, we compare five methods: (i) Label Spreading on co-authorship networks; (ii) Bibliographic Coupling on citation networks; (iii) a standalone LLM-based approach; (iv) an LLM-enriched configuration; and (v) the proposed hybrid pipeline. LEAD achieves the best performance (F1 = 96.7%, accuracy = 95.7%) with lower computational cost than full LLM models. Bibliographic Coupling emerges as the fastest and strongest single-source method. These findings demonstrate that integrating semantic and structural signals within a selective hybrid strategy offers a robust and scalable solution to cross-database author identification. Beyond the Italian case, this work highlights the potential of hybrid LLM-based methods to improve data quality and reliability in scientometric analyses.
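To make the bibliographic coupling idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each candidate Scopus profile can be reduced to the set of references its publications cite, and that a set of references already attributed to the target academic is available. The function names, profile identifiers, and reference identifiers are all hypothetical. Candidates are ranked by coupling strength, i.e. the number of cited references they share with the known set.

```python
from typing import Dict, List, Set, Tuple


def coupling_strength(refs_a: Set[str], refs_b: Set[str]) -> int:
    """Bibliographic coupling strength: count of references cited by both sets."""
    return len(refs_a & refs_b)


def rank_candidates(
    known_refs: Set[str],
    candidates: Dict[str, Set[str]],
) -> List[Tuple[str, int]]:
    """Rank candidate profiles by shared references, strongest first.

    Ties are broken by profile id so the ordering is deterministic.
    """
    scores = [
        (profile_id, coupling_strength(known_refs, refs))
        for profile_id, refs in candidates.items()
    ]
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Hypothetical data: references cited in works already attributed to the academic.
known = {"r1", "r2", "r3", "r4"}

# Hypothetical candidate profiles sharing the same name, with their cited references.
candidates = {
    "profile_A": {"r1", "r2", "r3", "r9"},
    "profile_B": {"r4", "r7"},
    "profile_C": {"r8"},
}

ranking = rank_candidates(known, candidates)
print(ranking)
```

In this toy example `profile_A` ranks first because it shares three references with the known set. A selective hybrid pipeline of the kind the abstract describes could plausibly accept such clear-cut cases directly and reserve the more expensive LLM step for candidates whose scores are close or zero; that routing rule is an assumption here, not a detail stated in the abstract.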