LEAD: LLM-enhanced Engine for Author Disambiguation

📅 2025-11-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses author name disambiguation (AND) across two heterogeneous sources: CercaUniversità, the official registry of Italian academics, and Scopus. It proposes LEAD, a hybrid framework that combines semantic features extracted by large language models (LLMs) with structural signals from co-authorship and citation networks, including Label Spreading and Bibliographic Coupling. On a gold standard of 606 ambiguous cases, LEAD achieves an F1-score of 96.7% and an accuracy of 95.7%, the best result among the five methods compared, at lower computational cost than full LLM models. Bibliographic Coupling is the fastest and strongest single-source method. The key contribution is a selective hybrid strategy that integrates LLM-driven semantic evidence with graph-structured evidence, offering a robust and scalable approach to cross-database author identification and to more reliable scientometric analysis.

📝 Abstract

Author Name Disambiguation (AND) is a long-standing challenge in bibliometrics and scientometrics, as name ambiguity undermines the accuracy of bibliographic databases and the reliability of research evaluation. This study addresses the problem of cross-source disambiguation by linking academic career records from CercaUniversità, the official registry of Italian academics, with author profiles in Scopus. We introduce LEAD (LLM-enhanced Engine for Author Disambiguation), a novel hybrid framework that combines semantic features extracted through Large Language Models (LLMs) with structural evidence derived from co-authorship and citation networks. Using a gold standard of 606 ambiguous cases, we compare five methods: (i) Label Spreading on co-authorship networks; (ii) Bibliographic Coupling on citation networks; (iii) a standalone LLM-based approach; (iv) an LLM-enriched configuration; and (v) the proposed hybrid pipeline. LEAD achieves the best performance (F1 = 96.7%, accuracy = 95.7%) with lower computational cost than full LLM models. Bibliographic Coupling emerges as the fastest and strongest single-source method. These findings demonstrate that integrating semantic and structural signals within a selective hybrid strategy offers a robust and scalable solution to cross-database author identification. Beyond the Italian case, this work highlights the potential of hybrid LLM-based methods to improve data quality and reliability in scientometric analyses.
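To make the Bibliographic Coupling idea concrete, here is a minimal sketch of how candidate author profiles could be ranked by the number of cited references they share with a set of publications already known to belong to the registry record. It illustrates the general technique only and is not the authors' implementation; the function names (`coupling_strength`, `rank_candidates`), the reference identifiers, and the data layout are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def coupling_strength(refs_a: Set[str], refs_b: Set[str]) -> int:
    """Bibliographic coupling strength: count of cited references shared by both sets."""
    return len(refs_a & refs_b)


def rank_candidates(
    seed_refs: Set[str], candidates: Dict[str, Set[str]]
) -> List[Tuple[str, int]]:
    """Rank hypothetical candidate profiles by coupling strength with the seed references.

    Ties are broken alphabetically by profile id so the output is deterministic.
    """
    scored = [
        (profile_id, coupling_strength(seed_refs, refs))
        for profile_id, refs in candidates.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical data: references cited by the seed publications and by each candidate profile.
seed_refs = {"r1", "r2", "r3"}
candidates = {
    "profile_A": {"r1", "r2", "x9"},
    "profile_B": {"r3"},
    "profile_C": set(),
}
ranking = rank_candidates(seed_refs, candidates)
print(ranking)
```

In this toy setup the profile sharing the most references with the seed set ranks first. A real pipeline would also need a rule for ties and for candidates with no shared references, which the abstract does not specify.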

Problem

Research questions and friction points this paper is trying to address.

Resolving author name ambiguity across bibliographic databases and academic registries

Linking academic career records with author profiles for accurate identification

Developing hybrid methods combining semantic and structural evidence for disambiguation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines semantic features from LLMs with structural evidence

Integrates co-authorship and citation networks for disambiguation

Uses hybrid framework for cross-source author identification
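One way to read the "selective hybrid strategy" is as a cascade: accept the structural decision when it is clear-cut, and call the more expensive LLM-based scorer only for ambiguous cases. The sketch below shows that reading under stated assumptions. The source does not describe the actual selection rule, so the `margin` threshold, the `select_profile` function, and the `llm_scorer` callback are hypothetical and stand in for whatever LEAD really does.

```python
from typing import Callable, Dict, Optional


def select_profile(
    structural_scores: Dict[str, float],
    llm_scorer: Callable[[str], float],
    margin: float = 0.2,
) -> Optional[str]:
    """Pick a candidate profile, calling the LLM scorer only when structure is ambiguous.

    structural_scores: hypothetical per-candidate scores from network evidence.
    llm_scorer: hypothetical callback returning a semantic match score for a candidate id.
    margin: assumed minimum lead of the top candidate needed to skip the LLM step.
    """
    if not structural_scores:
        return None

    ranked = sorted(structural_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0

    # Clear structural winner: no LLM call needed.
    if best_score > 0 and best_score - runner_up >= margin:
        return best_id

    # Ambiguous case: fall back to the semantic scorer over all candidates.
    return max(sorted(structural_scores), key=llm_scorer)


# Hypothetical usage with a stub in place of a real LLM call.
llm_calls = []


def stub_llm_scorer(profile_id: str) -> float:
    llm_calls.append(profile_id)
    return {"profile_A": 0.2, "profile_B": 0.8}[profile_id]


clear_case = select_profile({"profile_A": 0.9, "profile_B": 0.1}, stub_llm_scorer)
calls_after_clear = len(llm_calls)
ambiguous_case = select_profile({"profile_A": 0.5, "profile_B": 0.45}, stub_llm_scorer)
```

The design choice illustrated here is cost control: the LLM step runs only on the subset of cases where network evidence cannot separate the candidates, which is consistent with the reported lower computational cost relative to full LLM models.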

🔎 Similar Papers

Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges