Bridging the Language Gap in Scholarly Data I: Enhancing Author Disambiguation Algorithms for Chinese Names

📅 2026-04-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited effectiveness of existing author disambiguation methods on Chinese names, particularly in their romanized (Pinyin) form. The authors propose a rule-based disambiguation framework that handles both Chinese characters and Pinyin, integrating co-authorship networks, citation networks, institutional affiliations, and content similarity. Applied to 65,241 physics papers from CNKI and evaluated on a human-annotated sample of 80 name pairs, the method achieves F1 scores of 0.88 for Pinyin names and 0.89 for character-based names, outperforming two baseline approaches. The gain comes primarily from higher recall, and the comparable performance across both writing systems indicates that the approach is script-agnostic.

📝 Abstract

Disambiguating scholars with identical names is essential for accurate authorship assignment and robust large-scale scientometric research. Existing methods are often designed for Latin-script metadata and perform poorly on Chinese names. In international publications, Chinese names typically appear as romanized Pinyin, which is highly ambiguous because a single Pinyin string can map to multiple distinct characters. Chinese characters, in contrast, reduce but do not eliminate this ambiguity, and are rarely available in international records. To address both challenges, we propose a rule-based disambiguation framework that integrates co-authorship networks, citation networks, author affiliations, and content similarity. We apply this framework to 65,241 physics papers from the China National Knowledge Infrastructure (CNKI), spanning more than 70 years. On a human-annotated sample of 80 name pairs, our method achieves F1 scores of 0.88 for Pinyin names and 0.89 for character-based names, outperforming two baseline approaches, with improvements driven primarily by higher recall. The comparable performance across both writing systems shows that our approach is script-agnostic, enabling reliable large-scale scientometric analyses.
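
To make the idea of a rule-based framework concrete, the following is a minimal Python sketch of how the four evidence types named in the abstract (co-authorship, citations, affiliations, content similarity) might be combined to decide whether two same-name records refer to one person. It is not the authors' implementation: the `AuthorRecord` fields, the `same_author` function, the any-rule-matches logic, the Jaccard keyword measure, and the 0.5 threshold are all assumptions chosen for illustration, and the sample names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One author mention on one paper (hypothetical schema)."""
    name: str                       # Pinyin or character form of the name
    paper_id: str
    coauthors: set = field(default_factory=set)
    cited_papers: set = field(default_factory=set)
    affiliation: str = ""
    keywords: set = field(default_factory=set)


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_author(a: AuthorRecord, b: AuthorRecord,
                content_threshold: float = 0.5) -> bool:
    """Return True if any rule links two records with the same name.

    The rules and the threshold are illustrative assumptions,
    not the rules reported in the paper.
    """
    if a.name != b.name:
        return False
    # Rule 1 (co-authorship): the two papers share at least one co-author.
    if a.coauthors & b.coauthors:
        return True
    # Rule 2 (citations): one paper cites the other.
    if b.paper_id in a.cited_papers or a.paper_id in b.cited_papers:
        return True
    # Rule 3 (affiliation): both records carry the same non-empty affiliation.
    aff_a = a.affiliation.strip().casefold()
    aff_b = b.affiliation.strip().casefold()
    if aff_a and aff_a == aff_b:
        return True
    # Rule 4 (content): keyword overlap reaches the assumed threshold.
    return jaccard(a.keywords, b.keywords) >= content_threshold


# Invented toy records that share the Pinyin name "Wang Wei".
r1 = AuthorRecord("Wang Wei", "p1", coauthors={"Li Na"},
                  keywords={"superconductivity", "thin films"})
r2 = AuthorRecord("Wang Wei", "p2", coauthors={"Li Na", "Zhang Min"},
                  keywords={"plasma"})
r3 = AuthorRecord("Wang Wei", "p3", coauthors={"Chen Jie"},
                  keywords={"cosmology"})

print(same_author(r1, r2))  # True: shared co-author "Li Na"
print(same_author(r1, r3))  # False: no rule fires
```

Because the function compares only name strings and metadata sets, the same code runs unchanged on Pinyin or character-based names, which is one way to read the "script-agnostic" claim. Letting any single rule establish a match favors recall, so a real system would likely need stricter conditions or conflict checks to protect precision.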

Problem

Research questions and friction points this paper is trying to address.

author disambiguation

Chinese names

Pinyin ambiguity

scientometrics

name ambiguity
