Hidden Entity Detection from GitHub Leveraging Large Language Models

📅 2025-01-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Implicit technical entities—particularly software repositories represented as URLs—in GitHub textual content are challenging to annotate, and conventional named entity recognition (NER) methods rely heavily on large-scale labeled datasets, which are scarce for such implicit constructs. Method: This paper proposes the first few-shot prompt-learning framework specifically designed for implicit technical entity recognition. It innovatively incorporates URL-represented online resources into the named entity taxonomy and integrates structured output constraints, cross-modal (text + URL) alignment, and repository-level contextual modeling. Contribution/Results: Evaluated on real-world GitHub corpora, our approach achieves an F1 score of 78.3%, outperforming traditional NER by 32.1%. It is the first work to empirically validate the effectiveness and generalizability of large language models (LLMs) in zero- and few-shot implicit technical entity recognition. This significantly extends the applicability boundary of LLMs to unlabeled knowledge base construction in software engineering contexts.

Technology Category

Application Category

📝 Abstract
Named entity recognition is an important task when constructing knowledge bases from unstructured data sources. Whereas entity detection methods mostly rely on extensive training data, Large Language Models (LLMs) have paved the way towards approaches that rely on zero-shot learning (ZSL) or few-shot learning (FSL) by taking advantage of the capabilities LLMs acquired during pretraining. Specifically, in very specialized scenarios where large-scale training data is not available, ZSL / FSL opens new opportunities. This paper follows this recent trend and investigates the potential of leveraging Large Language Models (LLMs) in such scenarios to automatically detect datasets and software within textual content from GitHub repositories. While existing methods focused solely on named entities, this study aims to broaden the scope by incorporating resources such as repositories and online hubs where entities are also represented by URLs. The study explores different FSL prompt learning approaches to enhance the LLMs' ability to identify dataset and software mentions within repository texts. Through analyses of LLM effectiveness and learning strategies, this paper offers insights into the potential of advanced language models for automated entity detection.
Problem

Research questions and friction points this paper is trying to address.

Automated Recognition
Limited Sample Size
Knowledge Base Construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Few-shot Learning
Entity Recognition
🔎 Similar Papers
No similar papers found.
L
Lu Gan
GESIS – Leibniz Institute for the Social Sciences, Köln, Germany; Heinrich Heine University Düsseldorf, Germany
M
Martin Blum
University of Trier, Trier, Germany
D
Danilo Dessì
GESIS – Leibniz Institute for the Social Sciences, Köln, Germany
Brigitte Mathiak
Brigitte Mathiak
GESIS - Leibnizinstitut for Social Sciences
Information Retrieval
Ralf Schenkel
Ralf Schenkel
Professor of Computer Science, Universitaet Trier
Information RetrievalDatabasesSemantic Web
Stefan Dietze
Stefan Dietze
Full Professor (Heinrich-Heine-University Düsseldorf) & Scientific Director (KTS, GESIS)
Knowledge GraphsInformation RetrievalWeb ScienceNLP