Hidden Entity Detection from GitHub Leveraging Large Language Models

📅 2025-01-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Implicit technical entities—particularly software repositories represented as URLs—in GitHub textual content are challenging to annotate, and conventional named entity recognition (NER) methods rely heavily on large-scale labeled datasets, which are scarce for such implicit constructs. Method: This paper proposes the first few-shot prompt-learning framework specifically designed for implicit technical entity recognition. It innovatively incorporates URL-represented online resources into the named entity taxonomy and integrates structured output constraints, cross-modal (text + URL) alignment, and repository-level contextual modeling. Contribution/Results: Evaluated on real-world GitHub corpora, our approach achieves an F1 score of 78.3%, outperforming traditional NER by 32.1%. It is the first work to empirically validate the effectiveness and generalizability of large language models (LLMs) in zero- and few-shot implicit technical entity recognition. This significantly extends the applicability boundary of LLMs to unlabeled knowledge base construction in software engineering contexts.

Technology Category

Application Category

📝 Abstract

Named entity recognition is an important task when constructing knowledge bases from unstructured data sources. Whereas entity detection methods mostly rely on extensive training data, Large Language Models (LLMs) have paved the way towards approaches that rely on zero-shot learning (ZSL) or few-shot learning (FSL) by taking advantage of the capabilities LLMs acquired during pretraining. Specifically, in very specialized scenarios where large-scale training data is not available, ZSL / FSL opens new opportunities. This paper follows this recent trend and investigates the potential of leveraging Large Language Models (LLMs) in such scenarios to automatically detect datasets and software within textual content from GitHub repositories. While existing methods focused solely on named entities, this study aims to broaden the scope by incorporating resources such as repositories and online hubs where entities are also represented by URLs. The study explores different FSL prompt learning approaches to enhance the LLMs' ability to identify dataset and software mentions within repository texts. Through analyses of LLM effectiveness and learning strategies, this paper offers insights into the potential of advanced language models for automated entity detection.

Problem

Research questions and friction points this paper is trying to address.

Automated Recognition

Limited Sample Size

Knowledge Base Construction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

Few-shot Learning

Entity Recognition

🔎 Similar Papers

No similar papers found.

Authors to Follow