An Automated Grey Literature Extraction Tool for Software Engineering

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the heterogeneity and irreproducibility that hamper the collection of software engineering grey literature, this paper proposes GLiSE, the first prompt-driven, cross-platform grey literature retrieval framework. GLiSE automatically generates platform-specific queries from research topics, integrates the GitHub, Stack Overflow, and Google APIs for multi-source harvesting, and uses Sentence-BERT embeddings with cosine similarity to filter and rank results by semantic relevance. Its key contributions are: (1) the first configurable, traceable, and fully open-source retrieval pipeline; (2) the release of the first software engineering grey literature dataset, comprising over 12K semantically annotated samples; and (3) an empirical evaluation demonstrating an F1-score of 0.82 in multi-source scenarios, a significant improvement in retrieval accuracy and reproducibility over prior approaches.
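
The filtering and ranking step described above can be illustrated with a minimal sketch. It is not GLiSE's implementation: a toy bag-of-words vector stands in for a Sentence-BERT embedding so the example runs with the standard library only, and the function names and the threshold value are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption: a real pipeline would call a Sentence-BERT model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_and_rank(intent: str, results: list[str],
                    threshold: float = 0.2) -> list[tuple[float, str]]:
    """Keep results whose similarity to the search intent meets the
    (hypothetical) threshold, ordered from most to least similar."""
    intent_vec = embed(intent)
    scored = [(cosine(intent_vec, embed(r)), r) for r in results]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


ranked = filter_and_rank(
    "flaky tests in continuous integration",
    [
        "Best pizza recipes",
        "Continuous integration setup guide",
        "How to debug flaky tests in continuous integration pipelines",
    ],
)
for score, title in ranked:
    print(f"{score:.2f}  {title}")
```

Swapping `embed` for a real sentence-embedding model would leave the filtering and sorting logic unchanged, since only the vector representation differs.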

📝 Abstract
Grey literature is essential to software engineering research because it captures practices and decisions that rarely appear in academic venues. However, collecting and assessing it at scale remains difficult: its sources, formats, and APIs are heterogeneous, which impedes reproducible, large-scale synthesis. To address this issue, we present GLiSE, a prompt-driven tool that turns a research topic prompt into platform-specific queries, gathers results from common software engineering web sources (GitHub, Stack Overflow) and Google Search, and uses embedding-based semantic classifiers to filter and rank results by relevance. GLiSE is designed for reproducibility: all settings are configuration-based, and every generated query is accessible. In this paper, we (i) present the GLiSE tool, (ii) provide a curated dataset of software engineering grey literature search results classified by semantic relevance to their originating search intent, and (iii) conduct an empirical study on the usability of our tool.
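
As a rough illustration of what "platform-specific queries" can look like, the sketch below maps one topic string onto request URLs for the three sources named in the abstract. It is only an assumption about the shape of such queries: the abstract does not say which endpoints or parameters GLiSE uses, the topic is passed through verbatim rather than rewritten by a prompt-driven step, and `build_queries`, `YOUR_KEY`, and `YOUR_CX` are placeholders. No network request is made.

```python
from urllib.parse import urlencode


def build_queries(topic: str,
                  google_key: str = "YOUR_KEY",
                  google_cx: str = "YOUR_CX") -> dict[str, str]:
    """Return one request URL per source for the same topic.

    Assumption: the endpoints are the public search endpoints of each
    platform; which ones GLiSE actually calls is not stated in the abstract.
    """
    return {
        # GitHub REST API: repository search
        "github": "https://api.github.com/search/repositories?"
        + urlencode({"q": topic, "per_page": 30}),
        # Stack Exchange API: advanced question search on Stack Overflow
        "stackoverflow": "https://api.stackexchange.com/2.3/search/advanced?"
        + urlencode({"q": topic, "site": "stackoverflow",
                     "sort": "relevance", "order": "desc"}),
        # Google Custom Search JSON API (needs an API key and engine ID)
        "google": "https://www.googleapis.com/customsearch/v1?"
        + urlencode({"key": google_key, "cx": google_cx, "q": topic}),
    }


queries = build_queries("flaky tests")
for source, url in queries.items():
    print(source, url)
```

Keeping the generated URLs in a plain dictionary makes each query easy to inspect and store, which fits the abstract's emphasis on every generated query being accessible.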
Problem

Research questions and friction points this paper is trying to address.

Automates collection of software engineering grey literature
Filters and ranks results by relevance using semantic classifiers
Enables reproducible large-scale synthesis from heterogeneous sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-driven tool for generating platform-specific queries
Embedding-based semantic classifiers for filtering and ranking results
Configuration-based design ensuring reproducibility and query accessibility
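
One way to picture a configuration-based, traceable run is sketched below: all settings live in a single immutable object, and the settings plus the generated queries are written to a manifest with a content hash. The field names, default values, and the `run_manifest` helper are hypothetical and are not taken from GLiSE.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical settings for one retrieval run (names are illustrative)."""
    topic: str
    sources: tuple[str, ...] = ("github", "stackoverflow", "google")
    similarity_threshold: float = 0.5
    max_results_per_source: int = 50


def run_manifest(config: RunConfig, queries: dict[str, str]) -> dict:
    """Bundle the settings and generated queries with a SHA-256 fingerprint.

    Identical settings and queries always yield the same fingerprint, so two
    runs can be compared by that value alone.
    """
    payload = {"config": asdict(config), "queries": queries}
    canonical = json.dumps(payload, sort_keys=True)
    fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**payload, "fingerprint": fingerprint}


config = RunConfig(topic="flaky tests")
manifest = run_manifest(config, {"github": "flaky tests in:readme"})
print(json.dumps(manifest, indent=2))
```

Storing such a manifest next to the harvested results would let a later reader see exactly which settings and queries produced them.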