Comparing how Large Language Models perform against keyword-based searches for social science research data discovery

📅 2026-01-27
🤖 AI Summary
This study addresses the limitations of traditional keyword-based search in social science data discovery, which struggles with natural-language expressions, spelling errors, geographic context, and complex queries. The authors develop and evaluate a large language model (LLM)-based semantic search system and compare it systematically against the UK’s Consumer Data Research Centre (CDRC) keyword search tool using real user queries. The analysis covers 131 high-frequency queries and combines cosine similarity of BERT embeddings, the Jaccard index, exact dataset overlap, and qualitative inspection. The results indicate that semantic search performs particularly well for place names, misspellings, obscure topics, and complex queries, and that it returns a larger number of results that are overwhelmingly semantically similar to the keyword results. The semantic search does not capture every keyword-based result, so the two approaches are complementary and could be combined to improve data discovery.

📝 Abstract
This paper evaluates the performance of a large language model (LLM)-based semantic search tool relative to a traditional keyword-based search for data discovery. Using real-world search behaviour, we compare outputs from a bespoke semantic search system applied to UKRI data services with the Consumer Data Research Centre (CDRC) keyword search. Analysis is based on 131 of the most frequently used search terms extracted from CDRC search logs between December 2023 and October 2024. We assess differences in the volume, overlap, ranking, and relevance of returned datasets using descriptive statistics, qualitative inspection, and quantitative similarity measures, including exact dataset overlap, Jaccard similarity, and cosine similarity derived from BERT embeddings. Results show that the semantic search consistently returns a larger number of results than the keyword search and performs particularly well for place-based, misspelled, obscure, or complex queries. While the semantic search does not capture all keyword-based results, the datasets it returns are overwhelmingly semantically similar, with high cosine similarity scores despite lower exact overlap. Rankings of the most relevant results differ substantially between tools, reflecting contrasting prioritisation strategies. Case studies demonstrate that the LLM-based tool is robust to spelling errors, interprets geographic and contextual relevance effectively, and supports natural-language queries that keyword search fails to resolve. Overall, the findings suggest that LLM-driven semantic search offers a substantial improvement for data discovery, complementing rather than fully replacing traditional keyword-based approaches.
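The three quantitative measures named in the abstract (exact dataset overlap, Jaccard similarity, and cosine similarity) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration: the dataset identifiers are hypothetical, the three-dimensional vectors are hand-written placeholders standing in for BERT embeddings, and comparing the mean vector of each result set is only one possible way to aggregate; the abstract does not state how the authors computed these scores.

```python
import math


def exact_overlap(results_a, results_b):
    """Count the dataset IDs returned by both tools."""
    return len(set(results_a) & set(results_b))


def jaccard(results_a, results_b):
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both are empty)."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical result lists for one query (IDs are made up).
keyword_results = ["ds_a", "ds_b", "ds_c"]
semantic_results = ["ds_b", "ds_d", "ds_e", "ds_f"]

# Placeholder vectors; a real pipeline would embed dataset descriptions
# with a BERT model instead of using hand-written numbers.
toy_embeddings = {
    "ds_a": [0.9, 0.1, 0.0],
    "ds_b": [0.8, 0.2, 0.1],
    "ds_c": [0.7, 0.3, 0.0],
    "ds_d": [0.8, 0.1, 0.2],
    "ds_e": [0.9, 0.2, 0.1],
    "ds_f": [0.6, 0.3, 0.1],
}

overlap = exact_overlap(keyword_results, semantic_results)
jaccard_score = jaccard(keyword_results, semantic_results)
cosine_score = cosine(
    mean_vector([toy_embeddings[d] for d in keyword_results]),
    mean_vector([toy_embeddings[d] for d in semantic_results]),
)

print(f"overlap={overlap} jaccard={jaccard_score:.3f} cosine={cosine_score:.3f}")
```

In this toy case the two result lists share only one dataset, so the exact overlap and Jaccard score are low, while the cosine score between the placeholder vectors is high. That is the same qualitative pattern the abstract reports: low exact overlap alongside high semantic similarity.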
Problem

Research questions and friction points this paper is trying to address.

large language models
semantic search
keyword-based search
data discovery
social science research
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models
semantic search
data discovery
keyword search
BERT embeddings
Mark Green
Department of Geography and Planning, University of Liverpool, Liverpool, UK
Maura Halstead
Department of Computer Science, School of Engineering, University of Manchester, Manchester, UK
Caroline Jay
Professor of Computer Science, University of Manchester, UK
human-computer interaction, software sustainability, research software engineering
Richard Kingston
Prof. of Urban Planning and GISc, University of Manchester
Urban Planning, GISc, Public Participation GIS, Planning Support Systems, Smart Cities
Alex D. Singleton
Department of Geography and Planning, University of Liverpool, Liverpool, UK
David Topping
Department of Earth and Environmental Sciences, University of Manchester, Manchester, UK