🤖 AI Summary
Problem: Geographic databases commonly lack historical, cultural, and socio-geographic origin information for place names, and manual verification is labor-intensive and inefficient.
Method: This paper proposes a retrieval-augmented generation (RAG) framework specifically designed for place-name origin identification. It constructs a DBpedia-based knowledge base and introduces a novel spatially aware subgraph extraction and re-ranking mechanism. The framework integrates fine-tuned ColBERTv2 for semantic retrieval and Llama2 for generative reasoning, explicitly modeling spatial-semantic relationships among geographic entities.
Contribution/Results: Experiments demonstrate substantial improvements in both accuracy and interpretability of origin identification. Critically, the study reveals, for the first time, a key bottleneck: large language models under-use the spatial semantics contained in text when reasoning about geography. Moreover, it establishes a scalable technical pathway for automated geographic knowledge completion.
📝 Abstract
Who is the "Batman" behind "Batman Street" in Melbourne? Understanding the historical, cultural, and societal narratives behind place names can reveal the rich context that has shaped a community. Although place names serve as essential spatial references in gazetteers, gazetteer entries often lack information about where those names come from. Enriching place names in today's gazetteers is a time-consuming, manual process that requires extensive exploration of a vast archive of documents and text sources. Recent advances in natural language processing and language models (LMs) promise to automate much of the identification of place name origins, owing to their ability to exploit the semantics of the stored documents. This chapter presents a retrieval-augmented generation pipeline designed to search for place name origins over a broad knowledge base, DBpedia. Given a spatial query, our approach first extracts sub-graphs that may contain knowledge relevant to the query; it then ranks the extracted sub-graphs and generates the final answer to the query using fine-tuned LM-based models (i.e., ColBERTv2 and Llama2). Our results highlight the key challenges facing automated retrieval of place name origins, especially the tendency of language models to under-use the spatial information contained in texts as a discriminating factor. Our approach also frames the wider implications for geographic information retrieval using retrieval-augmented generation.
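The extract-then-rank idea described above can be illustrated with a small sketch. It is not the chapter's implementation: the in-memory list stands in for DBpedia, plain token overlap stands in for the fine-tuned ColBERTv2 retriever, the final Llama2 generation step is omitted, and every name here (`Candidate`, `extract_candidates`, `rank`, `spatial_weight`, `scale_km`, the toy entries and their identifiers) is a hypothetical introduced for illustration. The one point it aims to show is how distance to the queried place could serve as a discriminating signal alongside text similarity.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    uri: str                 # illustrative DBpedia-style identifier (assumption)
    abstract: str            # text attached to the entity
    lat: Optional[float]     # None when the entity has no location
    lon: Optional[float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def extract_candidates(name: str, kb: List[Candidate]) -> List[Candidate]:
    """Toy stand-in for sub-graph extraction: keep entities that mention the name."""
    token = name.lower()
    return [c for c in kb if token in c.abstract.lower() or token in c.uri.lower()]


def lexical_score(query: str, text: str) -> float:
    """Fraction of query tokens found in the text (stand-in for a neural retriever)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rank(query: str, q_lat: float, q_lon: float, candidates: List[Candidate],
         spatial_weight: float = 0.5, scale_km: float = 100.0) -> List[Candidate]:
    """Order candidates by a weighted mix of text overlap and spatial proximity."""
    scored = []
    for c in candidates:
        if c.lat is None or c.lon is None:
            spatial = 0.0  # entities without coordinates get no spatial credit
        else:
            spatial = math.exp(-haversine_km(q_lat, q_lon, c.lat, c.lon) / scale_km)
        text = lexical_score(query, c.abstract)
        scored.append(((1 - spatial_weight) * text + spatial_weight * spatial, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


# Toy knowledge base; the entries and coordinates are illustrative only.
kb = [
    Candidate("dbr:Batman_(character)",
              "Batman is a fictional superhero appearing in comic books", None, None),
    Candidate("dbr:Batman_(city)",
              "Batman is a city in southeastern Turkey", 37.88, 41.13),
    Candidate("dbr:John_Batman",
              "John Batman was a grazier and explorer and a founder of the "
              "settlement that became Melbourne", -37.81, 144.96),
]

candidates = extract_candidates("Batman", kb)
ranked = rank("Batman Street Melbourne", -37.81, 144.95, candidates)
print([c.uri for c in ranked])
```

Setting `spatial_weight` to 0 reduces the ranking to text overlap alone, which is a simple way to probe how much the spatial term contributes on a given set of candidates.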