🤖 AI Summary
Existing research on travel text analysis is hindered by the scarcity, irreproducibility, and incomparability of Japanese travelogue data. To address this, the authors introduce the Arukikata Travelogue Dataset, a large-scale Japanese travelogue dataset released free of charge for academic research, comprising 4,672 domestic and 9,607 overseas travelogues and totaling over 31 million words. The dataset is constructed via web crawling, manual verification, and standardized preprocessing to ensure high quality, consistency, and reusability. It fills a gap in Japanese NLP for tourism by improving experimental reproducibility, cross-study comparability, and methodological transparency. It supports diverse NLP tasks, including tokenization, named entity recognition, and stylistic analysis. As a publicly available benchmark resource for Japanese travel text within the NLP community, it advances multilingual travel understanding, geocultural modeling, and generative AI research.
📝 Abstract
We have constructed the Arukikata Travelogue Dataset and released it free of charge for academic research. It is a Japanese text dataset of more than 31 million words, comprising 4,672 Japanese domestic travelogues and 9,607 overseas travelogues. Before its release, widely available travelogue data for research was scarce, and each researcher had to prepare their own data, which hindered the replication of existing studies and the fair comparison of experimental results. Our dataset allows any researcher to conduct investigations on the same data, ensuring transparency and reproducibility in research. In this paper, we describe the academic significance, characteristics, and prospects of our dataset.