Arukikata Travelogue Dataset

📅 2023-05-19

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Research on travel text analysis has been hindered by the scarcity of widely available Japanese travelogue data: each researcher had to prepare their own data, which made existing studies hard to replicate and experimental results hard to compare fairly. To address this, the authors constructed the Arukikata Travelogue Dataset and released it free of charge for academic research. It comprises 4,672 Japanese domestic travelogues and 9,607 overseas travelogues (14,279 in total), with over 31 million words of Japanese text. Because any researcher can work on the same data, the dataset supports transparent, reproducible experiments and fair comparison across studies. The paper describes the dataset's academic significance, characteristics, and prospects.

📝 Abstract

We have constructed the Arukikata Travelogue Dataset and released it free of charge for academic research. This dataset is a Japanese text dataset with a total of over 31 million words, comprising 4,672 Japanese domestic travelogues and 9,607 overseas travelogues. Before we provided our dataset, widely available travelogue data for research purposes was scarce, and each researcher had to prepare their own data. This hindered the replication of existing studies and fair comparative analysis of experimental results. Our dataset enables any researcher to conduct investigations on the same data and to ensure transparency and reproducibility in research. In this paper, we describe the academic significance, characteristics, and prospects of our dataset.

Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of standardized Japanese travelogue datasets

Enables reproducible tourism research with 31-million-word corpus

Facilitates transparent comparative analysis across academic studies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large Japanese travelogue text dataset

Collected over 31 million words from travelogues

Enables transparent, reproducible travelogue research
