DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling

📅 2026-02-12
🤖 AI Summary
This study addresses the limited language coverage of current semantic change research, which has largely been confined to about a dozen high-resource languages because multilingual diachronic corpora are scarce. To overcome this constraint, the authors construct DHPLT, a large-scale, open diachronic corpus spanning 41 languages across three time periods (2011–2015, 2020–2021, and 2024–present), using the web crawl timestamps of documents in the HPLT web-crawled dataset as an approximate signal of creation time. Each language-period combination contains one million documents. The resource further includes precomputed token- and type-level embeddings and lexical substitutions for a chosen set of target words. DHPLT substantially expands the linguistic scope of semantic change studies, particularly by filling critical gaps for low-resource languages, and enables new experimental paradigms for cross-lingual diachronic semantic modeling.
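
As an illustration of the period split described above, the following minimal sketch assigns a document to one of the three periods from its crawl timestamp. The function name `assign_period`, the ISO 8601 timestamp format, and the year-level boundary handling are assumptions made for this example; the summary does not say how DHPLT selects or filters documents.

```python
from datetime import datetime
from typing import Optional

# Period labels come from the summary. Treating the boundaries as whole
# calendar years is an assumption made for this sketch.
PERIODS = [
    ("2011-2015", 2011, 2015),
    ("2020-2021", 2020, 2021),
    ("2024-present", 2024, None),  # open-ended upper bound
]


def assign_period(crawl_timestamp: str) -> Optional[str]:
    """Map an ISO 8601 crawl timestamp to a period label.

    Returns None when the timestamp falls in a gap between periods
    (for example 2016-2019), so such documents can be skipped.
    """
    year = datetime.fromisoformat(crawl_timestamp).year
    for label, start, end in PERIODS:
        if year >= start and (end is None or year <= end):
            return label
    return None
```

Documents whose timestamps fall between the periods map to `None`, which keeps the three time slices clearly separated.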

πŸ“ Abstract
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving other researchers free to select their own target words from the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages) and opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
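
To make the intended use of the token embeddings concrete, here is a minimal sketch of one common baseline for scoring semantic change: the cosine distance between a target word's average token embedding in two periods. This is not presented as the authors' method. The function names and the plain-list input format are assumptions made for this example, and loading the DHPLT embedding files is left out because their format is not described here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Average a non-empty set of equal-length embeddings, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def prototype_change(period_a: Sequence[Vector], period_b: Sequence[Vector]) -> float:
    """Score change as the distance between per-period average embeddings.

    Each argument holds the token embeddings of one target word in one
    period (hypothetical input format).
    """
    return cosine_distance(mean_vector(period_a), mean_vector(period_b))
```

A score near 0 suggests that the word is used in similar contexts in both periods, while larger values suggest a shift. Comparing scores across languages would need additional care that this sketch does not address.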

Problem

Research questions and friction points this paper is trying to address.

semantic change
diachronic corpora
multilingual
word representations
lexical semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

diachronic corpora
multilingual semantic change
word embeddings
web-crawled data
temporal NLP