AI Summary
Historical document summarization for low-resource Czech has long been hindered by linguistic complexity and scarcity of annotated data. To address this, we introduce *Posel od Čerchova*, the first annotated summarization dataset for historical Czech, and achieve state-of-the-art performance on the modern Czech benchmark SumeCzech. Our approach integrates transfer learning with multilingual pretraining, adapting large language models, including Mistral and mT5, to jointly handle both modern and historical Czech texts, enabling the first unified cross-era summarization framework for the language. Key contributions are: (1) releasing the first open-source summarization dataset for historical Czech; (2) establishing the current strongest baseline for Czech summarization; and (3) empirically validating the efficacy of large language models in low-resource historical text NLP tasks, thereby providing a reusable methodological paradigm for analogous under-resourced languages.
Abstract
Text summarization is the task of condensing a longer body of text into a concise version while retaining its essential meaning and key information. While summarization has been extensively explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents, together with baseline results. Together, these contributions offer great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.