Large Language Models for Summarizing Czech Historical Documents and Beyond

📅 2025-08-14
🤖 AI Summary
Historical document summarization for low-resource Czech has long been hindered by linguistic complexity and scarcity of annotated data. To address this, we introduce *Posel od Čerchova*, the first annotated summarization dataset for historical Czech, and achieve state-of-the-art performance on the modern Czech benchmark SumeCzech. Our approach integrates transfer learning with multilingual pretraining, adapting large language models—including Mistral and mT5—to jointly handle both modern and historical Czech texts, enabling the first unified cross-era summarization framework for the language. Key contributions are: (1) releasing the first open-source summarization dataset for historical Czech; (2) establishing the current strongest baseline for Czech summarization; and (3) empirically validating the efficacy of large language models in low-resource historical text NLP tasks, thereby providing a reusable methodological paradigm for analogous under-resourced languages.

📝 Abstract
Text summarization is the task of condensing a longer text into a concise version while retaining its essential meaning and key information. While summarization has been extensively explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexity and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. We therefore employ these models for Czech summarization, resulting in two key contributions: (1) new state-of-the-art results on the modern Czech summarization dataset SumeCzech, and (2) a novel dataset called Posel od Čerchova for summarization of historical Czech documents, together with baseline results. Together, these contributions offer great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.

Problem

Research questions and friction points this paper is trying to address.

Exploring Czech text summarization for historical documents
Addressing scarcity of annotated datasets for Czech summarization
Applying large language models to improve Czech summarization performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using Mistral and mT5 for Czech summarization
Achieving state-of-the-art results on the SumeCzech dataset
Introducing the Posel od Čerchova dataset of historical Czech documents

Authors

Václav Tran
Department of Computer Science and Engineering, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic; NTIS - New Technologies for the Information Society, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic
Jakub Šmíd
Department of Computer Science and Engineering, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic; NTIS - New Technologies for the Information Society, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic
Jiří Martínek
University of West Bohemia
Ladislav Lenc
University of West Bohemia
Pavel Král
Department of Computer Science and Engineering, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic; NTIS - New Technologies for the Information Society, University of West Bohemia in Pilsen, Univerzitní, Pilsen, Czech Republic