Large Language Models for Summarizing Czech Historical Documents and Beyond

📅 2025-08-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Summarization of historical Czech documents has long been hindered by linguistic complexity and a scarcity of annotated data. To address this, we introduce *Posel od Čerchova*, the first annotated summarization dataset for historical Czech, and achieve state-of-the-art performance on the modern Czech benchmark SumeCzech. Our approach integrates transfer learning with multilingual pretraining, adapting large language models, including Mistral and mT5, to handle both modern and historical Czech texts, enabling the first unified cross-era summarization framework for the language. Key contributions are: (1) releasing the first open-source summarization dataset for historical Czech; (2) establishing the strongest current baseline for Czech summarization; and (3) empirically validating the efficacy of large language models on low-resource historical text NLP tasks, which offers a reusable approach for similar under-resourced languages.

📝 Abstract

Text summarization is the task of condensing a longer text into a concise version while retaining its essential meaning and key information. While summarization has been explored extensively in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results. Together, these contributions offer great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.
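The abstract reports state-of-the-art and baseline results without naming the metric on this page. Summarization systems are commonly scored by n-gram overlap with a reference summary (ROUGE-style), so the following is a minimal sketch of that idea, not the authors' evaluation code. The function names, the whitespace tokenization, and the example strings are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N style precision, recall, and F1 on lowercased whitespace tokens.

    Hypothetical helper: real evaluations typically use a proper tokenizer
    and a published scoring implementation.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example strings, not taken from SumeCzech or Posel od Čerchova.
reference = "the town council approved the new bridge"
candidate = "the council approved the bridge"
for n in (1, 2):
    p, r, f = rouge_n(candidate, reference, n)
    print(f"ROUGE-{n}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Whitespace tokenization is a deliberate simplification here. For a morphologically rich language such as Czech, the choice of tokenization and whether to stem can change the scores noticeably, so reported numbers are only comparable under the same scoring setup.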

Problem

Research questions and friction points this paper is trying to address.

Exploring Czech text summarization for historical documents

Addressing scarcity of annotated datasets for Czech summarization

Applying large language models to improve Czech summarization performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using Mistral and mT5 for Czech summarization

Achieving state-of-the-art on SumeCzech dataset

Introducing Posel od Čerchova historical dataset
