Large Language Models for the Summarization of Czech Documents: From History to the Present

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Czech text summarization, particularly for 19th-century historical documents, remains underexplored because of the language’s linguistic complexity and the scarcity of high-quality annotated data. To address this gap: (1) We apply the multilingual LLMs Mistral and mT5 to Czech summarization and achieve new state-of-the-art results on SumeCzech, a benchmark for modern Czech text; (2) We propose a translation-based approach that translates Czech text into English, summarizes it with an English-language model, and translates the summary back into Czech; (3) We introduce *Posel od Čerchova*, a new dataset derived from digitized 19th-century Czech publications and annotated for abstractive summarization, and we provide initial baselines on it using modern LLMs. Together, these contributions supply models and resources for further research on Czech historical document processing and on low-resource summarization more broadly.

📝 Abstract

Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets.

In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. We also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech.

Our study makes the following main contributions:

We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech.

We introduce a new dataset, Posel od Čerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area.

By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.

Problem

Research questions and friction points this paper is trying to address.

Closing the gap in Czech text summarization, especially for historical documents

Overcoming linguistic complexity and dataset scarcity in Czech NLP

Developing methods for summarizing both modern and historical Czech texts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging multilingual LLMs for Czech text summarization

Proposing translation-based summarization via English intermediary

Introducing the historical Czech dataset Posel od Čerchova with initial LLM baselines
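
The translation-based approach (Czech to English, summarize in English, English back to Czech) can be sketched as a composition of three text-to-text steps. This is a minimal illustration, not the authors' implementation: the class `PivotSummarizer`, its field names, and the toy `first_sentence` function are all hypothetical, and the abstract does not say which translation system or English summarizer was used. In practice each callable would wrap a real machine-translation or summarization model.

```python
from dataclasses import dataclass
from typing import Callable

# A text-to-text step: takes a string and returns a string.
TextFn = Callable[[str], str]


@dataclass
class PivotSummarizer:
    """Summarize Czech text by pivoting through English.

    All three callables are hypothetical placeholders supplied by the caller.
    """

    cs_to_en: TextFn      # Czech -> English translation
    summarize_en: TextFn  # English summarization
    en_to_cs: TextFn      # English -> Czech translation

    def summarize(self, czech_text: str) -> str:
        english_text = self.cs_to_en(czech_text)
        english_summary = self.summarize_en(english_text)
        return self.en_to_cs(english_summary)


def first_sentence(text: str) -> str:
    """Toy stand-in for a summarizer: keep only the first sentence."""
    head, sep, _ = text.partition(". ")
    return head + "." if sep else text


if __name__ == "__main__":
    # Toy stand-ins that only tag the text, so the sketch runs without any model.
    pipeline = PivotSummarizer(
        cs_to_en=lambda t: "[en] " + t,
        summarize_en=first_sentence,
        en_to_cs=lambda t: t.replace("[en] ", "[cs] ", 1),
    )
    print(pipeline.summarize("První věta. Druhá věta."))  # prints "[cs] První věta."
```

Passing the three steps in as callables keeps the pipeline independent of any particular model, so a translation system or summarizer can be swapped without touching the composition logic. One consequence of this design is that errors compound: a mistranslation in the first step is carried into the summary and then into the back-translation.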
