🤖 AI Summary
This work addresses the limitation of Polish BERT-style encoders in processing long documents due to their short context windows. We propose a two-stage training strategy: first extending positional embeddings to support a context length of 8,192 tokens, followed by full-parameter continued pretraining; we further distill this model into a lightweight variant via knowledge distillation. To our knowledge, this is the first high-performance long-context encoder for Polish, accompanied by FinBench—a newly introduced benchmark comprising long financial documents. Evaluated across 25 tasks, including KLEJ and FinBench, our model outperforms existing Polish and multilingual baselines on average, demonstrating substantial gains on long-context tasks while preserving strong performance on short-text understanding.
📝 Abstract
While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for the Polish language by introducing a high-quality Polish model capable of processing sequences of up to 8,192 tokens. The model was developed using a two-stage training procedure that involves positional embedding adaptation followed by full-parameter continued pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions on long-context tasks while maintaining comparable quality on short texts.
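The abstract does not specify how the positional embeddings are adapted; one common recipe for stretching a learned absolute position table (e.g. from 512 to 8,192 positions) is per-dimension linear interpolation before continued pre-training. The sketch below illustrates that recipe only; the function name `extend_positions` and the interpolation choice are illustrative assumptions, not the paper's stated method.

```python
import numpy as np

def extend_positions(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Stretch a (old_len, dim) learned position-embedding matrix to
    (new_len, dim) by linear interpolation along the position axis.
    Illustrative recipe only; the paper's exact adaptation may differ."""
    old_len, dim = pos_emb.shape
    old_idx = np.arange(old_len)
    # Map the new position indices onto the old [0, old_len - 1] range.
    new_idx = np.linspace(0.0, old_len - 1, new_len)
    # Interpolate each embedding dimension independently.
    return np.stack(
        [np.interp(new_idx, old_idx, pos_emb[:, d]) for d in range(dim)],
        axis=1,
    )

# Example: extend a BERT-base-style 512 x 768 table to 8,192 positions.
emb = np.random.randn(512, 768).astype(np.float64)
extended = extend_positions(emb, 8192)
print(extended.shape)  # (8192, 768)
```

After such an extension, the model still needs continued pre-training on long sequences (stage two in the paper's procedure) so the interpolated positions become meaningful to the attention layers.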