🤖 AI Summary
This study investigates the restructuring of the grammatical gender system from a three-way (masculine, feminine, neuter) to a two-way (masculine, feminine) classification during the transition from Latin to Occitan. To this end, we propose an interpretable deep learning framework that integrates morphological and contextual features to quantify the contribution of different part-of-speech categories to gender prediction. We also introduce a novel tokenizer designed for low-resource historical languages. Experimental results show that the tokenizer outperforms baseline tokenization approaches and that the framework disentangles the relative influence of word form and syntactic context in gender assignment. All code, data, and experimental results are publicly released.
📝 Abstract
The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available.