🤖 AI Summary
This work investigates how inflectional morphology, exemplified by Polish, affects the adversarial robustness of pretrained language models. Motivated by the observed failure of mainstream adversarial attacks on inflectional languages, we propose an evaluation protocol grounded in mechanistic interpretability and based on the Edge Attribution Patching (EAP) method, and introduce MultiEmo-Inflect, the first adversarial robustness benchmark tailored to inflectional languages. Through EAP-based circuit analysis, parallel-corpus comparison, adapted TextBugger/TextFooler transfer attacks, and multi-task evaluations, we find that inflectional morphology substantially reduces attack success rates. We identify stem-affix coupling as a critical computational circuit underlying model vulnerability, and reveal syncretism (morphological ambiguity in which distinct grammatical features share an identical surface form) as the principal bottleneck to robustness. Our findings establish a novel paradigm for analysing the interplay between morphological complexity and model fragility, providing both theoretical insight and empirical grounding for future research on robust NLP in morphologically rich languages.
📝 Abstract
Various techniques are used to generate adversarial examples. Methods such as TextBugger introduce minor, barely visible character-level perturbations to words that change model behaviour. Another class of techniques substitutes words with their synonyms so that the text's meaning is preserved while its predicted class changes; TextFooler is a prominent example of such attacks. Most adversarial example generation methods are developed and evaluated primarily on non-inflectional languages, typically English. In this work, we evaluate and explain how adversarial attacks perform on inflectional languages. To explain the impact of inflection on model behaviour and on its robustness under attack, we design a novel protocol inspired by mechanistic interpretability and based on the Edge Attribution Patching (EAP) method. The proposed evaluation protocol relies on parallel task-specific corpora that include both inflected and syncretic variants of texts in two languages -- Polish and English. To analyse the models and explain the relationship between inflection and adversarial robustness, we create a new benchmark based on the task-oriented dataset MultiEmo, enabling the identification of inflection-related circuit components within the model and the analysis of their behaviour under attack.
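To make the two attack families concrete, the sketch below illustrates their core perturbation operations. This is a minimal sketch, not the attacks' actual implementations: the function names, homoglyph table, and synonym table are illustrative assumptions, and the real TextBugger/TextFooler attacks query the victim model to rank word importance and select the perturbation that flips the prediction.

```python
# Illustrative sketches of the two attack families (assumed helpers, not the
# original TextBugger/TextFooler code, which is model-guided).

# Toy homoglyph table: visually similar character substitutions.
HOMOGLYPHS = {"o": "0", "l": "1", "i": "1", "a": "@", "e": "3"}

# Toy synonym table standing in for TextFooler's embedding-based candidates.
SYNONYMS = {"terrible": "awful", "great": "superb"}


def char_swap(word: str) -> str:
    """TextBugger-style bug: swap two adjacent inner characters."""
    if len(word) < 4:
        return word
    i = len(word) // 2 - 1
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def char_substitute(word: str) -> str:
    """TextBugger-style bug: replace one character with a look-alike."""
    for idx, ch in enumerate(word):
        if ch in HOMOGLYPHS:
            return word[:idx] + HOMOGLYPHS[ch] + word[idx + 1:]
    return word


def synonym_swap(sentence: str) -> str:
    """TextFooler-style substitution: swap words for close synonyms."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.split())


def bug_longest_word(sentence: str) -> str:
    """Perturb the longest word -- a cheap stand-in for the importance
    ranking that the real attacks compute from the victim model."""
    words = sentence.split()
    k = max(range(len(words)), key=lambda j: len(words[j]))
    words[k] = char_swap(words[k])
    return " ".join(words)
```

Note that in an inflectional language such as Polish, both operations are more likely to land on or disturb an affix carrying grammatical features, which is one intuition behind the reduced attack success rates reported above.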