Word Boundary Information Isn’t Useful for Encoder Language Models

📅 2024-01-15
🏛️ Workshop on Representation Learning for NLP
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work investigates whether explicit word-boundary markers (e.g., subword delimiters such as "##" or "_") meaningfully contribute to the performance of Transformer encoder language models. Method: We systematically evaluate the impact of removing such boundary markers across four pretraining scales in English and Finnish, measuring effects on diverse downstream tasks, including sentence- and token-level classification, complex word identification, and named entity recognition. We pretrain 35 Transformer encoder models using whitespace-agnostic subword tokenizers and benchmark them on multitask suites including Superbizarre and FLOTA. Contribution/Results: Explicit word-boundary encoding yields no statistically significant performance gains, and removing it never harms model performance, across all experimental settings. To our knowledge, this is the first large-scale empirical study with pretrained models to challenge the conventional practice of retaining explicit boundary markers in subword tokenization. Our findings provide robust evidence supporting simpler, morphologically more principled tokenization schemes.

📝 Abstract
All existing transformer-based approaches to NLP using subword tokenisation algorithms encode whitespace (word boundary information) through the use of special space symbols (such as ## or _) forming part of tokens. These symbols have been shown to a) lead to reduced morphological validity of tokenisations, and b) give substantial vocabulary redundancy. As such, removing these symbols has been shown to have a beneficial effect on the processing of morphologically complex words for transformer encoders in the pretrain-finetune paradigm. In this work, we explore whether word boundary information is at all useful to such models. In particular, we train transformer encoders across four different training scales, and investigate several alternative approaches to including word boundary information, evaluating on two languages (English and Finnish) with a range of tasks across different domains and problem set-ups: sentence classification datasets, NER (for token-level classification), and two classification datasets involving complex words (Superbizarre and FLOTA). Overall, through an extensive experimental setup that includes the pretraining of 35 models, we find no substantial improvements from our alternative approaches, suggesting that modifying tokenisers to remove word boundary information does not lead to a loss of useful information.
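The vocabulary redundancy the abstract mentions can be illustrated with a minimal sketch (not the paper's code, and the toy vocabulary below is invented for illustration): WordPiece-style tokenisers mark word-internal pieces with "##", so the same character string can occupy two vocabulary slots (e.g. "able" and "##able"). Stripping the marker, as a whitespace-agnostic tokeniser would, collapses those duplicates.

```python
# Toy WordPiece-style vocabulary: "##" marks word-internal continuation
# pieces, so "able"/"##able" and "like"/"##like" are redundant pairs.
wordpiece_vocab = {"un", "##able", "able", "##like", "like"}

def strip_boundary_markers(vocab):
    """Remove '##' continuation markers, merging redundant entries."""
    # Requires Python 3.9+ for str.removeprefix.
    return {tok.removeprefix("##") for tok in vocab}

boundary_free = strip_boundary_markers(wordpiece_vocab)
print(sorted(boundary_free))                            # ['able', 'like', 'un']
print(len(wordpiece_vocab), "->", len(boundary_free))   # 5 -> 3
```

The same duplication arises with SentencePiece's "_" prefix; the marker choice differs, but the effect on vocabulary size is analogous.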
Problem

Research questions and friction points this paper is trying to address.

Impact of word boundary information
Transformer encoder efficiency
Morphologically complex word processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Remove special space symbols
Train transformer encoders
Evaluate multiple NLP tasks