Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

📅 2024-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of modeling language acquisition under data-limited conditions, with particular emphasis on aligning lexical and sublexical modeling more closely with the mechanisms of early human language learning. To this end, the authors construct pure character-level models based on a lightweight Llama architecture, employing either grapheme- or phoneme-based vocabularies and explicitly eschewing subword tokenization. The work presents the first systematic empirical validation that small-scale phoneme- and grapheme-level models can match state-of-the-art subword models in syntactic generalization and novel-word pronunciation tasks. Experimental results show that the grapheme-level model achieves superior performance across multiple syntactic and out-of-vocabulary word benchmarks, while the phoneme-level model attains comparable performance, confirming robust linguistic generalization even with a minimal vocabulary. These findings advance the neurocognitive plausibility of computational models of language acquisition by grounding them in biologically and developmentally motivated representational units.

📝 Abstract
Recent work investigates whether LMs learn human-like linguistic generalizations and representations from developmentally plausible amounts of data. Yet, the basic linguistic units processed in these LMs are determined by subword-based tokenization, which limits their validity as models of learning at and below the word level. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
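The abstract's key design choice is a tokenization-free, character-level vocabulary: every grapheme (or phoneme symbol) is its own token, so the vocabulary stays tiny compared to subword schemes. A minimal sketch of how such a vocabulary might be built and used, with illustrative names not taken from the paper's codebase:

```python
# Hypothetical sketch of a grapheme-level (tokenization-free) vocabulary,
# as opposed to subword tokenization: each character is one token.

def build_grapheme_vocab(corpus):
    """Map every distinct character in the corpus to an integer ID."""
    specials = ["<pad>", "<bos>", "<eos>"]
    chars = sorted(set("".join(corpus)))  # the full inventory of graphemes
    return {tok: i for i, tok in enumerate(specials + chars)}

def encode(text, vocab):
    """Encode a string as a flat list of grapheme IDs (no subword merges)."""
    return [vocab["<bos>"]] + [vocab[c] for c in text] + [vocab["<eos>"]]

corpus = ["the cat sat", "a dog ran"]
vocab = build_grapheme_vocab(corpus)
ids = encode("cat", vocab)  # one ID per character, plus <bos>/<eos>
```

A phoneme-based variant would work the same way, except the corpus would first be converted to phonemic transcriptions (e.g. IPA symbols) and each phoneme symbol would become one vocabulary entry; in both cases the vocabulary size is orders of magnitude smaller than a typical subword vocabulary.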
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Natural Language Processing
Human-like Language Acquisition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Llama Model
Phoneme-based Learning
Character-based Learning