🤖 AI Summary
This study addresses the challenge of improving language model learning under data-limited conditions, with particular emphasis on aligning lexical and sublexical modeling with the mechanisms of early human language acquisition. To this end, the authors construct pure character-level models based on a lightweight Llama architecture, employing either grapheme- or phoneme-based vocabularies and explicitly eschewing subword tokenization. The work presents the first systematic empirical validation that small-scale phoneme- and grapheme-level models can match state-of-the-art subword models on syntactic generalization and novel-word pronunciation tasks. Experimental results show that the grapheme-level model achieves superior performance across multiple syntactic and out-of-vocabulary word benchmarks, while the phoneme-level model attains comparable performance, confirming robust linguistic generalization even with a minimal vocabulary. These findings strengthen the neurocognitive plausibility of computational models of language acquisition by grounding them in biologically and developmentally motivated representational units.
📝 Abstract
Recent work investigates whether LMs learn human-like linguistic generalizations and representations from developmentally plausible amounts of data. Yet, the basic linguistic units processed in these LMs are determined by subword-based tokenization, which limits their validity as models of learning at and below the word level. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
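To make the tokenization-free setup concrete, the following is a minimal sketch (the tiny corpus and function names are illustrative assumptions, not the paper's implementation): a grapheme-level vocabulary is simply the set of distinct characters in the training data plus a few special tokens, so every word, including a novel one, can be encoded without subword merge rules.

```python
# Minimal sketch of a grapheme-level (character) vocabulary, in contrast to
# subword tokenization. Illustrative only; not the paper's actual code.

SPECIALS = ["<pad>", "<bos>", "<eos>"]

def build_grapheme_vocab(corpus):
    """Map each distinct character to an integer id, after the special tokens."""
    chars = sorted({ch for text in corpus for ch in text})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}

def encode(text, vocab):
    """Encode text one character at a time, wrapped in <bos>/<eos>."""
    return [vocab["<bos>"]] + [vocab[ch] for ch in text] + [vocab["<eos>"]]

corpus = ["the cat sat", "a hat"]
vocab = build_grapheme_vocab(corpus)   # 7 characters + 3 specials = 10 entries
ids = encode("hat", vocab)             # "hat" is in-vocabulary character by character
```

A phoneme-based vocabulary works the same way, with IPA segments (e.g. from a pronunciation dictionary) in place of characters; in both cases the vocabulary stays tiny compared to the tens of thousands of entries in a typical subword vocabulary.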