🤖 AI Summary
Large language models (LLMs) are rarely evaluated well on phonological understanding: existing benchmarks can often be solved by memorization or conflate phonology with other linguistic competencies, which makes genuine phonological reasoning hard to assess. This work proposes Phun-Bench, a multidimensional benchmark for evaluating Chinese phonological understanding along three dimensions: homophony, rhyme, and phonetic similarity. Its diverse tasks and settings are designed to assess distinct phonological capabilities separately and systematically. Empirical results show that LLMs recall pronunciations accurately but generally struggle to apply phonological knowledge as flexibly and intuitively as human speakers do, pointing to a gap between how models "perceive" sound and human phonological intuition. The authors also propose a hypothesis about the mechanism underlying LLMs' phonological understanding, which they identify as an underexplored direction for future research.
📝 Abstract
Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks of LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate for measuring genuine phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the mechanism underlying LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.