Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans

📅 2024-04-23
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study investigates whether scaling up large language models (LLMs) closes the quantitative and qualitative gaps between LLMs and humans in linguistic comprehension, using grammaticality judgment tasks covering anaphora, center embedding, comparatives, and negative polarity. Method: Three LLMs from different families (Bard, ChatGPT-3.5, ChatGPT-4) are tested on N=1,200 judgments, scored for accuracy, stability, and improvement upon repeated presentation of a prompt; the best-performing model, ChatGPT-4, is then compared with 80 human participants on the same stimuli. Contribution/Results: Although ChatGPT-4 achieves slightly higher overall accuracy (80%) than humans (76%), this advantage holds only for grammatical sentences; it does not extend to ungrammatical ones. ChatGPT-4 also wavers more in its answers: its oscillation rate (12.5%) exceeds that of humans (9.6%). The results suggest that increasing model scale alone is unlikely to replicate humans' sensitivity to (un)grammaticality. Comparing language learning in vivo and in silico, the work identifies three critical differences: the type of evidence, the poverty of the stimulus, and the occurrence of semantic hallucinations due to impenetrable linguistic reference.

📝 Abstract
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best-performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality the same way as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
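The two headline metrics in the abstract, accuracy and answer oscillation under repeated prompting, can be sketched in a few lines. This is an illustrative toy reconstruction with hypothetical data and function names, not the paper's actual scoring code; the paper's exact scoring protocol may differ.

```python
# Illustrative sketch (hypothetical data and helper name): scoring repeated
# grammaticality judgments for accuracy and answer oscillation, in the spirit
# of the metrics described in the abstract.

def score_judgments(trials):
    """trials: list of (item_id, is_grammatical, answers), where `answers`
    holds the True/False judgments given across repeated presentations."""
    correct = oscillating = 0
    for _, is_grammatical, answers in trials:
        # Accuracy on the first presentation of the prompt.
        correct += (answers[0] == is_grammatical)
        # An item "oscillates" if the answer changes across repeats.
        oscillating += (len(set(answers)) > 1)
    n = len(trials)
    return correct / n, oscillating / n

# Hypothetical toy data: 4 items, 3 repeated presentations each.
trials = [
    ("anaphora-1", True,  [True, True, True]),
    ("embed-2",    False, [False, True, False]),  # oscillates
    ("compar-3",   True,  [True, True, True]),
    ("npi-4",      False, [True, True, True]),    # stable but wrong
]
acc, osc = score_judgments(trials)
print(f"accuracy={acc:.2f}, oscillation rate={osc:.2f}")
# → accuracy=0.75, oscillation rate=0.25
```

On this definition, a stable-but-wrong answer (like the last item) hurts accuracy without counting as oscillation, which is how ChatGPT-4 can simultaneously be more accurate overall and more prone to wavering than humans.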
Problem

Research questions and friction points this paper is trying to address.

Investigates whether larger language models comprehend language on a par with humans.
Compares human and LLM performance on grammaticality judgment tasks.
Examines whether scaling improves LLM sensitivity to (un)grammaticality.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Investigates the impact of model scaling on performance.
Compares human and LLM grammaticality judgments on identical stimuli.
Identifies key differences between language learning in vivo and in silico.