🤖 AI Summary
This study systematically evaluates the pragmatic reasoning capabilities of large language models (LLMs): their ability to model communicative goals and contextual norms. Method: We construct a fine-grained conceptual reasoning benchmark based on the communication game *Wavelength* and propose an evaluation framework that integrates Rational Speech Act (RSA) modeling with Bayesian inference. We compare three generation strategies: direct prompting, chain-of-thought (CoT) prompting, and RSA-augmented generation. Contribution/Results: On comprehension tasks, state-of-the-art LLMs achieve near-human accuracy and correlate strongly with human judgments (r > 0.92). On generation tasks, RSA-augmented prompting significantly outperforms both CoT and direct prompting (+12.7% accuracy). The work provides empirical evidence that LLMs implicitly possess pragmatic inference capacities approximating those of humans; moreover, RSA-based guidance can explicitly steer LLM outputs toward contextually appropriate expressions. These findings establish a paradigm for pragmatic modeling and controllable language generation.
📝 Abstract
Language use is shaped by pragmatics -- i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs' pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.
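To make the RSA component concrete, the Bayesian inference chain it describes is the standard literal-listener / pragmatic-speaker / pragmatic-listener recursion. The sketch below uses a hypothetical three-utterance scalar lexicon and a uniform prior, not the paper's Wavelength setup or LM-derived scores; `alpha` is the usual speaker-rationality parameter.

```python
# Minimal RSA sketch (toy lexicon; illustrative only, not the paper's setup).
import numpy as np

# lexicon[u, m] = 1.0 if utterance u is literally true of meaning m.
# Hypothetical scalar items over meanings (none, some-not-all, all).
lexicon = np.array([
    [1.0, 1.0, 0.0],   # "some" is true of meanings 0 and 1
    [0.0, 1.0, 1.0],   # "most" is true of meanings 1 and 2
    [0.0, 0.0, 1.0],   # "all"  is true of meaning 2 only
])
prior = np.ones(3) / 3   # uniform prior over meanings
alpha = 4.0              # speaker rationality (assumed value)

def normalize(m):
    """Row-normalize a matrix into conditional distributions."""
    return m / m.sum(axis=1, keepdims=True)

# Literal listener: L0(m | u) ∝ [[u]](m) * P(m)
L0 = normalize(lexicon * prior)

# Pragmatic speaker: S1(u | m) ∝ L0(m | u) ** alpha
# (softmax of alpha * log L0; rows index meanings, columns utterances)
S1 = normalize((L0 ** alpha).T)

# Pragmatic listener: L1(m | u) ∝ S1(u | m) * P(m)
L1 = normalize(S1.T * prior)
```

With these toy values, `L1` assigns "some" more probability to the some-not-all meaning than to the all-compatible one, i.e., the recursion derives a scalar implicature from Bayesian reasoning alone, which is the mechanism the RSA-augmented prompting strategy taps into.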