Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

📅 2026-06-06

🤖 AI Summary

This study investigates whether multimodal large language model (MLLM) agents communicate efficiently in repeated reference tasks by forming partner-specific conventions, or whether their alignment is superficial and reflects only a generic task vocabulary. To isolate the source of alignment, it introduces a constrained pseudo-dialogue baseline that breaks partner interaction history, and it compares human dialogues from the KTH Tangrams dataset with MLLM agent simulations at three levels: task performance, referring strategies, and alignment dynamics. Humans progressively compress their expressions and develop partner-specific alignment through interaction, whereas MLLM agents consistently produce verbose descriptions. Critically, the high lexical overlap exhibited by MLLMs does not differ significantly between real and pseudo-dialogue pairs, indicating that their apparent success stems from exhaustive description rather than from compact, convention-based communication.

📝 Abstract

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.
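The logic of the pseudo-dyad comparison can be illustrated with a minimal sketch. It is not the authors' implementation: the Jaccard word-overlap measure, the data layout, the function names, and the toy utterances are all assumptions made for illustration. The "constraint" is modeled here simply as matching on the same round and target. Real pairs combine a speaker with their actual partner; pseudo pairs combine a speaker with the partner from a different dyad.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard overlap between the word sets of two utterances (illustrative measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def real_pairs(dyads):
    """Pair speaker A with their actual partner B for each shared (round, target) key."""
    out = []
    for dyad in dyads:
        for key, desc_a in dyad["A"].items():
            if key in dyad["B"]:
                out.append((desc_a, dyad["B"][key]))
    return out


def pseudo_pairs(dyads):
    """Pair speaker A of one dyad with speaker B of a different dyad.

    Matching on the same (round, target) key keeps the task structure
    while removing any shared interaction history.
    """
    out = []
    for d1, d2 in permutations(dyads, 2):
        for key, desc_a in d1["A"].items():
            if key in d2["B"]:
                out.append((desc_a, d2["B"][key]))
    return out


def mean_overlap(pairs):
    """Average Jaccard overlap over a list of utterance pairs."""
    scores = [jaccard(x, y) for x, y in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: keys are (round, target); values are descriptions.
dyads = [
    {"A": {(1, "t1"): "the dancing figure with arms up", (2, "t1"): "the dancer"},
     "B": {(1, "t1"): "dancing figure okay", (2, "t1"): "the dancer"}},
    {"A": {(1, "t1"): "a bird facing left with a tail", (2, "t1"): "the bird"},
     "B": {(1, "t1"): "bird with a tail", (2, "t1"): "the bird"}},
]

real = mean_overlap(real_pairs(dyads))
pseudo = mean_overlap(pseudo_pairs(dyads))
print(f"real={real:.3f} pseudo={pseudo:.3f}")
```

In this toy data the dyads behave like the human pattern described above, so real overlap exceeds pseudo overlap. If descriptions were exhaustive and generic, as the abstract reports for agents, the two means would be expected to be close, and a significance test (not shown here) would be needed to compare them.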

Problem

Research questions and friction points this paper is trying to address.

reference games

multimodal LLMs

partner-specific conventions

label alignment

grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal LLMs

reference games

partner-specific conventions