Navigating Rifts in Human-LLM Grounding: Study and Benchmark

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical collaboration bottleneck in human-LLM dialogue: grounding failure. Grounding failures, such as failing to initiate clarification or to issue follow-up requests, lead to misunderstandings and potentially high-risk outcomes, and LLMs significantly underperform humans in these behaviors. Methodologically, the study constructs a fine-grained taxonomy of grounding acts and predictive models of grounding behavior from WildChat, MultiWOZ, and Bing Chat logs; identifies early grounding failures as strong predictors of subsequent dialogue breakdown; and introduces RIFTS, the first benchmark dedicated to evaluating LLM grounding failures. In the log analysis, LLMs initiated clarification at only one-third the rate of humans and issued follow-up requests at one-sixteenth the rate, and current frontier models also perform poorly on RIFTS. Crucially, a lightweight, preliminary intervention is shown to mitigate such failures.
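To make the annotation step concrete, the sketch below shows one way dialogue turns could be labeled with grounding acts using an off-the-shelf chat model as a classifier. The act labels, prompt wording, and model name are illustrative assumptions, not the paper's released pipeline.

```python
# Minimal sketch: labeling assistant turns with grounding acts via an LLM.
# The act labels below are illustrative; the paper's taxonomy may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GROUNDING_ACTS = {"clarification", "follow-up", "acknowledgment", "none"}

PROMPT = """You are annotating one turn of a human-assistant dialogue.
Label the assistant turn with exactly one grounding act:
- clarification: asks the user to resolve ambiguity before answering
- follow-up: requests additional information after answering
- acknowledgment: confirms understanding of the request
- none: answers directly with no grounding behavior

User turn: {user_turn}
Assistant turn: {assistant_turn}
Reply with one label only."""

def label_grounding_act(user_turn: str, assistant_turn: str) -> str:
    """Return the predicted grounding act for a single assistant turn."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; any chat model works
        messages=[{"role": "user", "content": PROMPT.format(
            user_turn=user_turn, assistant_turn=assistant_turn)}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in GROUNDING_ACTS else "none"
```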

📝 Abstract
Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding -- the process by which conversation participants establish mutual understanding -- can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, early grounding failures predicted later interaction breakdowns. Building on these insights, we introduce RIFTS: a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on RIFTS, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention that mitigates grounding failures.
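The finding that early grounding failures predict later breakdowns suggests a simple forecasting setup. The sketch below is a minimal illustration under assumed features (grounding-act counts in the opening turns) and hypothetical labels; it is not the paper's model.

```python
# Minimal sketch: predict later dialogue breakdown from grounding-act
# counts in the opening turns. Data and labels below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [clarifications, follow-ups, acknowledgments] in the first 3 turns.
X = np.array([
    [1, 1, 2],  # early grounding present
    [0, 0, 0],  # no early grounding
    [2, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 2, 1],
])
y = np.array([0, 1, 0, 0, 1, 0])  # 1 = conversation later broke down

model = LogisticRegression().fit(X, y)
risk = model.predict_proba([[0, 0, 0]])[0, 1]
print(f"breakdown risk with no early grounding acts: {risk:.2f}")
```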
Problem

Research questions and friction points this paper is trying to address.

Study grounding challenges in human-LLM interactions.
Analyze differences in grounding acts between humans and LLMs.
Develop benchmark and intervention to mitigate grounding failures.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed taxonomy of grounding acts
Created RIFTS benchmark targeting LLM grounding failures
Designed lightweight intervention to reduce grounding failures (see the sketch below)
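A minimal sketch of a lightweight intervention of the kind described: gate each response by first asking the model whether the request is underspecified, and surface a clarifying question when it is. The gating prompt, model name, and control flow are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch of a pre-response grounding gate (illustrative, not the
# paper's exact intervention): decide whether to ask for clarification
# before answering.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GATE_PROMPT = """Decide whether the user request below is underspecified.
If key details are missing, reply with exactly one clarifying question.
If the request is answerable as-is, reply with the single word ANSWER.

Request: {request}"""

def respond_with_grounding(request: str) -> str:
    """Ask a clarifying question when the request is underspecified,
    otherwise answer it directly."""
    gate = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user",
                   "content": GATE_PROMPT.format(request=request)}],
        temperature=0,
    ).choices[0].message.content.strip()

    if gate != "ANSWER":
        return gate  # surface the clarifying question to the user

    answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": request}],
    )
    return answer.choices[0].message.content
```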