AI Summary
This paper identifies an inherent performance trade-off in large language models (LLMs) between coreference resolution and ambiguity detection: while LLMs can be individually optimized for either task, they struggle to excel at both simultaneously. To formalize this tension, we introduce the "CORRECT-DETECT trade-off": the first systematic characterization of the intrinsic conflict between referential accuracy and ambiguity awareness in LLMs. Using a minimal-prompt paradigm, we evaluate both tasks jointly within a unified framework and conduct empirical analysis on human-annotated data. Experiments across mainstream LLMs reveal a significant negative correlation between coreference resolution and ambiguity detection performance, with no model achieving Pareto-optimality on both. This finding exposes a fundamental limitation in LLMs' deep semantic modeling of referential structure and provides critical theoretical insight and empirical evidence for designing robust coreference resolution systems.
Abstract
Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference; however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.