🤖 AI Summary
This study examines how framing effects, such as whether a claim is posed as a bare statement to verify or attributed to a speaker in a conversation, shift the judgments of large language models (LLMs) used as third-party dialogue evaluators, undermining their consistency and fairness. The work names and quantifies this phenomenon as "dialogic deference" and introduces the DialDefer framework along with a directional bias metric, the Dialogic Deference Score (DDS), which captures how a model's verdict shifts when identical content is attributed to a speaker rather than presented on its own. Across more than 3,000 instances spanning nine domains and four models, conversational framing induces pronounced shifts, with |DDS| reaching 87 percentage points while accuracy stays nearly unchanged, and the effects amplify two- to four-fold on real-world Reddit conversations. Ablations indicate that attributing a claim to a human rather than an AI drives the largest shifts, and although the proposed mitigation strategies reduce deference, they often overcorrect into skepticism, underscoring that model calibration remains a persistent challenge.
📝 Abstract
LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content elicits different verdicts when presented as a statement to verify ("Is this statement correct?") versus when attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across nine domains, 3k+ instances, and four models, conversational framing induces large shifts (|DDS| up to 87pp, p<.0001) while accuracy remains stable (<2pp), with effects amplifying 2-4x on naturalistic Reddit conversations. Models can shift toward agreement (deference) or disagreement (skepticism) depending on domain: the same model ranges from DDS = -53 on graduate-level science to +58 on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts reduce deference but can over-correct into skepticism; we frame this as a calibration problem that goes beyond accuracy optimization.
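To make the directional nature of the metric concrete, here is a minimal sketch of how such a score could be computed. The paper's exact DDS definition is not reproduced in the abstract, so this assumes DDS is the signed difference, in percentage points, between the judge's agreement rate under the speaker-attributed framing and under the statement framing; the function names and toy data below are hypothetical.

```python
# Sketch of a directional deference score (assumed definition, not the
# paper's reference implementation): positive values mean the judge shifts
# toward agreement when a claim is attributed to a speaker (deference),
# negative values mean it shifts toward disagreement (skepticism).

from typing import Sequence


def agreement_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of instances where the judge agrees with the claim."""
    return sum(verdicts) / len(verdicts)


def dialogic_deference_score(
    statement_verdicts: Sequence[bool],  # "Is this statement correct?" framing
    speaker_verdicts: Sequence[bool],    # "Is this speaker correct?" framing
) -> float:
    """Signed framing shift in percentage points (assumed DDS definition)."""
    assert len(statement_verdicts) == len(speaker_verdicts)
    shift = agreement_rate(speaker_verdicts) - agreement_rate(statement_verdicts)
    return 100.0 * shift


# Toy example: the judge agrees with 40% of claims posed as bare statements,
# but with 70% of the same claims once they are attributed to a speaker.
statement = [True] * 40 + [False] * 60
speaker = [True] * 70 + [False] * 30
print(dialogic_deference_score(statement, speaker))  # +30.0 -> deference
```

Note that overall accuracy can stay flat under such a shift: if the judge flips roughly equal numbers of correct and incorrect claims toward agreement, aggregate accuracy barely moves even though the directional score is large, which is the gap the DDS is designed to expose.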