🤖 AI Summary
This study investigates fairness concerns in large language models (LLMs) deployed in high-stakes settings, where differential outputs across user groups may arise from conversational history. Integrating sociolinguistic analysis, psycholinguistic feature extraction, and behavioral model evaluation, the work systematically examines how users’ sociodemographic attributes and conversation characteristics, such as topic, emotions, and readability, influence model outputs. Findings indicate that LLMs struggle to accurately infer users’ sociodemographic attributes from a single conversation history, and that output disparities across groups, while present, are minimal in magnitude. Instead, conversation topic emerges as the strongest predictor of output variation: it acts, to some extent, as a proxy for sociodemographic group membership and often shapes model-generated advice in unpredictable ways.
📝 Abstract
When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical, and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that, although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate the main driver of these disparities, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that, within a conversational context, conversation topics are most predictive of LLM-generated advice. These topics function, to some extent, as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.