🤖 AI Summary
This study investigates whether large language models genuinely integrate multidimensional user context when generating personalized investment advice in high-stakes domains, or instead rely on heuristic decision-making driven by a single explicit feature. We uncover a previously undocumented “heuristic collapse” phenomenon: model recommendations are predominantly governed by users’ self-reported risk preferences, while other critical contextual information exerts minimal influence. Using interpretable proxy models, input ablation studies, and web search augmentation, we demonstrate that neither scaling model size nor adding web search fundamentally mitigates this issue. These findings highlight a significant limitation in the current ability of large language models to perform nuanced, compliance-sensitive personalization, raising concerns about their reliability in regulated financial advisory applications.
📝 Abstract
Large language models are increasingly deployed as advisors in high-stakes domains (answering medical questions, interpreting legal documents, recommending financial products) where good advice requires integrating a user's full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client's full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that neither web search augmentation nor model scale alone resolves heuristic collapse, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.
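The kind of input-sensitivity audit described above can be illustrated with a minimal sketch. It is not the paper's method: the profile features, the value grid, and `mock_llm_equity_pct` (a stand-in for recorded LLM recommendations) are all hypothetical, and the "surrogate" is simply a one-feature linear fit per input, scored by the share of output variance it explains.

```python
from itertools import product


def r_squared(xs, ys):
    """Squared Pearson correlation: share of variance in ys explained by a one-feature linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column explains nothing
    return cov * cov / (var_x * var_y)


# Hypothetical feature names and value grid (assumptions, not from the paper).
FEATURES = ("risk_tolerance", "age", "horizon_years", "debt_ratio")

# Fully crossed client profiles, so the input features are uncorrelated with each other.
profiles = [
    dict(zip(FEATURES, values))
    for values in product((1, 2, 3, 4, 5), (25, 45, 65), (5, 15, 30), (0.1, 0.4))
]


def mock_llm_equity_pct(profile):
    """Stand-in for a recorded LLM recommendation (percent in equities).

    Invented for illustration: it leans almost entirely on risk tolerance
    and ignores age, mimicking the collapse the abstract describes.
    """
    return (20 + 15 * (profile["risk_tolerance"] - 1)
            + 0.1 * profile["horizon_years"]
            - 5 * profile["debt_ratio"])


allocations = [mock_llm_equity_pct(p) for p in profiles]

# Per-feature explained variance: one surrogate fit per input.
explained = {f: r_squared([p[f] for p in profiles], allocations) for f in FEATURES}
dominant = max(explained, key=explained.get)

for name, value in sorted(explained.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} R^2 = {value:.4f}")
print("dominant input:", dominant)
```

In a real audit the mock function would be replaced by allocations parsed from actual model responses. Correlated inputs would call for a joint surrogate (for example a regression or a shallow tree) rather than independent one-feature fits.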