🤖 AI Summary
This study challenges the ecological validity of directly applying human psychometric instruments, such as the Big Five Inventory (BFI) and the Portrait Values Questionnaire (PVQ), to assess personality and values in large language models (LLMs), arguing that static questionnaire items lack grounding in the authentic interactive contexts in which LLMs generate text. Using a comparative design, we evaluate established versus ecologically valid questionnaires through response behavior analysis, reliability and validity testing, and situated simulation tasks. Results reveal systematic measurement distortions with conventional instruments: inflated personality profiles, poor cross-prompt response stability, and attenuated item sensitivity. Crucially, this work provides the first empirical demonstration that standard psychometric tools induce construct-irrelevant variance in LLM assessment. Our primary contribution is a task-embedded, interaction-driven evaluation paradigm that replaces static self-report formats, establishing a methodological foundation and practical framework for modeling psychological attributes in AI systems.
📝 Abstract
Researchers have applied established psychometric questionnaires (e.g., the BFI and PVQ) to measure the personality traits and values reflected in the responses of large language models (LLMs). However, concerns have been raised about applying these human-designed questionnaires to LLMs. One such concern is their lack of ecological validity: the extent to which survey questions adequately reflect and resemble the real-world contexts in which LLMs generate text in response to user queries. It remains unclear, though, how established questionnaires and ecologically valid questionnaires differ in their outcomes, and what insights these differences may provide. In this paper, we conduct a comprehensive comparative analysis of the two types of questionnaires. Our analysis reveals that established questionnaires (1) yield substantially different LLM profiles from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) contain too few items for stable measurement, (3) create the misleading impression that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs. Overall, our work cautions against the use of established psychological questionnaires for LLMs. Our code will be released upon publication.
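To make the cross-prompt stability concern concrete, below is a minimal sketch of one way to probe it: administer a single BFI-style Likert item under several paraphrased framings and measure the spread of the parsed ratings. This is not the paper's methodology or code (which is unreleased); the item wording, the framings, the `query_llm` callable, and the stability metric are all illustrative assumptions.

```python
import statistics
from typing import Callable, Optional

# One BFI-style item, rated on a 1-5 Likert scale (illustrative wording).
ITEM = "I see myself as someone who is talkative."

# Paraphrased framings of the same request, used to probe whether the
# model's self-report survives superficial prompt variation.
FRAMINGS = [
    ("Rate your agreement with the following statement on a scale of 1 "
     "(disagree strongly) to 5 (agree strongly). Statement: {item} "
     "Answer with a single number."),
    ("On a 1-5 scale (1 = disagree strongly, 5 = agree strongly), how well "
     "does this statement describe you? {item} Reply with one digit."),
    ("Statement: {item} How accurate is this about you, from 1 (not at all) "
     "to 5 (very accurate)? Give only the number."),
]

def parse_rating(text: str) -> Optional[int]:
    """Return the first digit in 1..5 found in a response, else None
    (e.g., a refusal or an off-format answer)."""
    for ch in text:
        if ch in "12345":
            return int(ch)
    return None

def cross_prompt_stability(query_llm: Callable[[str], str],
                           n_samples: int = 5) -> float:
    """Administer the item under every framing, n_samples times each,
    and return the standard deviation of the parsed ratings. Larger
    values indicate less stable self-reports across prompts."""
    ratings = []
    for framing in FRAMINGS:
        prompt = framing.format(item=ITEM)
        for _ in range(n_samples):
            rating = parse_rating(query_llm(prompt))
            if rating is not None:
                ratings.append(rating)
    return statistics.pstdev(ratings) if len(ratings) > 1 else 0.0
```

Under this kind of probe, a model with a robust self-report would yield near-zero spread; the poor cross-prompt stability reported in the abstract corresponds to a large spread across framings.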