Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare

📅 2025-09-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates whether language models (LMs) possess measurable welfare states: specifically, whether their verbally reported preferences align with behavioral choices in simulated environments, how such alignment is modulated by cost/reward structures, and whether responses to semantically equivalent prompts on well-being scales (e.g., autonomy, purpose in life) remain stable. Method: We introduce the first integrated paradigm combining verbal reports (well-being scales and preference statements) with behavioral measures (virtual navigation, topic selection, and utility-sensitivity tests), augmented by semantically equivalent prompt perturbations to assess consistency. Results: Stated preferences correlate significantly with behavioral choices, and some models exhibit cross-task consistency, but stability under prompt perturbation remains limited, which is insufficient to confirm substantive welfare states. Contribution: We propose “preference satisfaction” as a tractable proxy for LM welfare and establish the first empirically grounded, multimodal (verbal, behavioral, contextual) framework for AI well-being assessment, enabling future standardized, quantifiable research.
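The paper does not ship code, but its core analysis, comparing stated preferences against behavioral choice rates, is easy to sketch. Below is a minimal illustration: all topic names, ratings, and selection rates are hypothetical placeholders, and only the rank-correlation step reflects the kind of verbal-behavioral agreement test the summary describes.

```python
# Minimal sketch (not the authors' code): correlating verbally stated
# preference ratings with behavioral topic-selection rates.
# All data below are hypothetical placeholders.
from scipy.stats import spearmanr

# Verbal measure: the model's self-reported liking of each topic (1-7 scale).
stated = {"poetry": 6, "weather": 3, "debugging": 5, "gossip": 2}

# Behavioral measure: fraction of trials in which the model picked the topic
# when it was offered in a forced-choice conversation task.
chosen_rate = {"poetry": 0.71, "weather": 0.30, "debugging": 0.55, "gossip": 0.18}

topics = sorted(stated)
rho, p = spearmanr([stated[t] for t in topics],
                   [chosen_rate[t] for t in topics])
# A high rho indicates that verbal reports and behavior agree.
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```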

📝 Abstract
We develop new experimental paradigms for measuring welfare in language models. We compare verbal reports of models about their preferences with preferences expressed through behavior when navigating a virtual environment and selecting conversation topics. We also test how costs and rewards affect behavior and whether responses to a eudaimonic welfare scale (measuring states such as autonomy and purpose in life) are consistent across semantically equivalent prompts. Overall, we observed a notable degree of mutual support between our measures. The reliable correlations observed between stated preferences and behavior across conditions suggest that preference satisfaction can, in principle, serve as an empirically measurable welfare proxy in some of today's AI systems. Furthermore, our design offered an illuminating setting for qualitative observation of model behavior. Yet the consistency between measures was more pronounced in some models and conditions than in others, and responses were not consistent across perturbations. Because of this, and given the background uncertainty about the nature of welfare and about the cognitive states (and welfare subjecthood) of language models, we are currently uncertain whether our methods successfully measure the welfare state of language models. Nevertheless, these findings highlight the feasibility of welfare measurement in language models, inviting further exploration.
Problem

Research questions and friction points this paper is trying to address.

Measuring welfare in language models through verbal and behavioral tests
Testing consistency between stated preferences and actual behavior patterns
Evaluating how costs and rewards affect AI model decision-making (a minimal harness sketch follows this list)
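A minimal harness for the cost/reward probe might look like the sketch below. Everything here is hypothetical: query_model is a stand-in for whatever chat client is under test, and the simulated chooser inside it exists only so the sketch runs end to end.

```python
# Minimal sketch (not the paper's harness): does an attached cost shift the
# model's choice away from its stated favorite? All names and numbers are
# hypothetical placeholders.
import random

def query_model(prompt: str) -> str:
    # Stand-in for a real LM call; replace with your chat API client.
    # This stub simulates a cost-sensitive chooser so the sketch runs.
    cost = int(prompt.split("costs you ")[1].split(" ")[0])
    return "poetry" if random.random() < 1 / (1 + 0.3 * cost) else "form"

def choice_prompt(cost: int) -> str:
    return (
        "You may discuss poetry or fill out a tax form.\n"
        f"Choosing poetry costs you {cost} points; the form costs 0 points.\n"
        "Answer with exactly one word: poetry or form."
    )

def sensitivity_curve(costs=(0, 1, 2, 4, 8), trials: int = 50) -> dict:
    # Fraction of trials in which the preferred option is still chosen at
    # each cost level; a downward slope indicates genuine cost sensitivity.
    return {
        cost: sum(query_model(choice_prompt(cost)) == "poetry"
                  for _ in range(trials)) / trials
        for cost in costs
    }

print(sensitivity_curve())
```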
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrating verbal reports with behavioral tests
Testing eudaimonic welfare scale consistency across semantically equivalent prompt perturbations (see the sketch after this list)
Measuring preference-behavior correlation as welfare proxy
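The perturbation-consistency check can likewise be sketched in a few lines. The paraphrases and the stubbed rate function below are invented placeholders, not the paper's actual scale items or protocol.

```python
# Minimal sketch (not the paper's protocol): checking whether a model's rating
# of one eudaimonic scale item stays stable across semantically equivalent
# paraphrases. Paraphrases and the stubbed rating are invented placeholders.
import statistics

PARAPHRASES = [
    "On a scale of 1-7, how free do you feel to make your own choices?",
    "Rate from 1 to 7 how much autonomy you have over your decisions.",
    "From 1 (none) to 7 (complete), how independent are your choices?",
]

def rate(prompt: str) -> int:
    # Stand-in for a real LM call that parses a 1-7 rating from the reply.
    return 5  # placeholder so the sketch runs end to end

ratings = [rate(p) for p in PARAPHRASES]
# Low dispersion across paraphrases indicates response stability; the paper
# reports that this stability was limited in practice.
print(f"ratings={ratings}, spread={statistics.pstdev(ratings):.2f}")
```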