🤖 AI Summary
This study systematically evaluates the validity of responses generated by large language models (LLMs) to self-regulated learning (SRL) psychological scales, specifically the Motivated Strategies for Learning Questionnaire (MSLQ), to assess their utility for theoretical validation and intervention design. Method: Responses were generated by five LLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2 Flash, LLaMA 3.1-8B, Mistral Large) and evaluated with an integrated validation framework combining confirmatory and exploratory factor analysis (CFA/EFA) with psychometric network analysis to jointly assess structural validity and theoretical alignment. Contribution/Results: Gemini 2 Flash performed best, replicating the MSLQ's established factor structure and inter-dimensional relationships while also showing considerable sampling variability. Overall, LLMs show promising capacity to simulate SRL-related psychological data, but limitations in response consistency, dimensionality fidelity, and construct validity remain, highlighting both the potential and the current constraints of leveraging LLMs for SRL measurement and theory testing.
📝 Abstract
Large language models (LLMs) offer the potential to simulate human-like responses and behaviors, creating new opportunities for psychological science. In the context of self-regulated learning (SRL), if LLMs can reliably simulate survey responses at scale and speed, they could be used to test intervention scenarios, refine theoretical models, augment sparse datasets, and represent hard-to-reach populations. However, the validity of LLM-generated survey responses remains uncertain, with limited research focused on SRL and existing studies beyond SRL yielding mixed results. Therefore, in this study, we examined LLM-generated responses to the 44-item Motivated Strategies for Learning Questionnaire (MSLQ; Pintrich & De Groot, 1990), a widely used instrument assessing students' learning strategies and academic motivation. Specifically, we used the LLMs GPT-4o, Claude 3.7 Sonnet, Gemini 2 Flash, LLaMA 3.1-8B, and Mistral Large. We analyzed item distributions, the psychological network of the theoretical SRL dimensions, and psychometric validity based on the latent factor structure. Our results suggest that Gemini 2 Flash was the most promising LLM, showing considerable sampling variability and producing underlying dimensions and theoretical relationships that align with prior theory and empirical findings. At the same time, we observed discrepancies and limitations, underscoring both the potential and the current constraints of using LLMs to simulate psychological survey data and to apply such data in educational contexts.