🤖 AI Summary
This study addresses the difficulty of applying psychometric analysis directly to unstructured text. It proposes a paradigm in which documents are treated as respondents and words as test items: context-aware word embeddings from encoder-only Transformer models are converted into contextual scores, which serve as response data. Factor analytic techniques, including exploratory factor analysis and bifactor modeling, are then used to uncover latent knowledge dimensions and their structure. The key contribution is the integration of contextual embeddings from large language models into a psychometric framework, giving a natural, interpretable way to measure how word meaning varies across documents. Experiments on the Wiki STEM corpus identify interpretable knowledge dimensions, indicating the method's potential for text analysis in education, psychology, and law.
📝 Abstract
This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings to create contextual scores, we transform textual data into response data suitable for psychometric analysis. Treating documents as individuals and words as items, this approach provides a natural psychometric interpretation under the assumption that certain keywords, whose contextual meanings vary significantly across documents, can effectively differentiate documents within a corpus. The modeling process comprises two stages: obtaining contextual scores and performing psychometric analysis. In the first stage, we use natural language processing techniques and encoder-based Transformer models to identify common keywords and generate contextual scores. In the second stage, we employ various types of factor analysis, including exploratory and bifactor models, to extract and define latent factors, determine factor correlations, and identify the most significant words associated with each factor. Applied to the Wiki STEM corpus, our experimental results demonstrate the method's potential to uncover latent knowledge dimensions and patterns within textual data. This approach not only enhances the psychometric analysis of textual data but also holds promise for applications in fields rich in textual information, such as education, psychology, and law.
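The two-stage process described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions: the keyword list is hypothetical, random vectors stand in for embeddings that would come from an encoder-based Transformer, the contextual score is taken to be the cosine similarity between a keyword's embedding in a document and its corpus-wide mean embedding (the abstract does not specify the scoring rule), and unrotated principal-component loadings of the correlation matrix stand in for a full exploratory or bifactor analysis.

```python
import numpy as np


def contextual_scores(embeddings: np.ndarray) -> np.ndarray:
    """Stage 1: turn keyword embeddings into a documents-by-words score matrix.

    embeddings has shape (n_docs, n_words, dim): one contextual vector per
    keyword per document. The score used here (cosine similarity to the
    keyword's corpus-wide mean vector) is an ASSUMPTION for illustration.
    """
    centroid = embeddings.mean(axis=0, keepdims=True)   # (1, n_words, dim)
    dot = (embeddings * centroid).sum(axis=-1)          # (n_docs, n_words)
    norms = (np.linalg.norm(embeddings, axis=-1)
             * np.linalg.norm(centroid, axis=-1))       # broadcasts to (n_docs, n_words)
    return dot / norms


def principal_loadings(scores: np.ndarray, n_factors: int) -> np.ndarray:
    """Stage 2 (simplified): loadings of words (items) on latent factors.

    Uses an eigendecomposition of the item correlation matrix, i.e.
    principal-component extraction without rotation. This is a stand-in
    for exploratory factor analysis, not a bifactor model.
    """
    corr = np.corrcoef(scores, rowvar=False)            # (n_words, n_words)
    eigvals, eigvecs = np.linalg.eigh(corr)             # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_factors]         # indices of the largest ones
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


# Synthetic stand-in data: documents are respondents, keywords are items.
rng = np.random.default_rng(0)
keywords = ["energy", "field", "cell", "function", "model", "system"]  # hypothetical
embeddings = rng.normal(size=(200, len(keywords), 16))  # placeholder for encoder output

scores = contextual_scores(embeddings)                  # response data, (200, 6)
loadings = principal_loadings(scores, n_factors=2)      # (6, 2)

# Report the keywords with the largest absolute loading on each factor.
for k in range(loadings.shape[1]):
    top_idx = np.argsort(-np.abs(loadings[:, k]))[:3]
    print(f"factor {k}:", [keywords[i] for i in top_idx])
```

With real data, the random array would be replaced by per-document keyword vectors pooled from an encoder's hidden states, and the loading step by a dedicated factor analysis routine that supports rotation and bifactor structure.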