🤖 AI Summary
This study addresses two obstacles to harmonizing longitudinal social science surveys: inconsistent representation of theoretical constructs (e.g., housing, employment) and the evolution of vocabulary and structure over time. We propose a new task: identifying semantic equivalence across survey items and their response options, anchored in theoretical concepts. We adapt the information retrieval (IR) paradigm to survey harmonization and design a multi-stage re-ranking framework that integrates BM25, LDA, linear probes over language models, and IR-specialized neural models (ColBERT, ANCE). Experiments show that IR-specialized models achieve the highest F1 scores, with other approaches performing comparably, and that re-ranking yields only marginal gains (at most 0.07 in F1). Post-hoc evaluation by survey specialists shows that the models have low sensitivity to questions with high lexical overlap, especially when sub-concepts are mismatched. The work provides a task definition, a methodological benchmark, and an empirical foundation for interpretable, theory-driven semantic alignment of longitudinal survey instruments.
📝 Abstract
Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces two challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies and between questions and their response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task: identifying concept (e.g. Housing, Job) equivalence across questions and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance, with other approaches performing comparably. Additionally, re-ranking the probabilistic model's results with neural models introduces only modest improvements of at most 0.07 in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.
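To make the retrieve-then-re-rank idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes Okapi BM25 (named in the summary above) as the probabilistic first stage, with common default parameters `k1=1.5` and `b=0.75` that are not taken from the paper. The second-stage scorer `jaccard` is a deliberately simple placeholder standing in for a neural model, and the four survey questions are made-up toy examples, not items from the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25.

    k1 and b are common defaults, assumed here for illustration.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def jaccard(query, doc):
    """Placeholder second-stage scorer; a neural model would go here."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if (q | d) else 0.0


def retrieve_then_rerank(query, docs, rerank_fn, top_k=3):
    """Keep the top_k BM25 candidates, then reorder them with rerank_fn."""
    first_stage = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)),
                        key=lambda i: first_stage[i], reverse=True)[:top_k]
    return sorted(candidates,
                  key=lambda i: rerank_fn(query, docs[i]), reverse=True)


# Hypothetical toy questions, invented for this sketch.
questions = [
    "Do you own or rent your home?",
    "Is your accommodation owned or rented?",
    "What is your current job title?",
    "How many rooms does your home have?",
]
ranking = retrieve_then_rerank("Do you own or rent your home", questions, jaccard)
print([questions[i] for i in ranking])
```

In this toy setting the second question ("accommodation owned or rented") is a plausible equivalent of the query yet shares few tokens with it, while the fourth shares "your home" but concerns a different sub-concept. Both stages here are purely lexical, so the sketch shows only the pipeline shape; it says nothing about how the neural re-rankers evaluated in the paper behave.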