🤖 AI Summary
Existing object-naming datasets suffer from opaque annotations and structural heterogeneity, which severely hinders cross-linguistic comparative research. To address this, we introduce the first standardized multilingual object-naming dataset covering 30 languages across 10 language families. Using computer-assisted concept alignment, we map image–word pairs from 17 existing datasets onto a shared semantic concept space. We propose a framework for enhancing transparency and comparability in multilingual naming data, integrating core vocabulary lists from historical linguistics with cognitive naming data for the first time. Our computer-assisted alignment pipeline employs semantic ontology mapping, multilingual word sense disambiguation, and consistency verification. It identifies high-frequency, cross-linguistically stable concepts (e.g., “sleep”) and shows a strong correlation between naming-space coverage and classical basic vocabulary lists (r = 0.82), supporting the linguistic and cognitive grounding of the dataset.
📝 Abstract
Object naming, the act of identifying an object with a word or a phrase, is a fundamental skill in interpersonal communication, relevant to many disciplines, such as psycholinguistics, cognitive linguistics, and language and vision research. Object naming datasets, which consist of concept lists with picture pairings, are used to gain insights into how humans access and select names for objects in their surroundings and to study the cognitive processes involved in converting visual stimuli into semantic concepts. Unfortunately, object naming datasets often lack transparency and have a highly idiosyncratic structure. Our study aims to make current object naming data transparent and comparable by using a multilingual, computer-assisted approach that links individual items of object naming lists to unified concepts. Our current sample links 17 object naming datasets that cover 30 languages from 10 different language families. We illustrate how the comparative dataset can be explored by searching for concepts that recur across the majority of datasets and by comparing the conceptual spaces of the covered object naming datasets with classical basic vocabulary lists from historical linguistics and linguistic typology. Our findings can serve as a basis for enhancing cross-linguistic object naming research and as a guideline for future studies dealing with object naming tasks.
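
The two exploration steps mentioned in the abstract (finding concepts that recur across most datasets, and comparing the covered concepts with a basic vocabulary list) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the dataset names, the item-to-concept mappings, the concept labels, the reference list, and the helper functions `majority_concepts` and `coverage` are hypothetical toy values, not the actual data or tooling of the study.

```python
from collections import Counter

# Hypothetical toy data: each dataset maps its own item labels
# to a shared concept label (the "unified concepts").
datasets = {
    "dataset_a": {"perro": "DOG", "casa": "HOUSE", "árbol": "TREE"},
    "dataset_b": {"Hund": "DOG", "Haus": "HOUSE", "Schere": "SCISSORS"},
    "dataset_c": {"dog": "DOG", "tree": "TREE", "cup": "CUP"},
}

# Hypothetical stand-in for a classical basic vocabulary list.
basic_vocabulary = {"DOG", "TREE", "WATER", "HAND"}


def concept_sets(data):
    """Return the set of concepts covered by each dataset."""
    return {name: set(items.values()) for name, items in data.items()}


def majority_concepts(data):
    """Return concepts that occur in more than half of the datasets."""
    sets = concept_sets(data)
    counts = Counter(concept for s in sets.values() for concept in s)
    threshold = len(sets) / 2
    return {concept for concept, n in counts.items() if n > threshold}


def coverage(data, reference):
    """Return the share of reference concepts found in any dataset."""
    covered = set().union(*concept_sets(data).values())
    return len(covered & reference) / len(reference)


if __name__ == "__main__":
    print(sorted(majority_concepts(datasets)))
    print(coverage(datasets, basic_vocabulary))
```

Counting each concept once per dataset, rather than once per item, keeps a dataset with several items for the same concept from inflating the recurrence count.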