🤖 AI Summary
This work investigates whether bilingual visually grounded speech (VGS) models exhibit a weaker mutual exclusivity (ME) bias than monolingual models, particularly under cross-lingual ambiguity. To this end, we construct multilingual VGS models covering English, French, and Dutch, and combine embedding-space variance analysis with cross-lingual ablation experiments. We find that bilingual VGS models generally attenuate the ME bias, though exceptions exist. A key mechanism is that joint multilingual training compresses the variance of the visual embeddings for familiar data, thereby increasing confusion between novel and familiar concepts. Beyond empirical evidence of systematic ME bias reduction in bilingual VGS models, our study offers a computational explanation grounded in representational geometry. This advances understanding of the cognitive mechanisms and modeling principles underlying multilingual vision–speech alignment, and provides a new lens for analyzing language–vision–speech interactions.
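To make the variance analysis concrete, here is a minimal sketch (not the authors' code) of how one might measure the spread of a model's visual embeddings for a set of familiar-concept images. The `embed_image` method is a hypothetical interface, standing in for whatever image encoder the models expose; the comparison of a monolingual against a bilingual model in the usage comment is likewise illustrative.

```python
import numpy as np

def embedding_variance(model, images):
    """Mean per-dimension variance of a model's visual embeddings.

    A smaller value indicates more compressed visual representations,
    the pattern reported here for bilingual models on familiar data.
    `model.embed_image` is a hypothetical interface returning a 1-D vector.
    """
    embs = np.stack([model.embed_image(img) for img in images])  # (N, D)
    return embs.var(axis=0).mean()

# Hypothetical usage: compare a monolingual and a bilingual model
# on the same images of familiar concepts.
# var_mono = embedding_variance(model_en, familiar_images)
# var_bi   = embedding_variance(model_en_fr, familiar_images)
# print(f"monolingual: {var_mono:.4f}  bilingual: {var_bi:.4f}")
```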
📝 Abstract
Mutual exclusivity (ME) is a strategy whereby a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech paired with images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs
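As a rough illustration of how an ME bias can be probed in a VGS model, the sketch below scores trials in which a spoken novel word must be matched to either a novel or a familiar image; a preference rate above 0.5 indicates an ME bias. The `embed_audio`/`embed_image` interface and the trial format are assumptions for illustration, not the paper's API; the actual evaluation code is in the linked repository.

```python
import numpy as np

def me_bias(model, trials):
    """Fraction of trials where the novel spoken word is matched to the
    novel image rather than the familiar one (> 0.5 suggests an ME bias).

    Each trial is a tuple (novel_word_audio, novel_image, familiar_image).
    `embed_audio` and `embed_image` are hypothetical methods returning
    comparable 1-D embedding vectors.
    """
    hits = 0
    for audio, novel_img, familiar_img in trials:
        a = model.embed_audio(audio)
        sim_novel = np.dot(a, model.embed_image(novel_img))
        sim_familiar = np.dot(a, model.embed_image(familiar_img))
        hits += int(sim_novel > sim_familiar)
    return hits / len(trials)
```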