Would you still call this Dax? Novel Visual References in VLMs and Humans

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how vision-language models map novel visual concepts to linguistic labels, particularly when those concepts conflict with pre-training knowledge, and compares this behavior with human generalization. To enable direct comparison between models and humans in concept learning and label generalization, the study introduces the NVRD dataset, comprising 90 synthetically generated novel visual concepts and their perturbed variants. The work combines large-scale generation of image perturbations, human psychophysical experiments, and evaluation of five multimodal models (three open-source and two closed-source). The analysis shows that models are strongly biased by prior knowledge when learning new concepts in-context, and that they overgeneralize significantly more than humans, extending labels to visual stimuli that human participants consistently reject.

📝 Abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.
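
The human-model comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the acceptance rates are made-up numbers, Spearman rank correlation stands in for whatever correlation measure the authors use, the 0.5 acceptance threshold is arbitrary, and the function names are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def overgeneralization_rate(
    human_accept: Sequence[float],
    model_accept: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff for "accepts the label"
) -> float:
    """Share of human-rejected stimuli for which the model still accepts the label."""
    rejected_by_humans = [
        m for h, m in zip(human_accept, model_accept) if h < threshold
    ]
    if not rejected_by_humans:
        return 0.0
    accepted = sum(1 for m in rejected_by_humans if m >= threshold)
    return accepted / len(rejected_by_humans)


# Hypothetical acceptance rates (fraction of "still the same concept" answers)
# at five increasing perturbation levels. These are NOT results from the paper.
human_rates = [0.95, 0.80, 0.55, 0.30, 0.10]
model_rates = [0.98, 0.95, 0.90, 0.70, 0.40]

sensitivity_correlation = spearman(human_rates, model_rates)
overgeneralization = overgeneralization_rate(human_rates, model_rates)
```

In this toy setup, a high rank correlation together with a non-zero overgeneralization rate would match the pattern the abstract reports: both acceptance curves fall as perturbation increases, yet the model keeps applying the label at levels where most humans have stopped.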

Problem

Research questions and friction points this paper is trying to address.

novel visual references

vision-language models

concept learning

visual generalization

prior knowledge conflict

Innovation

Methods, ideas, or system contributions that make the work stand out.

novel visual references

vision-language models

concept learning