Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

📅 2026-06-11

🤖 AI Summary

This work addresses the problem of learning grounded word meanings from infant-view video in which caregiver speech is sparse and only weakly synchronized with the visual stream, so it is ambiguous both when the named referent appears and where it sits in a cluttered frame. The authors propose BabyMind, an object-first inductive bias for child-view contrastive learning. BabyMind extracts candidate object embeddings through an offline mask-based region interface, tracks the candidates across a short utterance-centered window to build lightweight object files, and aligns each utterance with a bag of object files using a multiple-instance contrastive objective in a prototype space. Track-coherence and global-object agreement regularizers stabilize training and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by 2.6 percentage points over CVCL and shows consistent gains on in-vocabulary out-of-distribution benchmarks.

📝 Abstract

Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.
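
To make the idea of aligning an utterance with a bag of object files more concrete, the following is a minimal sketch of a generic multiple-instance contrastive (InfoNCE-style) loss. It is not the authors' implementation. The function names, the cosine similarity, the log-sum-exp pooling over bag instances, and the temperature value are all assumptions chosen for illustration, and the prototype-space projection and the two regularizers are omitted because the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bag_score(utterance, bag, temperature):
    """Score an utterance against a bag of object-file embeddings.

    Assumption: soft max-pooling (log-sum-exp) over the instances, so the
    best-matching object file dominates without a hard argmax.
    """
    return logsumexp([cosine(utterance, obj) / temperature for obj in bag])


def mil_contrastive_loss(utterances, bags, temperature=0.1):
    """Mean utterance-to-bag InfoNCE loss.

    utterances[i] is assumed to be paired with bags[i]; every other bag in
    the batch serves as a negative for utterance i.
    """
    losses = []
    for i, utt in enumerate(utterances):
        scores = [bag_score(utt, bag, temperature) for bag in bags]
        # Negative log-softmax of the paired bag among all bags.
        losses.append(logsumexp(scores) - scores[i])
    return sum(losses) / len(losses)


# Toy data (hypothetical): each bag holds one referent and one distractor.
utterances = [[1.0, 0.0], [0.0, 1.0]]
bags = [
    [[1.0, 0.0], [-0.6, -0.8]],
    [[0.0, 1.0], [-0.8, -0.6]],
]
aligned_loss = mil_contrastive_loss(utterances, bags)
shuffled_loss = mil_contrastive_loss(utterances, bags[::-1])
```

In this toy setup the loss for correctly paired bags is lower than the loss for swapped bags, even though each bag contains a distractor. That is the property the bag-level objective is meant to provide when the named object is only one of several candidates in the window.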

Problem

Research questions and friction points this paper is trying to address.

grounded language learning

infant-view video

word-referent ambiguity

sparse supervision

object grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

object-first bias

contrastive learning

object tracking