Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

📅 2026-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning grounded word meanings from infant-view video in which caregiver speech is sparse and only weakly synchronized with the visual stream, so object–speech alignment is ambiguous in both time (when the referent appears) and space (where it is in the frame). The authors propose BabyMind, a framework that builds in an object-first inductive bias: it extracts candidate objects with offline mask-based region proposals, tracks them within a short utterance-centered window to construct lightweight object files, and aligns each utterance with its bag of object files through multiple-instance contrastive learning in a prototype space. Track-coherence and global-object agreement regularizers then transfer this object structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by 2.6 percentage points over CVCL and shows consistent gains on in-vocabulary out-of-distribution benchmarks.
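As a rough illustration of the object-file step, the sketch below links per-frame candidates across a short window by greedy overlap matching. It is a stand-in under stated assumptions, not BabyMind's tracker: the summary does not say which tracker is used, the candidates here are axis-aligned boxes rather than masks, and the names `iou`, `link_object_files`, and `iou_threshold` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_object_files(frames, iou_threshold=0.5):
    """Greedily link candidates across consecutive frames of one window.

    frames: list of per-frame candidate lists, each candidate a box tuple.
    Returns a list of object files; each file is a list of (frame_index, box).
    Assumption: a file is only extended from the immediately preceding frame,
    so an object that disappears for one frame starts a new file.
    """
    files = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for obj_file in files:
            last_t, last_box = obj_file[-1]
            if last_t != t - 1 or not unmatched:
                continue
            # Pick the candidate that overlaps most with the file's last box.
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_threshold:
                obj_file.append((t, best))
                unmatched.remove(best)
        # Every candidate that extended no existing file opens a new one.
        files.extend([(t, b)] for b in unmatched)
    return files


window = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11)],
    [(2, 2, 12, 12), (100, 100, 110, 110)],
]
object_files = link_object_files(window)
print([len(f) for f in object_files])
```

In this toy window the slowly drifting box forms one three-frame file, while the two boxes seen only once each become single-frame files. A real pipeline would presumably attach an embedding to each observation and pool it per file.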
📝 Abstract
Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.
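To make the multiple-instance idea concrete, here is a minimal sketch of a bag-level contrastive loss in the MIL-NCE style: each utterance is scored against a whole bag of object-file embeddings, so the loss never has to decide which single object is the referent. This is an illustration under assumptions, not the paper's objective: the prototype-space projection and the two regularizers are omitted, and `bag_score`, `mil_contrastive_loss`, and `temperature` are hypothetical names. Bags are assumed to be non-empty and embeddings non-zero.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bag_score(utterance, bag, temperature):
    """Soft-max pooled similarity between one utterance and a bag of object files."""
    u = normalize(utterance)
    return logsumexp([dot(u, normalize(o)) / temperature for o in bag])


def mil_contrastive_loss(utterances, bags, temperature=0.1):
    """InfoNCE over bags: utterance i should prefer bag i over all other bags.

    utterances: list of embedding vectors.
    bags: list of bags, where bags[i] holds the object-file embeddings
          from the window around utterances[i].
    """
    losses = []
    for i, utt in enumerate(utterances):
        scores = [bag_score(utt, bag, temperature) for bag in bags]
        # Negative log-softmax of the matching bag's score.
        losses.append(logsumexp(scores) - scores[i])
    return sum(losses) / len(losses)


utterances = [[1.0, 0.0], [0.0, 1.0]]
distractor = [0.7, 0.7]
aligned_bags = [[[1.0, 0.0], distractor], [[0.0, 1.0], distractor]]
swapped_bags = [aligned_bags[1], aligned_bags[0]]

loss_aligned = mil_contrastive_loss(utterances, aligned_bags)
loss_swapped = mil_contrastive_loss(utterances, swapped_bags)
print(loss_aligned < loss_swapped)
```

Pooling with log-sum-exp lets a bag score highly when any one of its object files matches the utterance, which is what makes distractor objects in the same window tolerable.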
Problem

Research questions and friction points this paper is trying to address.

grounded language learning
infant-view video
word-referent ambiguity
sparse supervision
object grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

object-first bias
contrastive learning
object tracking
multi-instance learning
language grounding
Sathira Silva
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Abrham Kahsay Gebreselasie
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Muhammad Umer Sheikh
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Kartik Kuckreja
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Daniel Harari
Research Associate, Weizmann Institute of Science
Computer vision, Deep and Machine learning, Artificial intelligence, Scene understanding, Human
Muhammad Haris Khan
Weizmann Institute of Science, Rehovot, Israel