Continual Visual and Verbal Learning Through a Child's Egocentric Input

📅 2026-06-03

🤖 AI Summary

This work asks whether vision-language joint learning can succeed from a single pass over egocentric, temporally continuous visual input that closely mirrors a child's real-world experience. The authors propose BabyCL, a continual multimodal learning framework that traverses the SAYCam dataset once, in chronological order, combining streaming visual representation learning with an image-text contrastive objective. Its key components are a multi-stage temporal segmentation of the stream and a dual replay buffer that manages visual and multimodal histories independently; the model is trained jointly with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark and substantially narrows the gap to an offline-training upper bound.
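To make the dual replay buffer idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. The class names (`ReplayBuffer`, `DualReplay`), the first-in-first-out eviction rule, and the choice to store every frame in the visual buffer but only captioned frames in the multimodal buffer are all assumptions made for illustration; the summary only states that the two histories are managed independently.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer. FIFO eviction is an assumption for illustration."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest item when full
        self.items = deque(maxlen=capacity)

    def add(self, item):
        self.items.append(item)

    def sample(self, k, rng):
        # Never request more items than are stored
        k = min(k, len(self.items))
        return rng.sample(list(self.items), k)


class DualReplay:
    """Two independent histories: frames only, and (frame, utterance) pairs."""

    def __init__(self, visual_capacity, multimodal_capacity, seed=0):
        self.visual = ReplayBuffer(visual_capacity)
        self.multimodal = ReplayBuffer(multimodal_capacity)
        self.rng = random.Random(seed)

    def observe(self, frame, utterance=None):
        # Hypothetical policy: every frame enters the visual history;
        # only frames paired with speech enter the multimodal history.
        self.visual.add(frame)
        if utterance is not None:
            self.multimodal.add((frame, utterance))

    def replay(self, k_visual, k_multimodal):
        # Draw replay samples from each history independently
        return (
            self.visual.sample(k_visual, self.rng),
            self.multimodal.sample(k_multimodal, self.rng),
        )


# Single chronological pass over a toy stream
stream = [("f0", None), ("f1", "ball"), ("f2", None), ("f3", "cat"), ("f4", "dog")]
buffers = DualReplay(visual_capacity=3, multimodal_capacity=2)
for frame, utterance in stream:
    buffers.observe(frame, utterance)
```

In a training loop, the samples returned by `replay` would be mixed with the current segment of the stream before computing the visual and image-text losses. Swapping the eviction rule (for example, to reservoir sampling) only requires changing `ReplayBuffer.add`.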

📝 Abstract

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
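The 4AFC (four-alternative forced-choice) evaluation mentioned above can be sketched as follows. This is a hedged illustration, not the benchmark's actual code: it assumes each trial supplies one word embedding, four candidate image embeddings, and the index of the target image, and that the model's choice is the image with the highest cosine similarity to the word. The function names and the toy vectors are hypothetical. Chance performance on such a task is 25%.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def four_afc_choice(word_vec, image_vecs):
    """Return the index of the candidate image most similar to the word."""
    if len(image_vecs) != 4:
        raise ValueError("a 4AFC trial needs exactly four candidate images")
    sims = [cosine(word_vec, v) for v in image_vecs]
    return max(range(4), key=sims.__getitem__)


def four_afc_accuracy(trials):
    """trials: iterable of (word_vec, image_vecs, target_index) tuples."""
    trials = list(trials)
    correct = sum(
        four_afc_choice(word, images) == target for word, images, target in trials
    )
    return correct / len(trials)


# Toy trials with 2-D embeddings (illustrative values only)
trials = [
    ([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.5, 0.5]], 1),
    ([0.0, 1.0], [[1.0, 0.0], [0.9, 0.1], [0.0, -1.0], [0.2, 0.1]], 0),
]
accuracy = four_afc_accuracy(trials)
```

In the first toy trial the model picks the target; in the second it prefers a foil, so accuracy over the two trials is one half.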

Problem

Research questions and friction points this paper is trying to address.

continual learning

egocentric vision

word-referent learning

multimodal learning

child development

Innovation

Methods, ideas, or system contributions that make the work stand out.

continual learning

multimodal learning

egocentric vision