🤖 AI Summary
This study investigates whether foundation models can learn implicit, individual life traits end to end from long-duration first-person video. We collected 54 hours of real-world egocentric footage via wearable cameras and constructed hierarchical summaries, spanning minute-, hour-, and day-level granularities, to supervise fine-tuning of GPT-4o and GPT-4o-mini for personalized modeling. To our knowledge, this is the first work to combine long-term first-person video with hierarchical summarization for supervised fine-tuning to infer latent attributes, including geographic location, occupation, handedness, and pet ownership. Experiments show that GPT-4o accurately inferred the author's city, status as a Carnegie Mellon University PhD student, right-handedness, and cat ownership; however, both models exhibited name hallucination, exposing reliability bottlenecks in fine-grained social-information modeling. Our work delineates the capability boundaries and hallucination modes of foundation models learning personalized information from egocentric visual data.
📝 Abstract
Motivated by recent improvements in generative AI and wearable camera devices (e.g., smart glasses and AI-enabled pins), I investigate the ability of foundation models to learn about a wearer's personal life from first-person camera data. To test this, I wore a camera headset for 54 hours over the course of a week, generated summaries of the footage at several timescales (minute-long, hour-long, and day-long), and fine-tuned both GPT-4o and GPT-4o-mini on the resulting summary hierarchy. By querying the fine-tuned models, I can probe what they learned about me. The results are mixed: both models learned basic information about me (e.g., approximate age and gender), and GPT-4o correctly deduced that I live in Pittsburgh, am a PhD student at CMU, am right-handed, and have a pet cat. However, both models also hallucinated, inventing names for the individuals who appear in the video footage of my life.
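The summary hierarchy and fine-tuning setup described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the grouping ratios, prompt wording, and the helper names `build_hierarchy` and `to_finetune_examples` are all assumptions, and real hour/day summaries would be produced by a summarization model rather than by concatenating the lower level.

```python
import json

def build_hierarchy(minute_summaries, minutes_per_hour=60, hours_per_day=24):
    """Group minute-level summaries into hour- and day-level buckets.

    Here each higher level is formed by joining the level below; in
    practice, a model would re-summarize each bucket (an assumption).
    """
    hour_buckets = [minute_summaries[i:i + minutes_per_hour]
                    for i in range(0, len(minute_summaries), minutes_per_hour)]
    hour_summaries = [" ".join(bucket) for bucket in hour_buckets]
    day_buckets = [hour_summaries[i:i + hours_per_day]
                   for i in range(0, len(hour_summaries), hours_per_day)]
    day_summaries = [" ".join(bucket) for bucket in day_buckets]
    return hour_summaries, day_summaries

def to_finetune_examples(summaries, level):
    """Format summaries as chat-style fine-tuning records (one JSON row each).

    The prompt text is a hypothetical stand-in for whatever instruction
    the summaries were actually paired with during fine-tuning.
    """
    return [json.dumps({"messages": [
        {"role": "user", "content": f"Describe one {level} of the wearer's life."},
        {"role": "assistant", "content": s},
    ]}) for s in summaries]
```

For example, 120 minute-level summaries would collapse into 2 hour-level summaries and 1 (partial) day-level summary, with each level contributing rows to the fine-tuning dataset.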