VISTA: A Generative Egocentric Video Framework for Daily Assistance

📅 2026-05-11

🤖 AI Summary
This work addresses the scarcity of high-fidelity, scalable data for everyday assistive scenarios, which hinders the training and evaluation of proactive AI assistants. Real-world data is difficult, costly, or unsafe to collect, and physics-based simulators offer limited visual realism. To bridge this gap, the authors propose VISTA, a generative framework that synthesizes logically coherent and diverse first-person intervention videos through a five-step script generation pipeline driven by causal reverse reasoning. The framework covers both reactive and proactive assistance, with proactive assistance further split into explicit and implicit modes. Users can customize and refine scenarios to build video benchmarks for daily tasks. The generated data offers a scalable, controllable alternative to real-world recordings for training and evaluating AI assistants, including their ability to intervene before users realize they need help.
📝 Abstract
Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.
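The three intervention modes described in the abstract can be read as a small decision rule over two questions: did the user ask the agent directly, and is the user aware of needing help? The Python sketch below is only an illustration of that taxonomy. The names `InterventionMode`, `classify_mode`, `user_asked_agent`, and `user_aware_of_need` are hypothetical and do not come from VISTA.

```python
from enum import Enum


class InterventionMode(Enum):
    """Hypothetical labels for the three modes named in the abstract."""
    REACTIVE = "reactive"
    PROACTIVE_EXPLICIT = "proactive_explicit"
    PROACTIVE_IMPLICIT = "proactive_implicit"


def classify_mode(user_asked_agent: bool, user_aware_of_need: bool) -> InterventionMode:
    """Map the two conditions from the abstract onto a mode (illustrative only)."""
    if user_asked_agent:
        # Reactive: the user explicitly asks the agent for help.
        return InterventionMode.REACTIVE
    if user_aware_of_need:
        # Explicit proactive: the user knows help is needed
        # but does not directly address the agent.
        return InterventionMode.PROACTIVE_EXPLICIT
    # Implicit proactive: the agent intervenes before the user
    # realizes that help is needed.
    return InterventionMode.PROACTIVE_IMPLICIT
```

For example, a scenario in which the user has not noticed a hazard and says nothing to the agent would fall under `classify_mode(False, False)`, the implicit proactive case.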
Problem

Research questions and friction points this paper is trying to address.

egocentric video
AI agent assistance
video synthesis
proactive intervention
visual fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric video synthesis
proactive assistance
causal reverse reasoning
autonomous agent intervention
synthetic training data