V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Event vision suffers from a dual bottleneck: real-world event data are scarce, and synthetic event data carry heavy storage and I/O overhead, which together hinder scalable model training and generalization. To address this, we propose the Video-to-Voxel (V2V) paradigm, a direct, end-to-end voxelization framework that bypasses conventional event stream generation entirely. Leveraging timestamp-aligned video frames, V2V constructs a parameterized dynamic voxel grid, enabling for the first time voxelization without any intermediate event stream. This yields a 150× reduction in storage, supports real-time motion modeling and stochastic augmentation, and enables the largest event vision training set to date (52 hours). Using a lightweight encoder-decoder architecture, our method achieves state-of-the-art performance on video reconstruction and optical flow estimation. With training data exceeding existing benchmarks by an order of magnitude, we provide the first empirical demonstration that large-scale pretraining substantially enhances generalization in event vision.

📝 Abstract
Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150 times reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours, an order of magnitude larger than existing event datasets, yielding substantial improvements.
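To make the core idea concrete, here is a minimal sketch of direct video-to-voxel conversion: consecutive frames are compared in log-intensity space and the implied contrast changes are accumulated straight into a temporal voxel grid, with no intermediate event stream ever materialized. The function name frames_to_voxel, the contrast threshold, and the binning scheme are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def frames_to_voxel(frames, timestamps, num_bins=5, threshold=0.2, eps=1e-3):
    """Accumulate log-intensity changes between consecutive frames directly
    into a (num_bins, H, W) voxel grid, skipping explicit event generation.
    Illustrative sketch only; V2V's actual parameterization may differ."""
    H, W = frames[0].shape
    voxel = np.zeros((num_bins, H, W), dtype=np.float32)
    t0, t1 = timestamps[0], timestamps[-1]
    log_prev = np.log(frames[0].astype(np.float32) + eps)
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_curr = np.log(frame.astype(np.float32) + eps)
        # Signed number of contrast-threshold crossings implied by this frame
        # pair, i.e. the net polarity an event camera would report per pixel.
        delta = (log_curr - log_prev) / threshold
        # Deposit the change into the temporal bin covering this timestamp.
        b = min(int((t - t0) / (t1 - t0 + 1e-9) * num_bins), num_bins - 1)
        voxel[b] += delta
        log_prev = log_curr
    return voxel
```

A grayscale clip passed through such a function yields one compact tensor per temporal window, rather than millions of individual events that would otherwise have to be stored on disk and re-binned at training time.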
Problem

Research questions and friction points this paper is trying to address.

Reducing storage needs for event-based vision training data
Overcoming scarcity of real event-based camera datasets
Enabling large-scale training for event vision models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts video frames to voxel grids
Reduces storage needs by 150 times
Supports on-the-fly parameter randomization
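A hedged sketch of what on-the-fly parameter randomization can look like, reusing the frames_to_voxel helper from the sketch above: the contrast threshold, number of temporal bins, and noise level are resampled per training example instead of being baked into stored event files. The sampling ranges below are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def randomized_voxel(frames, timestamps, rng=None):
    """Sample voxelization parameters per example so every epoch sees a
    slightly different 'virtual event camera'. Ranges are assumptions."""
    rng = rng or np.random.default_rng()
    threshold = rng.uniform(0.1, 0.5)        # per-sample contrast threshold
    num_bins = int(rng.integers(3, 8))       # per-sample temporal resolution
    voxel = frames_to_voxel(frames, timestamps,
                            num_bins=num_bins, threshold=threshold)
    # Optional noise injection, loosely mimicking sensor noise in real events.
    voxel += rng.normal(0.0, 0.05, size=voxel.shape).astype(np.float32)
    return voxel
```

Because no event files are ever written to disk, resampling these parameters costs only recomputation, which is what allows stored event streams to be traded for cheap, randomized voxelization at training time.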
👥 Authors
Hanyue Lou
State Key Lab of Multimedia Info. Processing, School of Computer Science, Peking University; National Eng. Research Ctr. of Visual Technology, School of Computer Science, Peking University
Jinxiu Liang
National Institute of Informatics
Computer Vision, Computational Photography, Machine Learning
Minggui Teng
State Key Lab of Multimedia Info. Processing, School of Computer Science, Peking University; National Eng. Research Ctr. of Visual Technology, School of Computer Science, Peking University
Yi Wang
Shanghai Innovation Institute; Shanghai AI Laboratory
Boxin Shi
Peking University
Computer Vision, Computational Photography