V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Event vision suffers from a dual bottleneck: real-world event data are scarce, and synthetic event data carry heavy storage and I/O overhead, which together hinder scalable model training and generalization. To address this, we propose the Video-to-Voxel (V2V) paradigm, a direct, end-to-end voxelization framework that bypasses conventional event stream generation entirely. Leveraging timestamp-aligned video frames, V2V constructs a parameterized dynamic voxel grid, enabling for the first time voxelization without any intermediate event stream. This yields a 150× reduction in storage, supports real-time motion modeling and stochastic augmentation, and enables the largest event vision training set to date (52 hours). Using a lightweight encoder-decoder architecture, our method achieves state-of-the-art performance on video reconstruction and optical flow estimation. With training data exceeding existing benchmarks by an order of magnitude, we provide the first empirical demonstration that large-scale pretraining substantially enhances generalization in event vision.

📝 Abstract
Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150 times reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours, an order of magnitude larger than existing event datasets, yielding substantial improvements.
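To make the core idea concrete, here is a minimal sketch of direct video-to-voxel conversion: consecutive frames are compared in log-intensity space and the implied contrast changes are accumulated straight into a temporal voxel grid, with no intermediate event stream ever materialized. The function name frames_to_voxel, the contrast threshold, and the binning scheme are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def frames_to_voxel(frames, timestamps, num_bins=5, threshold=0.2, eps=1e-3):
    """Accumulate log-intensity changes between consecutive frames directly
    into a (num_bins, H, W) voxel grid, skipping explicit event generation.
    Illustrative sketch only; V2V's actual parameterization may differ."""
    H, W = frames[0].shape
    voxel = np.zeros((num_bins, H, W), dtype=np.float32)
    t0, t1 = timestamps[0], timestamps[-1]
    log_prev = np.log(frames[0].astype(np.float32) + eps)
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_curr = np.log(frame.astype(np.float32) + eps)
        # Signed number of contrast-threshold crossings implied by this frame
        # pair, i.e. the net polarity an event camera would report per pixel.
        delta = (log_curr - log_prev) / threshold
        # Deposit the change into the temporal bin covering this timestamp.
        b = min(int((t - t0) / (t1 - t0 + 1e-9) * num_bins), num_bins - 1)
        voxel[b] += delta
        log_prev = log_curr
    return voxel
```

A grayscale clip passed through such a function yields one compact tensor per temporal window, rather than millions of individual events that would otherwise have to be stored on disk and re-binned at training time.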
Problem

Research questions and friction points this paper is trying to address.

Reducing storage needs for event-based vision training data
Overcoming scarcity of real event-based camera datasets
Enabling large-scale training for event vision models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Converts video frames to voxel grids
Reduces storage needs by 150 times
Supports on-the-fly parameter randomization
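A hedged sketch of what on-the-fly parameter randomization can look like, reusing the frames_to_voxel helper from the sketch above: the contrast threshold, number of temporal bins, and noise level are resampled per training example instead of being baked into stored event files. The sampling ranges below are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def randomized_voxel(frames, timestamps, rng=None):
    """Sample voxelization parameters per example so every epoch sees a
    slightly different 'virtual event camera'. Ranges are assumptions."""
    rng = rng or np.random.default_rng()
    threshold = rng.uniform(0.1, 0.5)        # per-sample contrast threshold
    num_bins = int(rng.integers(3, 8))       # per-sample temporal resolution
    voxel = frames_to_voxel(frames, timestamps,
                            num_bins=num_bins, threshold=threshold)
    # Optional noise injection, loosely mimicking sensor noise in real events.
    voxel += rng.normal(0.0, 0.05, size=voxel.shape).astype(np.float32)
    return voxel
```

Because no event files are ever written to disk, resampling these parameters costs only recomputation, which is what allows stored event streams to be traded for cheap, randomized voxelization at training time.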
👥 Authors
Hanyue Lou
State Key Lab of Multimedia Info. Processing, School of Computer Science, Peking University; National Eng. Research Ctr. of Visual Technology, School of Computer Science, Peking University
Jinxiu Liang
National Institute of Informatics
Computer Vision, Computational Photography, Machine Learning
Minggui Teng
State Key Lab of Multimedia Info. Processing, School of Computer Science, Peking University; National Eng. Research Ctr. of Visual Technology, School of Computer Science, Peking University
Yi Wang
Shanghai Innovation Institute; Shanghai AI Laboratory
Boxin Shi
Peking University
Computer Vision, Computational Photography