🤖 AI Summary
Event cameras record only brightness changes and carry no absolute intensity information, which makes color video reconstruction a highly ill-posed problem. To address this, we propose EvDiff, a single-step event-based diffusion model trained without paired event-image data. Our method introduces a temporally consistent EvEncoder to encode sparse, asynchronous event streams and establishes a surrogate training framework that leverages large-scale natural image priors to guide the diffusion process. Combining event-driven modeling with diffusion-based generation enables high-fidelity, photorealistic color video reconstruction in a single sampling step. Experiments on real-world datasets demonstrate that our approach significantly outperforms state-of-the-art methods across both pixel-level (PSNR) and perceptual (LPIPS) metrics, as well as in user studies. To our knowledge, this is the first method to achieve efficient and robust reconstruction of high-quality color video directly from monochrome event streams.
📝 Abstract
As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events, offering high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, the model performs only a single forward diffusion step and is equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets and scale to higher capacity. The proposed EvDiff is capable of generating high-quality color videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.
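
The ambiguity of absolute brightness mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used idealized event-generation model, in which a pixel fires an event whenever its log-intensity drifts from a per-pixel reference by at least a contrast threshold. That model, the helper names `simulate_events` and `integrate_events`, and the threshold value are illustrative assumptions, not part of EvDiff.

```python
def simulate_events(log_frames, threshold=0.25):
    """Idealized event simulator for a 1-D row of pixels (assumed model).

    log_frames: list of frames, each a list of per-pixel log-intensities.
    Returns a list of (frame_index, pixel_index, polarity) tuples.
    """
    ref = list(log_frames[0])  # per-pixel reference log-intensity
    events = []
    for t, frame in enumerate(log_frames[1:], start=1):
        for x, value in enumerate(frame):
            # Fire events until the reference is within one threshold of the value.
            while abs(value - ref[x]) >= threshold:
                polarity = 1 if value > ref[x] else -1
                ref[x] += polarity * threshold
                events.append((t, x, polarity))
    return events


def integrate_events(events, num_pixels, threshold=0.25):
    """Sum event polarities per pixel: recovers only the *change* in log-intensity."""
    change = [0.0] * num_pixels
    for _, x, polarity in events:
        change[x] += polarity * threshold
    return change


# Two scenes that differ only by a constant log-intensity offset
# (equivalently, a global scaling of linear brightness).
scene_a = [
    [0.0, 0.5, 1.0],
    [0.625, 0.5, 0.375],
    [0.625, 1.125, 0.375],
]
scene_b = [[value + 2.0 for value in frame] for frame in scene_a]

events_a = simulate_events(scene_a)
events_b = simulate_events(scene_b)

# The two event streams are identical, so the offset cannot be recovered from them.
print(events_a == events_b)
print(integrate_events(events_a, num_pixels=3))
```

Under this assumed model, both scenes produce the same event stream, and integrating the events yields only the brightness change relative to an unknown starting frame. This is one way to see why a reconstruction method needs an external prior, such as one learned from large-scale image data, to settle on plausible absolute intensities and colors.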