AI Summary
Wide-field cameras (e.g., fisheye, omnidirectional) exhibit a geometric mismatch with planar CNNs built on the pinhole model: 2D grids fail to capture spherical adjacency, and models are sensitive to global rotations. While spectral spherical CNNs achieve rotation equivariance, their reliance on costly spherical harmonic transforms limits resolution and efficiency. This paper proposes the Unified Spherical Frontend (USF), which maps images from arbitrary wide-field cameras onto the unit sphere via ray directions and performs spherical resampling, convolution, and pooling directly in the spatial domain. Its geodesic-distance-based spherical convolution kernel for the first time enables configurable rotation equivariance without spherical harmonics. USF provides a lens-agnostic representation and decouples projection from sampling. Experiments show that USF remains efficient at high resolution across classification, detection, and segmentation tasks; suffers less than 1% performance degradation under random rotations; achieves zero-shot cross-lens generalization; and exhibits strong cross-dataset robustness.
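The core idea of mapping a wide-field image onto the unit sphere can be illustrated with a small sketch. The function name and the choice of an equidistant fisheye model below are illustrative assumptions, not the paper's actual calibration pipeline; USF supports any calibrated camera model.

```python
import numpy as np

def fisheye_pixel_to_ray(u, v, cx, cy, f):
    """Map a pixel (u, v) of an equidistant fisheye camera to a unit-sphere ray.

    Equidistant model (an illustrative assumption): r = f * theta, where r is
    the radial distance of the pixel from the principal point (cx, cy) and
    theta is the angle between the ray and the optical axis.
    """
    du, dv = u - cx, v - cy
    r = np.hypot(du, dv)
    theta = r / f                    # polar angle from the optical axis
    phi = np.arctan2(dv, du)         # azimuth around the optical axis
    # Unit ray direction on the sphere (optical axis along +z).
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])
```

Once every pixel is associated with such a ray, spherical resampling, convolution, and pooling can operate on sphere points directly, independent of the originating lens.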
Abstract
Modern perception increasingly relies on fisheye, panoramic, and other wide field-of-view (FoV) cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where image-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Frequency-domain spherical CNNs partially address this mismatch but require costly spherical harmonic transforms that constrain resolution and efficiency. We introduce the Unified Spherical Frontend (USF), a lens-agnostic framework that transforms images from any calibrated camera into a unit-sphere representation via ray-direction correspondences, and performs spherical resampling, convolution, and pooling directly in the spatial domain. USF is modular: projection, location sampling, interpolation, and resolution control are fully decoupled. Its distance-only spherical kernels offer configurable rotation equivariance (mirroring translation equivariance in planar CNNs) while avoiding harmonic transforms entirely. We compare standard planar backbones with their spherical counterparts across classification, detection, and segmentation tasks on synthetic (Spherical MNIST) and real-world datasets (PANDORA, Stanford 2D-3D-S), and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF processes high-resolution spherical imagery efficiently, incurs less than a 1% performance drop under random test-time rotations even without rotational augmentation, and enables zero-shot generalization from one lens type to unseen wide-FoV lenses with minimal performance degradation.
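The "distance-only" kernel property behind the rotation equivariance claim can be sketched as follows. This is a minimal illustration, assuming a Gaussian profile over geodesic distance; the function name and profile are hypothetical, not the paper's exact kernel parameterization.

```python
import numpy as np

def geodesic_kernel_weights(center, neighbors, sigma):
    """Distance-only spherical kernel weights (illustrative sketch).

    Weights depend solely on the geodesic (great-circle) distance between
    the kernel center and each neighbor on the unit sphere. Rotating the
    center and all neighbors by the same rotation leaves every pairwise
    geodesic distance unchanged, so the weights are unchanged too -- the
    spherical analogue of translation equivariance on planar grids.
    """
    cos_d = np.clip(neighbors @ center, -1.0, 1.0)  # guard arccos domain
    d = np.arccos(cos_d)                            # arc length on unit sphere
    w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))      # assumed Gaussian profile
    return w / w.sum()                              # normalize to sum to 1
```

Because the weights are a function of geodesic distance alone, applying one global rotation to the whole sphere permutes nothing and rescales nothing: the convolution commutes with rotation up to the resampling pattern, without any spherical harmonic transform.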