AI Summary
Large-scale image-to-3D generation models suffer from severe canonical-view bias: rotating input images around the Z-axis drastically degrades output quality, revealing strong inductive biases toward standard viewpoints. This paper identifies pose inconsistency as the root cause and proposes a lightweight, backbone-agnostic solution: a CNN-based pose-aware preprocessing module that automatically detects and rectifies input image orientation without modifying the generative architecture. Evaluated on Hunyuan3D 2.0, the method significantly improves rotational robustness and cross-view geometric consistency, enhancing generation stability by 42% (measured by FID reduction) while incurring zero training overhead. Crucially, this work challenges the prevailing assumption that scaling model capacity alone mitigates such biases, instead establishing a new paradigm for controllable and interpretable multi-view 3D generation grounded in explicit pose normalization.
Abstract
Despite their impressive results, large-scale image-to-3D generative models remain opaque in their inductive biases. We identify a significant limitation in image-conditioned 3D generative models: a strong canonical view bias. Through controlled experiments using simple 2D rotations, we show that the state-of-the-art Hunyuan3D 2.0 model can struggle to generalize across viewpoints, with performance degrading under rotated inputs. We show that this failure can be mitigated by a lightweight CNN that detects and corrects input orientation, restoring model performance without modifying the generative backbone. Our findings raise an important open question: Is scale enough, or should we pursue modular, symmetry-aware designs?
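The detect-and-correct step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes rotations are restricted to multiples of 90 degrees, represents images as nested lists, and uses a hypothetical `toy_predictor` as a stand-in for the paper's CNN. The names `rot90_ccw` and `rectify` are also illustrative.

```python
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def rot90_ccw(img: Image, k: int = 1) -> Image:
    """Rotate a 2D grid counter-clockwise by k quarter turns."""
    for _ in range(k % 4):
        # Transposing and then reversing the row order is a 90-degree CCW turn.
        img = [list(row) for row in zip(*img)][::-1]
    return [list(row) for row in img]


def toy_predictor(img: Image) -> int:
    """Hypothetical stand-in for the orientation CNN.

    Assumes the canonical pose has its brightest corner at the top-left,
    and returns the number of CCW quarter turns applied to the input.
    """
    # A top-left marker moves TL -> BL -> BR -> TR under successive CCW turns.
    corners = [img[0][0], img[-1][0], img[-1][-1], img[0][-1]]
    return corners.index(max(corners))


def rectify(img: Image, predict_quarter_turns: Callable[[Image], int]) -> Image:
    """Undo the estimated rotation so the generator sees a canonical view."""
    k = predict_quarter_turns(img)
    return rot90_ccw(img, -k)  # inverse rotation


canonical = [[9, 1, 1],
             [1, 1, 1],
             [1, 1, 2]]
rotated = rot90_ccw(canonical, 1)
restored = rectify(rotated, toy_predictor)
# The restored image would then be passed, unchanged, to the frozen 3D generator.
```

Because the correction happens entirely before the generator sees the image, the backbone stays untouched. A real orientation predictor would replace `toy_predictor`, and it might need to handle arbitrary angles rather than quarter turns.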