AI Summary
To address the unclear co-design of convolution and attention mechanisms in 3D point cloud modeling, this paper proposes a stage-wise hybrid architecture: lightweight depthwise separable convolutions extract local geometric features in early high-resolution layers, while lightweight attention modules model long-range semantic context in deeper low-resolution layers. The paper first uncovers the complementary roles of the two mechanisms in point cloud processing, then introduces PointROPE, a training-free, structure-aware 3D positional encoding that explicitly preserves spatial relationships. Experiments demonstrate that the method reduces parameter count by 3.6×, doubles inference speed, and halves GPU memory consumption, while matching or surpassing Point Transformer V3 on mainstream benchmarks.
Abstract
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high resolution in early layers, where attention is expensive without bringing any benefit; attention captures high-level semantics and context more efficiently in low-resolution, deep layers. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention in deeper layers. To avoid losing spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, yet matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
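To make the idea of a training-free positional encoding concrete, the sketch below applies a rotary-style encoding to continuous 3D coordinates: the feature channels are split into three groups, and each group is rotated by angles proportional to one coordinate axis, with no learned parameters. This is a generic illustration under our own assumptions (function names `rope_1d` and `point_rope_3d`, a frequency `base` of 100.0, and the even channel split are all ours), not the exact PointROPE formulation from the paper.

```python
import numpy as np

def rope_1d(feat, coord, base=100.0):
    """Rotate channel pairs of `feat` by angles proportional to a 1-D coordinate.

    feat:  (N, D) features with D even; coord: (N,) continuous positions.
    """
    d = feat.shape[1] // 2
    # Geometric ladder of rotation frequencies, as in standard rotary encodings.
    freqs = base ** (-np.arange(d) / d)           # (d,)
    ang = coord[:, None] * freqs[None, :]         # (N, d)
    cos, sin = np.cos(ang), np.sin(ang)
    f1, f2 = feat[:, :d], feat[:, d:]
    # 2-D rotation applied to each (f1, f2) channel pair; norms are preserved.
    return np.concatenate([f1 * cos - f2 * sin,
                           f1 * sin + f2 * cos], axis=1)

def point_rope_3d(feat, xyz):
    """Apply rotary encoding per spatial axis on channel thirds (training-free).

    feat: (N, D) with D divisible by 6; xyz: (N, 3) point coordinates.
    """
    parts = np.split(feat, 3, axis=1)
    return np.concatenate(
        [rope_1d(p, xyz[:, k]) for k, p in enumerate(parts)], axis=1)
```

Because the encoding is a pure rotation, it preserves feature norms, and attention scores between two encoded points depend only on their relative offset along each axis, which is what lets it carry spatial layout without any trainable parameters.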