LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition

📅 2025-04-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of multi-view feature fusion and model redundancy in 3D object recognition within human-robot collaborative environments, which are characterized by complex scenes and highly variable object geometries, this paper proposes a lightweight, efficient recognition framework tailored for service robots. The approach introduces three key innovations: (1) a Globally Entropy-based Embeddings Fusion (GEEF) mechanism that adaptively weights and integrates multi-view representations; (2) a hybrid convolutional-vision Transformer architecture, built from pre- and mid-level convolutional encoders together with local and global Transformer modules, that jointly models local geometric details and global semantic structure; and (3) native support for dual-modal RGB-D and point cloud inputs. Evaluated on the synthetic ModelNet40 dataset, the method achieves 95.6% accuracy in a four-view setup, surpassing existing state-of-the-art methods. On the real-world OmniObject3D dataset, it demonstrates consistently superior performance under five-fold cross-validation.

📝 Abstract
In human-centered environments such as restaurants, homes, and warehouses, robots often face challenges in accurately recognizing 3D objects. These challenges stem from the complexity and variability of these environments, including diverse object shapes. In this paper, we propose a novel Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to enhance 3D object recognition in robotic applications. Our approach leverages the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate multi-views efficiently. The LM-MCVT architecture incorporates pre- and mid-level convolutional encoders and local and global transformers to enhance feature extraction and recognition accuracy. We evaluate our method on the synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using a four-view setup, surpassing existing state-of-the-art methods. To further validate its effectiveness, we conduct 5-fold cross-validation on the real-world OmniObject3D dataset using the same configuration. Results consistently show superior performance, demonstrating the method's robustness in 3D object recognition across synthetic and real-world 3D data.
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D object recognition in robotic applications
Addressing complexity in human-centered environments
Improving accuracy with multi-modal multi-view fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight multi-modal multi-view transformer network
Globally Entropy-based Embeddings Fusion method
Pre- and mid-level convolutional encoders with transformers
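The entropy-based fusion idea in GEEF can be illustrated with a minimal sketch: views whose predictions are more confident (lower entropy) receive higher weight when the per-view embeddings are combined into one global descriptor. This is a hypothetical illustration of entropy-weighted fusion, not the paper's exact formulation; the function name `entropy_weighted_fusion` and the inverse-entropy weighting scheme are assumptions.

```python
import numpy as np

def entropy_weighted_fusion(view_logits, view_embeddings):
    """Fuse per-view embeddings with weights derived from prediction entropy.

    Sketch only: views with lower Shannon entropy (more confident class
    distributions) contribute more to the fused descriptor. Not the
    paper's exact GEEF method.
    """
    weights = []
    for logits in view_logits:
        probs = np.exp(logits - logits.max())   # stable softmax
        probs /= probs.sum()
        h = -np.sum(probs * np.log(probs + 1e-12))  # Shannon entropy
        weights.append(1.0 / (h + 1e-12))           # confident views weigh more
    weights = np.array(weights)
    weights /= weights.sum()                        # normalize to sum to 1
    # Weighted sum of the stacked view embeddings -> fused global descriptor
    return np.einsum('v,vd->d', weights, np.stack(view_embeddings))

# Example: four views (as in the paper's four-view setup),
# 8-dim embeddings, 40 classes (as in ModelNet40)
rng = np.random.default_rng(0)
logits = [rng.normal(size=40) for _ in range(4)]
embeds = [rng.normal(size=8) for _ in range(4)]
fused = entropy_weighted_fusion(logits, embeds)
print(fused.shape)  # (8,)
```

Any monotone mapping from entropy to weight would serve the same purpose; the inverse used here is just the simplest choice.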