SVL: Spike-based Vision-Language Pretraining for Efficient 3D Open-world Understanding

šŸ“… 2025-05-23
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Current spiking neural networks (SNNs) suffer from poor generalization, modality fragmentation, and heavy reliance on large foundation models in 3D open-world understanding, hindering multimodal question answering and zero-shot 3D classification. To address these limitations, particularly under energy-constrained settings, the authors propose the first SNN-based vision-language pretraining framework. The method introduces two core innovations: (1) Multi-scale Triple Alignment (MTA), enabling label-free contrastive learning across 3D point clouds, images, and text; and (2) Re-parameterizable Vision-Language Integration (Rep-VLI), eliminating dependence on large pretrained text encoders at inference. Experiments demonstrate 85.4% top-1 accuracy on zero-shot 3D classification, surpassing state-of-the-art artificial neural network (ANN) baselines, with average downstream performance gains exceeding 2%. Notably, it is the first SNN framework to support open-world 3D multimodal question answering while significantly reducing computational energy consumption.

šŸ“ Abstract
Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%), with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at https://github.com/bollossom/SVL.
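The MTA component described in the abstract follows the general recipe of CLIP-style contrastive alignment extended to three modalities: matched (point cloud, image, text) triplets are pulled together while mismatched pairs are pushed apart. A minimal NumPy sketch of such a triple-alignment objective, not the paper's implementation (the encoders, temperature, and pairing scheme are assumptions for illustration):

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Row i of `a` and row i of `b` form a positive pair; every other
    row is treated as a negative.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(a))

    def xent(l):
        # numerically stable cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average both retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

def triple_alignment_loss(z_pc, z_img, z_txt):
    """Sum of pairwise contrastive losses over the three modalities."""
    return (info_nce(z_pc, z_img)
            + info_nce(z_pc, z_txt)
            + info_nce(z_img, z_txt))
```

Correctly aligned triplets drive all three pairwise terms toward their lower bound, so the loss decreases as the three embedding spaces agree.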
Problem

Research questions and friction points this paper is trying to address.

Enhancing SNNs' 3D open-world understanding efficiency
Overcoming SNNs' performance gap versus ANNs
Enabling multimodal tasks in spike-based systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spike-based Vision-Language pretraining for 3D understanding
Multi-scale Triple Alignment for multimodal contrastive learning
Re-parameterizable Vision-Language Integration for lightweight inference
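Rep-VLI is only described at a high level on this page. One common way to avoid a text encoder at inference, consistent with the "re-parameterizable" framing, is to precompute the class-prompt embeddings once and fold them into a fixed linear head. A hypothetical sketch (the function names and details are illustrative, not the paper's exact method):

```python
import numpy as np

def build_zero_shot_head(text_embeddings):
    """Fold precomputed text (prompt) embeddings into a fixed linear
    classifier so the text encoder is not needed at inference time.

    Returns a weight matrix W such that logits = vision_features @ W.T.
    """
    return text_embeddings / np.linalg.norm(
        text_embeddings, axis=1, keepdims=True)

def classify(vision_features, W):
    """Zero-shot prediction: cosine similarity against the folded head."""
    v = vision_features / np.linalg.norm(
        vision_features, axis=1, keepdims=True)
    return (v @ W.T).argmax(axis=1)
```

The design point is that the expensive text model runs once offline; at deployment only the spiking vision branch and a single matrix multiply remain, which is what makes the scheme lightweight.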
Xuerui Qiu
Institute of Automation, Chinese Academy of Sciences
Representation Learning · 3D Computer Vision · Model Compression
Peixi Wu
University of Science and Technology of China
MultiModal · Neuromorphic · Object Detection
Yaozhi Wen
Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences
Shaowei Gu
Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences
Yuqi Pan
Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences
Xinhao Luo
Ph.D. student, Shanghai Jiao Tong University
High Performance Computing · ML Compiler
Bo Xu
Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences
Guoqi Li
Professor, Institute of Automation, Chinese Academy of Sciences; previously Tsinghua University
Brain-inspired Computing · Spiking Neural Networks · Brain-inspired Large Models · NeuroAI