🤖 AI Summary
Monocular depth estimation (MDE) on ultra-low-power microcontroller units (MCUs) suffers severe accuracy degradation due to domain shift, hindering practical deployment on IoT edge devices.
Method: We propose a field-adaptive online learning framework tailored for resource-constrained edge platforms. It fuses multi-modal sensor data and introduces a memory-driven sparse parameter update mechanism (requiring only 1.2 MB RAM), enabling cloud-free on-chip fine-tuning via pseudo-labeled depth supervision. A lightweight μPyD-Net architecture is deployed on the GAP9 RISC-V AI accelerator, supporting end-to-end backpropagation-based fine-tuning.
Contribution/Results: The system autonomously labels 3,000 samples in 17.8 minutes, reducing RMSE from 4.9 m to 0.6 m while consuming ~300 mW. To our knowledge, this is the first demonstration of feasible online adaptive MDE on MCUs under extreme resource constraints (sub-2 MB RAM, no external memory/cloud), establishing a new paradigm for intelligent edge perception.
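The self-labeling idea above, where the sparse 8 × 8 depth sensor supplies pseudo-labels for fine-tuning, can be sketched as a masked loss: the training error is computed only at the few pixels where the sensor produced a reading. This is a minimal illustration, not the paper's implementation; the function and variable names are hypothetical.

```python
import numpy as np

def masked_l2_loss(pred, sparse_depth):
    """L2 loss between the predicted depth map and sparse pseudo-labels,
    averaged over valid (non-zero) sensor readings only."""
    mask = sparse_depth > 0            # assume 0 marks pixels with no reading
    diff = (pred - sparse_depth)[mask]
    return float(np.mean(diff ** 2))

# Toy example: constant 2 m prediction, two valid sensor readings.
pred = np.full((8, 8), 2.0)
labels = np.zeros((8, 8))
labels[0, 0] = 1.0
labels[4, 4] = 3.0
print(masked_l2_loss(pred, labels))  # mean of (2-1)^2 and (2-3)^2 -> 1.0
```

Averaging only over valid readings keeps the loss meaningful even though the 8 × 8 sensor covers a tiny fraction of the image.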
📝 Abstract
Monocular depth estimation (MDE) plays a crucial role in enabling spatially aware applications on Ultra-Low-Power (ULP) Internet-of-Things (IoT) platforms. However, the limited parameter count of Deep Neural Networks designed for the MDE task on IoT nodes results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain-shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device that integrates a GreenWaves GAP9 MicroController Unit (MCU), an 80 mW monocular camera, and an 8 × 8 pixel depth sensor, consuming ≈300 mW in total. In normal operation, this setup feeds a tiny 107 k-parameter μPyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is activated alongside the camera only to collect pseudo-labels when the system is placed in a new environment. The fine-tuning task is then performed entirely on the MCU using the newly collected data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which reduces the fine-tuning memory to 1.2 MB, 2.2× less than a full update, while preserving accuracy (only 2% and 1.5% drops on the KITTI and NYUv2 datasets, respectively). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root-mean-squared error from 4.9 m to 0.6 m with only 3 k self-labeled samples collected in a real-life deployment scenario.
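A memory-driven sparse update, as described above, chooses which layers to fine-tune so that the training footprint fits a RAM budget instead of updating every parameter. The sketch below is a hypothetical greedy selector under assumed per-layer memory costs and importance scores; it is not the paper's actual selection algorithm, and all layer names and numbers are illustrative.

```python
def select_layers(layers, budget_bytes):
    """Greedily keep the highest-importance layers whose combined training
    memory fits the budget; all other layers stay frozen.
    Each entry in `layers` is (name, parameter_count, importance_score).
    Cost model (an assumption): 2 fp32 buffers per parameter
    (gradient + optimizer state), i.e. 8 bytes per parameter."""
    chosen, used = [], 0
    for name, n_params, _score in sorted(layers, key=lambda l: -l[2]):
        cost = 2 * 4 * n_params
        if used + cost <= budget_bytes:
            chosen.append(name)
            used += cost
    return chosen, used

# Illustrative layer table for a small encoder-decoder network.
layers = [
    ("enc1", 20_000, 0.2),
    ("enc2", 40_000, 0.5),
    ("dec1", 30_000, 0.9),
    ("dec2", 17_000, 0.8),
]

chosen, used = select_layers(layers, budget_bytes=500_000)
print(chosen, used)  # ['dec1', 'dec2'] 376000
```

Under the 500 kB toy budget only the two decoder layers are selected; the encoders stay frozen, which mirrors how a sparse update trades a small accuracy drop for a much smaller training memory footprint.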