M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the efficiency–accuracy trade-off in monocular visual spatial perception on edge devices, this paper proposes M2H, a lightweight multi-task learning framework that jointly performs semantic segmentation, depth estimation, edge detection, and surface normal prediction. Methodologically, M2H introduces a windowed cross-task attention module enabling structured feature interaction—preserving task-specific representations while enhancing inter-task prediction consistency. It employs a lightweight DINOv2-based ViT backbone integrated with windowed self-attention and a multi-task co-training strategy, substantially reducing computational overhead. Evaluated on NYUDv2, Hypersim, and Cityscapes, M2H outperforms state-of-the-art single- and multi-task methods across all tasks. It achieves real-time inference (>30 FPS) on commodity laptop GPUs and has been successfully deployed for dynamic 3D scene graph generation in real-world scenarios, demonstrating both practical efficiency and robust generalization.
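The summary describes the core mechanism as cross-task attention restricted to local windows, so tokens from one task's feature map attend only to same-window tokens from another task's map, keeping cost linear in image size. The paper's actual module is not reproduced here; the following is a minimal NumPy sketch of that idea, where `window_cross_task_attention`, the window size `ws`, and the single-head, unprojected Q/K/V are all illustrative assumptions, not M2H's implementation.

```python
import numpy as np

def window_partition(x, ws):
    # (H, W, C) feature map -> (num_windows, ws*ws, C) window tokens
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_task_attention(feat_q, feat_kv, ws=4):
    """Tokens from one task (queries) attend to another task's tokens
    (keys/values), restricted to matching local windows.
    Hypothetical single-head sketch without learned projections."""
    H, W, C = feat_q.shape
    q = window_partition(feat_q, ws)    # (nW, ws*ws, C)
    kv = window_partition(feat_kv, ws)  # (nW, ws*ws, C)
    # scaled dot-product attention within each window
    attn = softmax(q @ kv.transpose(0, 2, 1) / np.sqrt(C))
    out = attn @ kv                     # (nW, ws*ws, C)
    # reverse the window partition back to (H, W, C)
    nH, nW = H // ws, W // ws
    return out.reshape(nH, nW, ws, ws, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)
```

Because attention is confined to `ws × ws` windows, the attention matrix per window is only `(ws²)²` regardless of resolution, which is what makes this kind of exchange feasible on laptop-class GPUs; the real module would additionally use learned Q/K/V projections per task and run the exchange symmetrically in both directions.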

📝 Abstract
Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.
Problem

Research questions and friction points this paper is trying to address.

Develops efficient multi-task learning for monocular spatial perception
Enables cross-task feature exchange while preserving task-specific details
Optimizes the model for real-time deployment on computationally constrained edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Window-based cross-task attention for feature exchange
Lightweight ViT backbone optimized for real-time deployment
Multi-task framework combining segmentation, depth, edge, and surface normal estimation