Self-supervised structured object representation learning

📅 2025-08-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Self-supervised learning (SSL) excels at global image understanding but struggles to model multi-scale, structured object representations in complex scenes. To address this, we propose ProtoScaleβ€”a novel module that enables fine-grained instance separation and hierarchical structural modeling without relying on global cropping. ProtoScale integrates semantic-clustering-guided multi-scale feature aggregation with context-preserving data augmentation. It jointly optimizes semantic grouping, instance discrimination, and cross-scale contextual modeling, thereby significantly enhancing structured representation learning. Evaluated on a joint COCO and UA-DETRAC benchmark, our approach achieves state-of-the-art downstream object detection performance using only minimal annotations and few fine-tuning epochs. This demonstrates the effectiveness and generalizability of structured SSL representations for dense prediction tasks.

πŸ“ Abstract
Self-supervised learning (SSL) has emerged as a powerful technique for learning visual representations. While recent SSL approaches achieve strong results in global image understanding, they are limited in capturing structured representations of scenes. In this work, we propose a self-supervised approach that progressively builds structured visual representations by combining semantic grouping, instance-level separation, and hierarchical structuring. Our approach, based on a novel ProtoScale module, captures visual elements across multiple spatial scales. Unlike common strategies such as DINO that rely on random cropping and global embeddings, we preserve full scene context across augmented views to improve performance on dense prediction tasks. We validate our method on downstream object detection using a combined subset of multiple datasets (COCO and UA-DETRAC). Experimental results show that our method learns object-centric representations that enhance supervised object detection and outperform state-of-the-art methods, even when trained with limited annotated data and fewer fine-tuning epochs.
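The abstract contrasts the method with crop-based strategies like DINO by keeping the full scene in every augmented view. The paper does not specify the augmentation recipe; as a rough illustration only, a minimal sketch of a context-preserving two-view pipeline (photometric jitter and flips, no cropping; function name and jitter ranges are assumptions, not the authors' code) might look like this:

```python
import torch

def context_preserving_views(img, brightness=0.4, contrast=0.4):
    """Produce two augmented views of one full-scene image tensor (C, H, W).

    Hypothetical sketch: spatial structure is kept intact (no random crop),
    so dense correspondences between views are preserved.
    """
    views = []
    for _ in range(2):
        v = img.clone()
        if torch.rand(1).item() < 0.5:
            v = torch.flip(v, dims=[-1])  # horizontal flip keeps full context
        # random photometric jitter around the identity transform
        b = 1.0 + (torch.rand(1).item() * 2 - 1) * brightness
        c = 1.0 + (torch.rand(1).item() * 2 - 1) * contrast
        mean = v.mean(dim=(-2, -1), keepdim=True)
        v = ((v - mean) * c + mean) * b  # contrast, then brightness scaling
        views.append(v.clamp(0.0, 1.0))
    return views
```

Because both views cover the same spatial extent, per-location features can be matched across views for dense objectives, which crop-based pipelines cannot guarantee.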
Problem

Research questions and friction points this paper is trying to address.

Capturing structured object representations with self-supervised learning
Preserving full scene context across augmented views
Enhancing object detection with limited annotated data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic grouping and instance separation
ProtoScale module for multi-scale elements
Full scene context preservation across views
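The listed contributions center on semantic grouping with a multi-scale ProtoScale module. The paper excerpt gives no implementation details, so the following is only a hedged sketch of one plausible form, prototype-based soft clustering of dense features shared across pooled scales (all function names, the prototype formulation, and the pooling scheme are assumptions, not the authors' method):

```python
import torch
import torch.nn.functional as F

def prototype_assign(features, prototypes, temperature=0.1):
    """Soft-assign each spatial feature to learnable semantic prototypes.

    features:   (B, C, H, W) dense feature map
    prototypes: (K, C) learnable prototype vectors
    returns:    (B, H*W, K) assignment probabilities per location
    """
    f = F.normalize(features.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    p = F.normalize(prototypes, dim=-1)                           # (K, C)
    logits = f @ p.t() / temperature                              # (B, HW, K)
    return logits.softmax(dim=-1)

def multiscale_assignments(features, prototypes, scales=(1, 2, 4)):
    """Assign features to the same prototypes at several spatial scales,
    a rough stand-in for multi-scale semantic grouping."""
    out = []
    for s in scales:
        pooled = F.avg_pool2d(features, kernel_size=s) if s > 1 else features
        out.append(prototype_assign(pooled, prototypes))
    return out
```

Sharing one prototype set across scales ties coarse scene-level groups to fine instance-level ones, which is one way the hierarchical structuring described above could be realized.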
Oussama Hadjerci
DASIA, Courbevoie, 92400, France
Antoine Letienne
DASIA, Courbevoie, 92400, France
Mohamed Abbas Hedjazi
phagos.org
Deep learning
Adel Hafiane
INSA Centre Val de Loire
Image processing, computer vision, machine learning