🤖 AI Summary
Existing self-supervised video pretraining methods for endoscopic analysis are prone to motion bias, often overlooking the static structural semantics critical for clinical decision-making, and are further constrained by the scarcity of annotated data. To address these limitations, this work proposes a hierarchical representation learning framework inspired by human cognition: it first focuses on intra-frame lesion regions to learn static semantics and then models their inter-frame evolution to capture contextual dynamics. The approach explicitly decouples and jointly optimizes these two semantic types through several key techniques: Teacher-Prior Adaptive Masking (TPAM), multi-view sparse sampling, Cross-View Masked Feature Completion (CVMFC), and Attention-Guided Temporal Prediction (AGTP). Evaluated across 11 endoscopic video datasets, the method significantly outperforms current state-of-the-art approaches, demonstrating superior representation capacity and generalization across diverse downstream tasks.
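The "focus" stage above combines two ingredients: sparse frame sampling per view and teacher-guided masking. A minimal, assumption-laden sketch of both is below. The helper names (`sparse_sample`, `tpam_mask`), the evenly spaced sampling policy, and the choice to mask the *least* teacher-salient patches (so visible tokens concentrate on lesion-related regions) are illustrative guesses, not the paper's actual implementation, which operates on ViT attention maps.

```python
def sparse_sample(clip_len, n_frames, jitter=0):
    # Evenly spaced sparse frame indices for one view; a second view
    # can use a different jitter to produce an offset sampling grid.
    step = clip_len / n_frames
    return [min(clip_len - 1, int(i * step) + jitter) for i in range(n_frames)]

def tpam_mask(teacher_scores, mask_ratio):
    # Illustrative teacher-prior masking: rank patches by the teacher's
    # saliency score and mask the lowest-scoring fraction, keeping the
    # most salient (lesion-related) patches visible. The real TPAM
    # policy in FPRL may differ.
    n = len(teacher_scores)
    n_mask = int(round(n * mask_ratio))
    order = sorted(range(n), key=lambda i: teacher_scores[i])  # ascending saliency
    masked = set(order[:n_mask])
    return [i in masked for i in range(n)]

# Two sparse views of a 32-frame clip, offset from each other:
view_a = sparse_sample(32, 4)            # [0, 8, 16, 24]
view_b = sparse_sample(32, 4, jitter=2)  # [2, 10, 18, 26]

# Per-frame patch mask from hypothetical teacher saliency scores:
mask = tpam_mask([0.9, 0.1, 0.5, 0.2, 0.8, 0.3], mask_ratio=0.5)
# masks the three lowest-saliency patches (indices 1, 3, 5)
```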
📝 Abstract
Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at https://github.com/MLMIP/FPRL.
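To make the "perceive" stage concrete, the sketch below shows one plausible form of attention-guided temporal prediction: a query attends over past frame features and predicts the next frame's feature as the attention-weighted sum. Everything here (function names, dot-product attention, the MSE objective) is an assumption for illustration; FPRL's actual AGTP module is defined in the paper and code, not reproduced here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_next_frame(past_feats, query):
    # Hypothetical attention-guided prediction: the query attends over
    # past frame features (dot-product attention) and the prediction is
    # the attention-weighted combination of those features.
    logits = [sum(q * k for q, k in zip(query, f)) for f in past_feats]
    w = softmax(logits)
    dim = len(past_feats[0])
    return [sum(w[t] * past_feats[t][d] for t in range(len(past_feats)))
            for d in range(dim)]

def mse(pred, target):
    # A simple regression objective for the predicted feature.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Toy example: two past frame features, query aligned with the first.
past = [[1.0, 0.0], [0.0, 1.0]]
pred = predict_next_frame(past, query=[1.0, 0.0])
# pred leans toward the first frame's feature, since the query matches it
```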