🤖 AI Summary
Classification of medical video and 3D volumetric sequences is held back by data scarcity and high annotation costs, and existing diffusion-based generation methods offer weak semantic and temporal controllability and little quality control over noisy synthesized samples. To address these challenges, this paper proposes a controllable generative augmentation framework. Its core contributions are: (1) a multimodal conditional guidance mechanism for controllable sequence generation, enabling customization along both semantic and temporal dimensions; (2) a spatio-temporal consistency enhancement module to preserve structural coherence across frames and volumes; and (3) a dual-level (semantic and sequential) noise filtering mechanism with fine- and coarse-grained quality assessment to eliminate spurious samples. Built upon a diffusion model architecture, the framework is evaluated across three medical datasets, eleven classifiers, and three training paradigms, with reported gains that are most notable for high-risk patient identification and out-of-distribution generalization.
📝 Abstract
In the medical field, the limited availability of large-scale datasets and labor-intensive annotation processes hinder the performance of deep models. Diffusion-based generative augmentation approaches present a promising solution to this issue and have proven effective in advancing downstream medical recognition tasks. Nevertheless, existing works lack sufficient semantic and sequential steerability for challenging video/3D sequence generation and neglect quality control of noisy synthesized samples, resulting in unreliable synthetic databases that severely limit the performance of downstream tasks. In this work, we present Ctrl-GenAug, a novel and general generative augmentation framework that enables highly semantically and sequentially customized sequence synthesis and suppresses incorrectly synthesized samples, to aid medical sequence classification. Specifically, we first design a multimodal conditions-guided sequence generator for controllably synthesizing diagnosis-promotive samples. A sequential augmentation module is integrated to enhance the temporal/stereoscopic coherence of generated samples. Then, we propose a noisy synthetic data filter to suppress unreliable cases at the semantic and sequential levels. Extensive experiments on 3 medical datasets, using 11 networks trained under 3 paradigms, comprehensively analyze the effectiveness and generality of Ctrl-GenAug, particularly in underrepresented high-risk populations and out-of-domain conditions.
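To make the idea of filtering at the semantic and sequential levels more concrete, here is a minimal sketch in Python. It is not the Ctrl-GenAug implementation: the abstract does not specify the filtering criteria, so this sketch assumes two stand-ins, namely a classifier confidence threshold for the semantic check and an adjacent-frame cosine similarity threshold for the sequential check. All names (`SyntheticSequence`, `semantic_ok`, `sequential_ok`, `filter_synthetic`) and threshold values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SyntheticSequence:
    """A generated sequence (hypothetical container, not from the paper)."""
    frames: List[List[float]]   # one feature vector per frame or slice
    target_label: int           # class the generator was conditioned on
    class_probs: List[float]    # probabilities from an assumed pretrained classifier


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_ok(sample: SyntheticSequence, min_conf: float = 0.6) -> bool:
    """Assumed semantic check: the classifier agrees with the conditioning label."""
    return sample.class_probs[sample.target_label] >= min_conf


def sequential_ok(sample: SyntheticSequence, min_sim: float = 0.8) -> bool:
    """Assumed sequential check: every pair of adjacent frames is similar enough."""
    pairs = zip(sample.frames, sample.frames[1:])
    return all(cosine(a, b) >= min_sim for a, b in pairs)


def filter_synthetic(samples: List[SyntheticSequence],
                     min_conf: float = 0.6,
                     min_sim: float = 0.8) -> List[SyntheticSequence]:
    """Keep only samples that pass both the semantic and the sequential check."""
    return [s for s in samples
            if semantic_ok(s, min_conf) and sequential_ok(s, min_sim)]


# Usage: one coherent, correctly labeled sample and two unreliable ones.
good = SyntheticSequence([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]], 1, [0.2, 0.8])
wrong_label = SyntheticSequence([[1.0, 0.0], [0.9, 0.1]], 1, [0.9, 0.1])
jumpy = SyntheticSequence([[1.0, 0.0], [0.0, 1.0]], 1, [0.1, 0.9])
kept = filter_synthetic([good, wrong_label, jumpy])
```

In this sketch a sample must pass both checks to be kept, so a sequence that looks coherent but contradicts its conditioning label is dropped, as is a correctly labeled sequence with abrupt frame-to-frame changes. Whether the actual framework combines its criteria this way is an assumption.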