AI Summary
To address the challenge of generating controllable, semantically plausible unseen inputs for evaluating the behavioral boundaries of deep learning systems, this paper proposes Mimicry, a black-box test input generation method. Mimicry leverages a style-conditioned GAN to disentangle content and style representations in latent space, pioneering the use of style transfer for goal-directed semantic perturbation. It further introduces an output-probability-guided targeted boundary search mechanism to enable fine-grained, interpretable approximation of decision boundaries. Extensive experiments across five mainstream image classification models demonstrate that Mimicry-generated test inputs lie significantly closer to true decision boundaries, achieving substantial improvements in semantic plausibility, effectiveness, and diversity over state-of-the-art baselines. Human evaluation confirms markedly higher validity rates. This work establishes a novel paradigm for reliability verification of deep learning systems.
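The content/style disentanglement described above can be pictured with a minimal, hypothetical sketch of StyleGAN-style per-layer latent mixing. The array shapes, layer split, and function name below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def style_mix(w_content, w_style, style_layers):
    """Blend two per-layer latent codes: keep the content code everywhere,
    then copy the style code into the selected layers (illustrative only)."""
    w_mixed = w_content.copy()
    w_mixed[style_layers] = w_style[style_layers]
    return w_mixed

# Toy example: 8 generator layers, 4-dim latents (shapes are assumptions).
w_c = np.zeros((8, 4))   # stands in for the "content" latent
w_s = np.ones((8, 4))    # stands in for the "style" latent
# Coarse layers keep content; fine layers take on the target style.
w = style_mix(w_c, w_s, style_layers=slice(4, 8))
```

Sweeping which layers receive the style code (or how strongly) is what gives the method controlled, goal-directed variation rather than random perturbation.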
Abstract
Evaluating the behavioral boundaries of deep learning (DL) systems is crucial for understanding their reliability across diverse, unseen inputs. Existing solutions fall short because, unable to generate controlled input variations, they rely on untargeted random, model-based, or latent-based perturbations. In this work, we introduce Mimicry, a novel black-box test generator for fine-grained, targeted exploration of DL system boundaries. Mimicry performs boundary testing by leveraging the probabilistic nature of DL outputs to identify promising directions for exploration. It uses style-based GANs to disentangle input representations into content and style components, enabling controlled feature mixing to approximate the decision boundary. We evaluated Mimicry's effectiveness in generating boundary inputs for five widely used DL image classification systems of increasing complexity, comparing it against two baseline approaches. Our results show that Mimicry consistently identifies inputs closer to the decision boundary. It generates semantically meaningful boundary test cases that reveal new functional (mis)behaviors, whereas the baselines produce mainly corrupted or invalid inputs. Thanks to its enhanced control over latent-space manipulations, Mimicry remains effective as dataset complexity increases, maintaining competitive diversity and higher validity rates, as confirmed by human assessors.
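The "probabilistic nature of DL outputs" is commonly operationalized as the margin between the two most likely classes: the smaller the margin, the closer an input sits to the decision boundary. A minimal sketch under that assumption follows; the function names and the scan over a single mixing coefficient are illustrative, not the paper's actual algorithm:

```python
import numpy as np

def boundary_margin(probs):
    """Margin between the top-2 class probabilities.
    A margin near 0 means the classifier is maximally undecided,
    i.e. the input lies close to the decision boundary."""
    top2 = np.sort(np.asarray(probs))[-2:]
    return top2[1] - top2[0]

def mix_search(model, mix_fn, steps=11):
    """Illustrative boundary search: scan a style-mixing coefficient
    alpha in [0, 1] and keep the blended input whose predicted
    probabilities have the smallest top-2 margin."""
    best_alpha, best_margin = None, float("inf")
    for alpha in np.linspace(0.0, 1.0, steps):
        probs = model(mix_fn(alpha))  # mix_fn(alpha) blends content/style
        m = boundary_margin(probs)
        if m < best_margin:
            best_alpha, best_margin = alpha, m
    return best_alpha, best_margin
```

With a toy two-class "model" whose probabilities are `[alpha, 1 - alpha]`, the search settles on `alpha = 0.5`, where the margin vanishes; a real search would instead query the black-box classifier on GAN-generated images.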