🤖 AI Summary
Semantic segmentation models often rely heavily on texture cues, which compromises robustness. To address this, we propose a style-transfer-based data augmentation method in which the style varies across random image regions formed by Voronoi cells: a local style transformation is applied within each synthetically generated cell to suppress texture information and encourage shape awareness. The approach requires no additional annotations and is compatible with both CNN- and Transformer-based architectures. Experiments on Cityscapes and PASCAL Context show that the method reduces texture bias and strongly improves segmentation robustness under common image corruptions and adversarial attacks, with these effects holding across both architecture families and both datasets. The key contribution is applying region-wise style transfer, structured by Voronoi cells, to shift semantic segmentation models away from texture cues and toward shape-based features.
📝 Abstract
Recent research has investigated the shape and texture biases of deep neural networks (DNNs) in image classification, which influence their generalization capabilities and robustness. It has been shown that, compared to regular DNN training, training with stylized images reduces texture bias in image classification and improves robustness to image corruptions. To advance this line of research, we examine whether style transfer can likewise deliver these two effects in semantic segmentation. To this end, we perform style transfer with the style varying across artificial image areas. These random areas are formed by a chosen number of Voronoi cells. The resulting style-transferred data is then used to train semantic segmentation DNNs with the objective of reducing their dependence on texture cues while enhancing their reliance on shape-based features. Our experiments show that, in semantic segmentation, style transfer augmentation reduces texture bias and strongly increases robustness to common image corruptions as well as adversarial attacks. These observations hold for convolutional neural networks and transformer architectures on the Cityscapes dataset as well as on PASCAL Context, showing the generality of the proposed method.
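The region-wise composition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that several fully stylized versions of the same image have already been produced by some style-transfer model (random arrays stand in for them here), and the function names `voronoi_labels` and `compose_by_cells`, the uniform sampling of seed points, and the random assignment of one style per cell are all hypothetical choices made for illustration.

```python
import numpy as np


def voronoi_labels(height, width, num_cells, rng):
    """Assign each pixel the index of its nearest random seed point."""
    # Assumption: seed points are sampled uniformly over the image plane.
    seeds = rng.uniform(low=[0, 0], high=[height, width], size=(num_cells, 2))
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([ys, xs], axis=-1).astype(np.float64)  # (H, W, 2)
    # Squared Euclidean distance from every pixel to every seed: (H, W, num_cells)
    d2 = ((pixels[:, :, None, :] - seeds[None, None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=-1)  # (H, W) integer cell index per pixel


def compose_by_cells(stylized, labels, num_cells, rng):
    """Build one image whose style varies from Voronoi cell to Voronoi cell.

    stylized: (S, H, W, 3) array holding S pre-stylized versions of one image.
    labels:   (H, W) array of cell indices in [0, num_cells).
    """
    # Assumption: each cell draws its style index independently at random.
    style_of_cell = rng.integers(0, stylized.shape[0], size=num_cells)
    style_map = style_of_cell[labels]  # (H, W) style index per pixel
    rows, cols = np.indices(labels.shape)
    return stylized[style_map, rows, cols]  # (H, W, 3)


rng = np.random.default_rng(0)
H, W, S = 64, 96, 4
# Placeholder data standing in for S stylized versions of a single image.
stylized = rng.integers(0, 256, size=(S, H, W, 3), dtype=np.uint8)
labels = voronoi_labels(H, W, num_cells=8, rng=rng)
augmented = compose_by_cells(stylized, labels, num_cells=8, rng=rng)
```

Because the composition only selects pixels and does not move them, the original segmentation mask can presumably be reused unchanged as the training target for the augmented image.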