π€ AI Summary
This work addresses pervasive label noise in real-world scene text datasets, particularly focusing on challenges such as variable-length sequence misalignment and character-level annotation errors (e.g., confusions between visually similar characters). We propose Sequence-Level Semantic Label Corruption (SSLC), the first method capable of precisely detecting label errors in variable-length scene text. SSLC jointly models imageβtext modality alignment and character-level visual similarity to dynamically generate robust pseudo-corruption labels. It integrates a multimodal encoder with a character-level tokenizer into an end-to-end detection framework. Extensive experiments on multiple real-world scene text benchmarks demonstrate that SSLC significantly outperforms existing approaches, yielding an average 3.2% improvement in Scene Text Recognition (STR) accuracy. The results validate both the effectiveness and practical utility of our method for label noise detection in scene text understanding.
π Abstract
We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC not only can cause a change in the sequence length but also takes into account the visual similarity between characters during corruption. Our method is the first to detect label errors in real-world scene text datasets successfully accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.