FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

📅 2024-08-27
🏛️ International Conference on Pattern Recognition
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the challenges of recognizing arbitrarily shaped, low-resolution, and multilingual (English/Vietnamese) text in real-world scenarios, this paper proposes an efficient and robust end-to-end text spotting framework. Methodologically: (i) we design a lightweight, faster self-attention unit, SAC2, that accelerates computation without sacrificing accuracy; (ii) we introduce the first deep integration of a lightweight Swin Transformer backbone with a Transformer Encoder-Decoder architecture; and (iii) we enhance localization-recognition consistency via multi-scale feature fusion and deformable ROI alignment. Our method achieves state-of-the-art performance on ICDAR2015, CTW1500, and TotalText, with a 37% speedup in inference time and a 29% reduction in model parameters. The source code, pretrained models, and annotated dataset are publicly released.

📝 Abstract
The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speed while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrarily shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released on our GitHub.
Problem

Research questions and friction points this paper is trying to address.

Improving the efficiency of multilingual scene text detection and recognition.
Maintaining OCR accuracy across both structured and unstructured environments.
Demonstrating that advanced transformer architectures can improve the adaptability and speed of text spotting in real-world settings.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Swin Transformer with Encoder-Decoder architecture
Introduces faster self-attention unit SAC2
Validated on multiple datasets for multilingual text spotting
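The summary names SAC2 as a faster self-attention unit but does not spell out its internals here. As a hedged illustration only, the NumPy sketch below contrasts standard scaled dot-product attention, whose cost grows quadratically with sequence length, with a local windowed variant, one common way efficient attention units reduce that cost. All function names are hypothetical; this is not the authors' SAC2 implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard scaled dot-product attention: the n x n score
    # matrix makes this O(n^2) in sequence length n.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def windowed_attention(q, k, v, window=4):
    # Each query attends only to keys inside a local window,
    # shrinking the work from n*n to roughly n*window scores.
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window // 2)
        hi = min(n, i + window // 2 + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        out[i] = softmax(scores) @ v[lo:hi]
    return out
```

When the window covers the whole sequence, the windowed variant reduces to full attention, so accuracy is traded against speed purely through the window size.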