ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor interpretability and weak contextual logical reasoning in event stream scene text recognition, this paper proposes ESTR-CoT, a novel framework that introduces Chain-of-Thought (CoT) reasoning into this domain for the first time. It synergistically integrates a vision encoder (EVA-CLIP ViT-G/14) and a large language model (Vicuna-7B) to jointly generate text predictions and interpretable reasoning paths. To strengthen logical reasoning capabilities, the authors construct a large-scale CoT-annotated dataset via a three-stage pipeline. Furthermore, Q-Former-based cross-modal alignment and end-to-end supervised fine-tuning are employed. Extensive experiments on the EventSTR, WordArt*, and IC15* benchmarks demonstrate significant improvements in both recognition accuracy and model interpretability, validating the effectiveness and generalizability of the approach.

📝 Abstract
Event stream based scene text recognition is a newly arising research topic that outperforms widely used RGB cameras in extremely challenging scenarios, especially under low illumination and fast motion. Existing works adopt either end-to-end encoder-decoder frameworks or large language models for enhanced recognition; however, they are still limited by insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-Former is used to align the vision tokens to the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized via supervised fine-tuning in an end-to-end manner. In addition, we propose a large-scale CoT dataset, built through a three-stage process (i.e., generation, polishing, and expert verification), to train our framework. This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validate the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/ESTR-CoT.
Problem

Research questions and friction points this paper is trying to address.

Improving interpretability in event stream text recognition
Enhancing contextual reasoning for scene text recognition
Addressing low illumination and fast motion challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses EVA-CLIP vision encoder for event stream tokenization
Integrates Vicuna-7B LLM with Q-former alignment
Trains with three-stage processed CoT dataset
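The tokenize-align-generate pipeline listed above can be sketched as a minimal Python mock. This is purely illustrative: the function names, token shapes, and stub outputs are assumptions for exposition, standing in for the real EVA-CLIP (ViT-G/14) encoder, Q-Former, and Vicuna-7B components, which are not reproduced here.

```python
# Illustrative sketch of the ESTR-CoT inference flow described above.
# All names and stub behaviors are hypothetical placeholders, not the
# authors' implementation.

def vision_encoder(event_stream):
    """Stand-in for EVA-CLIP (ViT-G/14): maps an event stream to vision tokens."""
    return [f"vtok_{i}" for i, _ in enumerate(event_stream)]

def q_former(vision_tokens, num_queries=4):
    """Stand-in for the Q-Former: compresses vision tokens into a fixed
    number of query embeddings aligned to the LLM's input space."""
    return vision_tokens[:num_queries]

def llm_generate(aligned_tokens, prompt):
    """Stand-in for Vicuna-7B: jointly emits a CoT rationale and an answer."""
    cot = f"Reasoning over {len(aligned_tokens)} aligned tokens for: {prompt}"
    answer = "RECOGNIZED_TEXT"  # dummy prediction
    return {"cot": cot, "answer": answer}

def estr_cot(event_stream, prompt="Read the text in the scene."):
    tokens = vision_encoder(event_stream)   # event stream -> vision tokens
    aligned = q_former(tokens)              # cross-modal alignment
    return llm_generate(aligned, prompt)    # answer + reasoning path together

result = estr_cot([0.1, 0.2, 0.3, 0.4, 0.5])
print(result["answer"])  # the text prediction
print(result["cot"])     # the interpretable reasoning path
```

The key design point the sketch captures is that the answer and the CoT rationale are produced by a single generation pass, rather than by a separate post-hoc explanation module.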
Xiao Wang
School of Computer Science and Technology, Anhui University, Hefei 230601, China
Jingtao Jiang
School of Computer Science and Technology, Anhui University, Hefei 230601, China
Qiang Chen
School of Computer Science and Technology, Anhui University, Hefei 230601, China
Lan Chen
Communication University of China
Image/Video generation and editing
Lin Zhu
Beijing Institute of Technology, Beijing, China
Yaowei Wang
The Hong Kong Polytechnic University
Yonghong Tian
Peng Cheng Laboratory, Shenzhen, China; School of Computer Science, Peking University, China; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China
Jin Tang
Anhui University
Computer vision, intelligent video analysis