ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor interpretability and weak contextual logical reasoning in event stream scene text recognition, this paper proposes ESTR-CoT, a novel framework that introduces Chain-of-Thought (CoT) reasoning into this domain for the first time. It synergistically integrates a vision encoder (EVA-CLIP ViT-G/14) and a large language model (Vicuna-7B) to jointly generate text predictions and interpretable reasoning paths. To strengthen logical reasoning capabilities, the authors construct a large-scale CoT-annotated dataset via a three-stage pipeline. Furthermore, Q-Former-based cross-modal alignment and end-to-end supervised fine-tuning are employed. Extensive experiments on the EventSTR, WordArt*, and IC15* benchmarks demonstrate significant improvements in both recognition accuracy and model interpretability, validating the effectiveness and generalizability of the approach.

📝 Abstract
Event stream based scene text recognition is a newly arising research topic that outperforms widely used RGB cameras in extremely challenging scenarios, especially under low illumination and fast motion. Existing works adopt either end-to-end encoder-decoder frameworks or large language models for enhanced recognition; however, they are still limited by insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-Former is used to align the vision tokens to the pre-trained large language model Vicuna-7B, which outputs both the answer and the chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized via supervised fine-tuning in an end-to-end manner. In addition, we propose a large-scale CoT dataset, built through a three-stage process (i.e., generation, polishing, and expert verification), to train our framework. This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validate the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/ESTR-CoT.
Problem

Research questions and friction points this paper is trying to address.

Improving interpretability in event stream text recognition
Enhancing contextual reasoning for scene text recognition
Addressing low illumination and fast motion challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses EVA-CLIP vision encoder for event stream tokenization
Integrates Vicuna-7B LLM with Q-former alignment
Trains with three-stage processed CoT dataset
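The tokenize-align-generate pipeline listed above can be sketched as a minimal Python mock. This is purely illustrative: the function names, token shapes, and stub outputs are assumptions for exposition, standing in for the real EVA-CLIP (ViT-G/14) encoder, Q-Former, and Vicuna-7B components, which are not reproduced here.

```python
# Illustrative sketch of the ESTR-CoT inference flow described above.
# All names and stub behaviors are hypothetical placeholders, not the
# authors' implementation.

def vision_encoder(event_stream):
    """Stand-in for EVA-CLIP (ViT-G/14): maps an event stream to vision tokens."""
    return [f"vtok_{i}" for i, _ in enumerate(event_stream)]

def q_former(vision_tokens, num_queries=4):
    """Stand-in for the Q-Former: compresses vision tokens into a fixed
    number of query embeddings aligned to the LLM's input space."""
    return vision_tokens[:num_queries]

def llm_generate(aligned_tokens, prompt):
    """Stand-in for Vicuna-7B: jointly emits a CoT rationale and an answer."""
    cot = f"Reasoning over {len(aligned_tokens)} aligned tokens for: {prompt}"
    answer = "RECOGNIZED_TEXT"  # dummy prediction
    return {"cot": cot, "answer": answer}

def estr_cot(event_stream, prompt="Read the text in the scene."):
    tokens = vision_encoder(event_stream)   # event stream -> vision tokens
    aligned = q_former(tokens)              # cross-modal alignment
    return llm_generate(aligned, prompt)    # answer + reasoning path together

result = estr_cot([0.1, 0.2, 0.3, 0.4, 0.5])
print(result["answer"])  # the text prediction
print(result["cot"])     # the interpretable reasoning path
```

The key design point the sketch captures is that the answer and the CoT rationale are produced by a single generation pass, rather than by a separate post-hoc explanation module.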
Xiao Wang
School of Computer Science and Technology, Anhui University, Hefei 230601, China
Jingtao Jiang
School of Computer Science and Technology, Anhui University, Hefei 230601, China
Qiang Chen
School of Computer Science and Technology, Anhui University, Hefei 230601, China
Lan Chen
Communication University of China
Image/Video generation and editing
Lin Zhu
Beijing Institute of Technology, Beijing, China
Yaowei Wang
The Hong Kong Polytechnic University
Yonghong Tian
Peng Cheng Laboratory, Shenzhen, China; School of Computer Science, Peking University, China; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China
Jin Tang
Anhui University
Computer vision, intelligent video analysis