Towards Analysing Invoices and Receipts with Amazon Textract

📅 2025-12-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically evaluates AWS Textract's performance in extracting structured fields, particularly the total amount, from real-world receipts spanning diverse formats, quality levels, and degradation conditions (e.g., blur, skew, occlusion), revealing critical failures in layout understanding and robustness.

Method: We propose the first fine-grained, receipt-specific diagnostic framework that jointly models image quality and layout features to enable interpretable failure attribution. Based on empirical analysis, we design an end-to-end optimization pipeline integrating targeted preprocessing (e.g., skew correction, contrast enhancement) and a rule-based post-processing engine.

Contribution/Results: Our framework establishes a reproducible OCR diagnostic paradigm for receipt processing. Experiments show 98.2% recall for total-amount extraction, yet performance degrades significantly under low-quality imaging, which highlights key failure modes. The proposed pipeline delivers empirically validated, production-ready mitigation strategies, bridging the gap between diagnostic insight and deployable OCR engineering.

📝 Abstract

This paper presents an evaluation of AWS Textract for extracting data from receipts. We analyse Textract's functionality using a dataset of receipts in varied formats and conditions. The analysis provides a qualitative view of Textract's strengths and limitations. While receipt totals were consistently detected, we also observed typical issues and irregularities that were often influenced by image quality and layout. Based on these observations, we propose mitigation strategies.
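To make the idea of a mitigation strategy concrete, the following is a minimal sketch of one possible rule-based post-processing step on the consumer side. It is not the paper's method. It assumes the receipt was processed with Textract's AnalyzeExpense operation, whose response lists `SummaryFields` with a `Type` and a `ValueDetection` carrying `Text` and `Confidence`. The helper names `parse_amount` and `extract_total` and the `MIN_CONFIDENCE` threshold are hypothetical choices made for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical cut-off for illustration; not a value reported in the paper.
MIN_CONFIDENCE = 80.0

_AMOUNT_RE = re.compile(r"\d[\d.,]*")


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse strings such as '$1,234.50' or '12,99' into a Decimal."""
    match = _AMOUNT_RE.search(text)
    if match is None:
        return None
    digits = match.group(0).rstrip(".,")
    # Rule: the last separator is the decimal mark if exactly two digits follow it.
    last_sep = max(digits.rfind("."), digits.rfind(","))
    if last_sep != -1 and len(digits) - last_sep - 1 == 2:
        whole = re.sub(r"[.,]", "", digits[:last_sep])
        digits = f"{whole}.{digits[last_sep + 1:]}"
    else:
        digits = re.sub(r"[.,]", "", digits)
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None


def extract_total(response: dict) -> Optional[Decimal]:
    """Return the most confident parseable TOTAL field, or None.

    `response` is assumed to have the shape returned by, e.g.,
    boto3.client("textract").analyze_expense(Document={"Bytes": image_bytes}).
    """
    candidates = []
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            if field.get("Type", {}).get("Text") != "TOTAL":
                continue
            value = field.get("ValueDetection", {})
            amount = parse_amount(value.get("Text", ""))
            confidence = value.get("Confidence", 0.0)
            if amount is not None and confidence >= MIN_CONFIDENCE:
                candidates.append((confidence, amount))
    if not candidates:
        return None  # caller can route the receipt to manual review
    return max(candidates)[1]


# Hand-written sample in the assumed response shape (not real Textract output).
sample = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"},
                 "ValueDetection": {"Text": "Corner Shop", "Confidence": 97.1}},
                {"Type": {"Text": "TOTAL"},
                 "ValueDetection": {"Text": "$1,234.50", "Confidence": 95.3}},
                {"Type": {"Text": "TOTAL"},
                 "ValueDetection": {"Text": "12,99", "Confidence": 42.0}},
            ]
        }
    ]
}
print(extract_total(sample))
```

Returning `None` instead of a low-confidence guess is a deliberate choice in this sketch: on blurred or skewed images, where the abstract reports that irregularities are more likely, a missing value is easier to catch downstream than a wrong one.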

Problem

Research questions and friction points this paper is trying to address.

Evaluating AWS Textract for receipt data extraction

Analysing Textract performance across varied receipt formats

Proposing strategies to address extraction issues and irregularities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates AWS Textract for receipt data extraction

Analyses varied receipt formats and conditions

Proposes mitigation strategies for observed issues

🔎 Similar Papers

Deep Learning based Key Information Extraction from Business Documents: Systematic Literature Review