Normalized vs Diplomatic Annotation: A Case Study of Automatic Information Extraction from Handwritten Uruguayan Birth Certificates

📅 2025-07-11
🤖 AI Summary
This study addresses key information extraction from handwritten Uruguayan birth certificates under low-resource conditions, comparing normalized and diplomatic annotation strategies across field types. We propose choosing the annotation strategy according to each field's semantic properties: normalized annotation for fields that follow a standard form (e.g., dates, locations), and diplomatic annotation, which preserves the original orthography, for highly variable fields (e.g., names, surnames). Using a Document Attention Network (DAN), we fine-tune a separate model for each strategy and evaluate both on 201 real-world certificate images written by more than 15 writers. Results show consistent F1-score improvements across all fields, with diplomatic annotation yielding an absolute +8.3% gain in name extraction. The work ties the choice of annotation strategy to how readily a field can be standardized, a methodology that may transfer to other handwritten document information extraction tasks.

📝 Abstract
This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates, handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but with different annotation methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which cannot be standardized.
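The difference between the two annotation strategies can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the sample transcriptions, and the choice of an ISO-style date as the normalized form are invented and are not taken from the paper's dataset or annotation guidelines.

```python
# Hypothetical field names, chosen for illustration only.
NORMALIZED_FIELDS = {"birth_date", "birth_place"}   # fields that can be standardized
DIPLOMATIC_FIELDS = {"name", "surname"}             # fields kept as written


def choose_strategy(field):
    """Return the annotation strategy assumed for a given field."""
    if field in NORMALIZED_FIELDS:
        return "normalized"
    if field in DIPLOMATIC_FIELDS:
        return "diplomatic"
    raise ValueError(f"no strategy defined for field {field!r}")


def build_targets(record):
    """Pick one training target per field.

    `record` maps each field to a (diplomatic_text, normalized_text) pair.
    """
    targets = {}
    for field, (diplomatic, normalized) in record.items():
        if choose_strategy(field) == "normalized":
            targets[field] = normalized
        else:
            targets[field] = diplomatic
    return targets


# Invented sample: the diplomatic form keeps the writer's spelling,
# the normalized form maps the value to a standard representation.
example = {
    "birth_date": ("veinte de Marzo de mil novecientos cinco", "1905-03-20"),
    "name": ("Maria Ynés", "María Inés"),
}
targets = build_targets(example)
print(targets)
```

In this sketch the date target uses the normalized value while the name target keeps the original spelling, which mirrors the field-dependent finding reported in the abstract.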
Problem

Research questions and friction points this paper is trying to address.

Evaluates DAN for extracting data from handwritten birth certificates
Compares normalized vs diplomatic annotation for transcription accuracy
Assesses effectiveness for standardized vs non-standardized fields
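A field-by-field comparison of this kind could be scored along the following lines. This is a sketch under stated assumptions, not the paper's evaluation protocol: it assumes predictions and references are available as one dictionary per document, counts a field as correct only on an exact string match, and uses the standard F1 formula. The function name `per_field_f1` and the sample values are hypothetical.

```python
from collections import defaultdict


def per_field_f1(gold_docs, pred_docs):
    """Compute an exact-match F1 score for each field.

    gold_docs and pred_docs are lists of dicts (field -> value),
    aligned so that item i in both lists refers to the same document.
    """
    tp = defaultdict(int)  # predicted value equals the reference
    fp = defaultdict(int)  # a value was predicted but is wrong or spurious
    fn = defaultdict(int)  # a reference value was missed or mispredicted
    for gold, pred in zip(gold_docs, pred_docs):
        for field in set(gold) | set(pred):
            g, p = gold.get(field), pred.get(field)
            if p is not None and p == g:
                tp[field] += 1
            else:
                if p is not None:
                    fp[field] += 1
                if g is not None:
                    fn[field] += 1
    scores = {}
    for field in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[field] + fp[field] + fn[field]
        scores[field] = 2 * tp[field] / denom if denom else 0.0
    return scores


# Invented example values, for illustration only.
gold = [
    {"name": "Maria", "birth_date": "1905-03-20"},
    {"name": "Juan", "birth_date": "1910-01-02"},
]
pred = [
    {"name": "Maria", "birth_date": "1905-03-20"},
    {"name": "Jaun", "birth_date": "1910-01-02"},
]
scores = per_field_f1(gold, pred)
print(scores)
```

Running the same function on the outputs of a model trained with normalized targets and one trained with diplomatic targets would give one score per field and per strategy, which is the shape of comparison the bullet points describe.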
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Document Attention Network (DAN)
Compares normalized and diplomatic annotations
Requires minimal training data
Natalia Bottaioli
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France; Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay; Digital Sense, Montevideo, Uruguay
Solène Tarride
TEKLIA, Paris, France
Jérémy Anger
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France
Seginus Mowlavi
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France
Marina Gardella
Centre Borelli, ENS Paris-Saclay, Université Paris-Saclay
Image processing
Antoine Tadros
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France
Gabriele Facciolo
Professor of Mathematics, Centre Borelli, ENS Paris-Saclay
Image Processing, Computer Vision, Remote Sensing
Rafael Grompone von Gioi
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, France
Christopher Kermorvant
Founder and Scientific Director at Teklia
Machine Learning, Handwriting Recognition, Grammatical Inference, Document image analysis
Jean-Michel Morel
City University of Hong Kong, Hong Kong
Javier Preciozzi
PhD
Computer Vision & Image Processing