🤖 AI Summary
This work addresses the significant performance degradation of handwritten text recognition models when encountering unseen writing styles. Existing writer adaptation methods typically rely on parameter updates during inference, which incur high computational costs and require careful hyperparameter tuning. To overcome these limitations, the authors propose a context-driven framework that introduces multimodal in-context learning to handwritten text recognition for the first time, enabling few-shot adaptation from only a handful of target-writer samples without any parameter updates. The approach employs a lightweight CNN-Transformer hybrid architecture, combined with context length optimization and joint training strategies. Evaluated on the IAM and RIMES datasets, the model achieves character error rates of 3.92% and 2.34%, respectively, substantially outperforming all existing writer-independent models that do not update parameters at inference time.
📝 Abstract
While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.
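To make the in-context adaptation idea concrete, here is a minimal sketch of how a few-shot prompt for such a model could be assembled: a handful of (image, transcription) pairs from the target writer are embedded and concatenated into one token sequence, followed by the query image, and the recognizer decodes the query conditioned on that sequence with no gradient updates. The function name `build_context_sequence` and the random projections are illustrative stand-ins for the paper's CNN image encoder and character embeddings, not its actual implementation.

```python
import numpy as np

def build_context_sequence(context_pairs, query_image, embed_dim=64, seed=0):
    """Assemble a few-shot in-context prompt for writer adaptation.

    context_pairs: list of (image, transcription) samples from the
    target writer; query_image: the line image to recognize.
    Returns a (seq_len, embed_dim) token sequence.
    """
    rng = np.random.default_rng(seed)
    # Stand-in fixed projections; the real model would use a learned
    # CNN image encoder and learned character embeddings.
    img_proj = rng.standard_normal((32 * 32, embed_dim)) / 32.0
    char_proj = rng.standard_normal((256, embed_dim)) / 16.0

    tokens = []
    for image, text in context_pairs:
        tokens.append(image.reshape(-1) @ img_proj)           # one image token
        tokens.extend(char_proj[ord(c) % 256] for c in text)  # character tokens
    tokens.append(query_image.reshape(-1) @ img_proj)         # query image last
    return np.stack(tokens)

# Two context samples from the target writer, then the query line:
ctx = [(np.ones((32, 32)), "hello"), (np.zeros((32, 32)), "world")]
seq = build_context_sequence(ctx, np.ones((32, 32)))
print(seq.shape)  # 2 context images + 10 characters + 1 query image
```

Because the context is consumed purely at the input side, adapting to a new writer reduces to swapping the context pairs, which is what removes the need for backpropagation or hyperparameter tuning at inference time.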