🤖 AI Summary
This work addresses Sign Language Translation (SLT) with a two-stage "detection + generation" hybrid framework, Spotter+GPT. First, a Transformer-based sign spotter detects isolated sign units in the video (spotting) with high precision. Second, the resulting symbolic sign sequence is fed to a fine-tuned or prompt-engineered GPT-series large language model (LLM), which generates grammatically correct and contextually coherent spoken-language sentences. Notably, this approach decouples sign spotting from linguistic generation: the spotter is trained on linguistically annotated spotting data, while the LLM handles high-level syntactic and semantic modeling, improving both interpretability and generalization. On the PHOENIX14-T benchmark, Spotter+GPT achieves state-of-the-art performance, surpassing leading end-to-end SLT methods in both BLEU and METEOR scores, validating the effectiveness of this paradigm.
📝 Abstract
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. In this paper, we introduce a hybrid SLT approach, Spotter+GPT, that utilizes a sign spotter and a powerful Large Language Model (LLM) to improve SLT performance. Spotter+GPT breaks down the SLT task into two stages. The videos are first processed by the Spotter, which is trained on a linguistic sign language dataset, to identify individual signs. These spotted signs are then passed to an LLM, which transforms them into coherent and contextually appropriate spoken language sentences. The source code of the Spotter is available at https://gitlab.surrey.ac.uk/cogvispublic/sign-spotter.
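The two-stage pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the spotter is stubbed with a fixed gloss sequence, the LLM is any callable taking a prompt string, and all function names (`spot_signs`, `build_prompt`, `translate`) are hypothetical.

```python
def spot_signs(video_frames):
    """Stage 1: a sign spotter maps video frames to a sequence of sign glosses.
    Stubbed here with a fixed output for illustration; the real spotter is a
    trained video model."""
    return ["MORGEN", "REGEN", "NORD"]  # example German Sign Language glosses

def build_prompt(glosses):
    """Build an LLM prompt asking for a coherent spoken-language sentence
    from the spotted gloss sequence (illustrative wording, not the paper's)."""
    return ("Translate the following sequence of sign glosses into a "
            "grammatical German sentence: " + " ".join(glosses))

def translate(video_frames, llm):
    """Stage 2: pass the spotted glosses to an LLM for sentence generation.
    `llm` is any callable mapping a prompt string to a response string."""
    glosses = spot_signs(video_frames)
    return llm(build_prompt(glosses))
```

A dummy LLM such as `lambda prompt: "..."` can stand in for a real API call when testing the plumbing; in practice the `llm` callable would wrap a GPT-series model endpoint.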