Comparison of Autoencoders for tokenization of ASL datasets

📅 2025-01-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses image tokenization for American Sign Language (ASL) recognition and generation, using an ASL dataset of 87,000 images across 29 hand sign classes. It compares three autoencoder architectures (feedforward, convolutional, and diffusion-based) on reconstruction fidelity and suitability for downstream ASL modeling. The Diffusion Autoencoder (Diffusion AE) grounds tokenization in probabilistic noise modeling and iterative denoising. Models are trained end to end and evaluated objectively with mean squared error (MSE) and subjectively with Mean Opinion Score (MOS); MOS is a human rating of the reconstructions, not a training objective. The Diffusion AE outperforms its convolutional and feedforward counterparts, with a reported 32% reduction in MSE and a MOS of 4.6/5, indicating better reconstruction fidelity and perceptual consistency. The results support diffusion models as a practical basis for tokenizing sign language images in multimodal applications such as recognition and generation.
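The "probabilistic noise modeling and iterative denoising" mentioned above can be illustrated with the standard DDPM-style forward noising step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The sketch below is not the paper's implementation: the linear beta schedule, the 1,000-step horizon, and all function names are assumptions chosen for illustration, and a real model would replace the known noise with a learned noise estimate.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Sample x_t from x_0 in one shot; also return the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    xt = [scale_signal * x + scale_noise * e for x, e in zip(x0, eps)]
    return xt, eps


def estimate_x0(xt, eps_hat, alpha_bar):
    """Invert the noising step given a noise estimate eps_hat."""
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [(x - scale_noise * e) / scale_signal for x, e in zip(xt, eps_hat)]


# Hypothetical usage on three normalized pixel values.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
pixels = [0.1, 0.5, 0.9]
noisy, eps = add_noise(pixels, alpha_bars[499], rng)
recovered = estimate_x0(noisy, eps, alpha_bars[499])
```

With the true noise supplied, `estimate_x0` recovers the original pixels up to floating-point error. In a trained diffusion model the noise is predicted by a network, and the quality of that prediction determines reconstruction fidelity.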

📝 Abstract

Generative AI, powered by large language models (LLMs), has revolutionized applications across text, audio, images, and video. This study focuses on developing and evaluating encoder-decoder architectures for the American Sign Language (ASL) image dataset, consisting of 87,000 images across 29 hand sign classes. Three approaches were compared: Feedforward Autoencoders, Convolutional Autoencoders, and Diffusion Autoencoders. The Diffusion Autoencoder outperformed the others, achieving the lowest mean squared error (MSE) and highest Mean Opinion Score (MOS) due to its probabilistic noise modeling and iterative denoising capabilities. The Convolutional Autoencoder demonstrated effective spatial feature extraction but lacked the robustness of the diffusion process, while the Feedforward Autoencoder served as a baseline with limitations in handling complex image data. Objective and subjective evaluations confirmed the superiority of the Diffusion Autoencoder for high-fidelity image reconstruction, emphasizing its potential in multimodal AI applications such as sign language recognition and generation. This work provides critical insights into designing robust encoder-decoder systems to advance multimodal AI capabilities.
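The objective and subjective evaluations described above rest on two simple quantities: per-image MSE and an averaged human rating. The following is a minimal sketch of how they might be computed, assuming flattened pixel lists and a 1-to-5 rating scale; the function names and sample values are hypothetical and do not come from the paper.

```python
from statistics import mean


def mse(original, reconstructed):
    """Mean squared error between two flattened images of equal length."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    return mean((o - r) ** 2 for o, r in zip(original, reconstructed))


def mean_opinion_score(ratings):
    """Average of human quality ratings, each assumed to lie in 1..5."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


def relative_reduction(baseline, improved):
    """Fractional drop from a baseline error to an improved error."""
    return (baseline - improved) / baseline


# Hypothetical values, for illustration only.
baseline_error = mse([0.0, 1.0], [0.0, 0.0])   # one pixel off by 1.0
improved_error = mse([0.0, 1.0], [0.0, 0.5])   # same pixel off by 0.5
drop = relative_reduction(baseline_error, improved_error)
score = mean_opinion_score([4, 5, 5])
```

A relative reduction of this kind is how a statement such as "X% lower MSE" is typically derived, although the baseline a given paper compares against has to be taken from its own results.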

Problem

Research questions and friction points this paper is trying to address.

Autoencoders

American Sign Language

Video Translation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Autoencoders

Multimodal Applications

ASL Image Reconstruction
