🤖 AI Summary
This study addresses real-time American Sign Language (ASL) video-to-text translation under edge computing constraints, where accuracy, latency, and computational efficiency are critically coupled.
Method: We systematically benchmark 3D CNNs, LSTMs, and their hybrid variants on a curated ASL dataset comprising 50 classes and 1,200 video samples, evaluating trade-offs across accuracy, per-frame inference latency, and resource consumption.
Contribution/Results: Experimental results reveal that 3D CNNs achieve the highest accuracy (92.4%) but incur a 3.2% increase in frame processing latency; LSTMs yield lower accuracy (86.7%) yet significantly reduce computational overhead; hybrid models exhibit intermediate performance. Crucially, this work provides the first quantitative characterization of the fundamental triadic trade-off among spatiotemporal feature modeling (3D CNN), sequential dynamic modeling (LSTM), and system-level constraints—accuracy, latency, and efficiency. We formalize the “scenario-driven architecture selection” principle, establishing a reproducible benchmark and decision framework for designing lightweight, assistive-technology-oriented sign language recognition systems.
📝 Abstract
This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. Though 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNNLSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical implementation.This project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.