6 June 2024 | Talal H. Noor, Ayman Noor, Ahmed F. Alharbi, Ahmed Faisal, Rakan Alrashidi, Ahmed S. Alsaedi, Ghada Alharbi, Tawfeeq Alsonoosy, and Abdullah Alsaedi
This paper presents a hybrid deep learning model for real-time Arabic Sign Language (ArSL) recognition, addressing the shortage of sign language interpreters in Saudi Arabia. The hybrid model combines a Convolutional Neural Network (CNN) for spatial feature extraction with a Long Short-Term Memory (LSTM) network for temporal feature extraction, enabling the system to recognize both static and dynamic gestures: the CNN handles image-based recognition of static gestures, while the LSTM handles video-based recognition of dynamic gestures. The system architecture comprises four layers: data acquisition, mobile network, cloud, and sign language recognition. The model was deployed on Google Cloud Computing Services, using virtual machines running Ubuntu with Docker for containerization. It was evaluated on a dataset of 4,000 images and 500 videos covering 10 static gesture words and 10 dynamic gesture words; the CNN achieved 94.40% accuracy and the LSTM 82.70%. These results indicate that the hybrid model can meaningfully improve communication accessibility for the hearing-impaired community in Saudi Arabia.
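To make the two recognition paths concrete, below is a minimal Keras-style sketch (not the authors' published code) of a CNN classifier for static-gesture images and a CNN+LSTM classifier for dynamic-gesture videos. Only the 10-class outputs and the CNN/LSTM split come from the paper; the layer sizes, input resolution, frame count, and optimizer settings are illustrative assumptions.

```python
# Illustrative sketch of the two recognition paths described in the paper:
# a CNN for static-gesture images and a per-frame CNN feeding an LSTM for
# dynamic-gesture videos. Architecture details here are assumptions, not
# the authors' actual configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_STATIC_CLASSES = 10    # 10 static gesture words (from the paper)
NUM_DYNAMIC_CLASSES = 10   # 10 dynamic gesture words (from the paper)
IMG_SHAPE = (64, 64, 3)    # assumed input resolution
SEQ_LEN = 30               # assumed number of frames sampled per video


def build_cnn_backbone():
    """Convolutional feature extractor for spatial features (assumed layout)."""
    return models.Sequential([
        layers.Input(shape=IMG_SHAPE),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
    ])


def build_static_model():
    """CNN path: classifies one image into one of the static gesture words."""
    return models.Sequential([
        build_cnn_backbone(),
        layers.Dense(NUM_STATIC_CLASSES, activation="softmax"),
    ])


def build_dynamic_model():
    """CNN+LSTM path: applies the CNN to each frame, then an LSTM over the
    frame-level features to capture the temporal structure of a video."""
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN,) + IMG_SHAPE),
        layers.TimeDistributed(build_cnn_backbone()),
        layers.LSTM(64),
        layers.Dense(NUM_DYNAMIC_CLASSES, activation="softmax"),
    ])


if __name__ == "__main__":
    static_model = build_static_model()
    dynamic_model = build_dynamic_model()
    for model in (static_model, dynamic_model):
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.summary()
```

The key design point reflected here is that the same kind of convolutional extractor serves both paths: for videos it is wrapped in TimeDistributed so the CNN runs on every frame, and the LSTM then aggregates the per-frame features into a single prediction for the dynamic gesture word.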