Towards Privacy-Aware Sign Language Translation at Scale

2024 | Phillip Rust, Bowen Shi, Skyler Wang, Necati Cihan Camgöz, Jean Maillard
This paper presents SSVP-SLT, a privacy-aware sign language translation (SLT) framework that addresses the twin challenges of data scarcity and privacy risk in SLT. The framework has two stages: self-supervised video pretraining on anonymized, unannotated videos, followed by supervised SLT finetuning on a curated parallel dataset. Pretraining relies on masked autoencoding (MAE) and can optionally be complemented by language-supervised pretraining to bridge the modality gap between video and text. The design is generic, scalable, and privacy-aware, protecting signers by anonymizing the pretraining videos through facial obfuscation. SSVP-SLT achieves state-of-the-art performance on the How2Sign dataset, outperforming the strongest baselines by over 3 BLEU-4 in both finetuned and zero-shot settings, and the paper additionally introduces DailyMoth-70h, a new ASL-to-English SLT benchmark with over 70 hours of continuous signing in native ASL.

The experiments show that facial blurring has relatively little negative impact on downstream performance, demonstrating that signer privacy can be protected without significant degradation in translation quality. They also underline the effectiveness of self-supervised pretraining for SLT and the importance of capturing long-range spatiotemporal dependencies in signed utterances: pretraining on longer video clips is necessary for high performance, and the model's ability to abstract away surface-level information such as background and signer appearance is crucial for learning meaningful sign representations. Further ablations compare text models and data augmentation strategies, finding that T5 outperforms BART and that data augmentation improves generalization without significant storage costs. The paper also discusses demographic biases in SLT models and the need for future research to address them.

Overall, the paper demonstrates that self-supervised pretraining can advance SLT while accounting for privacy risks, and highlights the potential of SSVP-SLT to scale sign language processing and improve translation quality, offering a promising path through the challenges of data scarcity and privacy for the benefit of d/Deaf and hard-of-hearing communities.
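The anonymization step described above relies on facial obfuscation. The paper's own pipeline is not detailed here, but a minimal sketch of frame-level face blurring, assuming an off-the-shelf OpenCV Haar-cascade detector and a hypothetical input file `signing_clip.mp4`, could look like this:

```python
import cv2

# Minimal sketch of face obfuscation for video anonymization, assuming an
# off-the-shelf Haar-cascade face detector; the paper's actual anonymization
# method is not specified here and may differ.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def blur_faces(frame, kernel=(51, 51)):
    """Detect faces in a BGR frame and replace each region with a Gaussian blur."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(frame[y:y + h, x:x + w], kernel, 0)
    return frame

cap = cv2.VideoCapture("signing_clip.mp4")  # hypothetical input path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    anonymized = blur_faces(frame)
    # write `anonymized` to an output video writer here
cap.release()
```

Any comparable obfuscation could be dropped in at the same point; the framework only assumes that identities are obscured before the videos enter pretraining.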
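The self-supervised stage uses masked autoencoding over video. As an illustration only (the sizes, masking ratio, and module layout below are placeholders, not the paper's SSVP-SLT architecture), a stripped-down MAE objective on pre-extracted spatiotemporal patch tokens can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stripped-down MAE-style pretraining objective on flattened spatiotemporal
# patch tokens. All hyperparameters are illustrative placeholders.
class TinyVideoMAE(nn.Module):
    def __init__(self, patch_dim=768, dim=256, depth=4, heads=4, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, patch_dim)  # lightweight reconstruction head

    def forward(self, patches):
        # patches: (batch, num_tokens, patch_dim), flattened video patches
        b, n, _ = patches.shape
        tokens = self.embed(patches)
        keep = int(n * (1 - self.mask_ratio))
        # random masking: encode only the visible subset of tokens
        perm = torch.rand(b, n, device=patches.device).argsort(dim=1)
        visible_idx = perm[:, :keep]
        gather_idx = visible_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        encoded = self.encoder(torch.gather(tokens, 1, gather_idx))
        # scatter encoded tokens back; masked positions get the mask token
        full = self.mask_token.expand(b, n, -1).clone()
        full.scatter_(1, gather_idx, encoded)
        recon = self.decoder(full)
        # reconstruction loss only on the masked patches
        masked = torch.ones(b, n, dtype=torch.bool, device=patches.device)
        masked[torch.arange(b, device=patches.device).unsqueeze(1), visible_idx] = False
        return F.mse_loss(recon[masked], patches[masked])

model = TinyVideoMAE()
loss = model(torch.randn(2, 128, 768))  # 2 clips, 128 patch tokens each
loss.backward()
```

In the full framework, the pretrained video encoder would then be finetuned together with a text model (the paper reports T5 outperforming BART) on the curated parallel SLT data.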
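Translation quality is reported in BLEU-4. For reference, corpus-level BLEU (computed over n-grams up to order 4 by default) can be obtained with the sacrebleu package; the hypothesis and reference sentences below are invented purely for illustration:

```python
import sacrebleu

# Toy example of corpus-level BLEU scoring; sentences are made up.
hypotheses = ["the daily moth reports news in american sign language"]
references = [["the daily moth reports the news in american sign language"]]
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```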