7 May 2024 | Ryan Wong, Necati Cihan Camgoz, Richard Bowden
**Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation**
**Authors:** Ryan Wong, Necati Cihan Camgoz, Richard Bowden
**Institutions:** University of Surrey, Meta Reality Labs
**Abstract:**
Automatic Sign Language Translation (SLT) requires integrating computer vision and natural language processing to bridge the communication gap between sign and spoken languages. However, the lack of large-scale training data for sign language translation necessitates the use of resources from spoken language. Sign2GPT is a novel framework that utilizes large-scale pre-trained vision and language models via lightweight adapters for gloss-free sign language translation. The lightweight adapters are crucial due to the constraints of limited dataset sizes and computational requirements when training with long sign videos. A novel pretraining strategy is proposed, which directs the encoder to learn sign representations from automatically extracted pseudo-glosses without requiring gloss order information or annotations. The approach is evaluated on two public benchmark sign language translation datasets, RWTH-PHOENIX-Weather 2014T and CSL-Daily, and demonstrates significant improvements over state-of-the-art gloss-free translation performance.
**Introduction:**
Sign languages are the primary form of communication for millions of Deaf individuals and convey meaning through complex visual gestures. Automatic SLT is challenging because it requires understanding both sign and spoken language semantics. Prior studies often rely on gloss annotations, which are resource-intensive and time-consuming to create, so gloss-free SLT, which does not require manual gloss annotations, has gained attention. Sign2GPT addresses these challenges by leveraging large-scale pre-trained vision and language models. The approach introduces a novel pretraining strategy that automatically generates pseudo-glosses from spoken language sentences and pretrains the sign encoder against them, eliminating the need for both manual gloss annotations and gloss order information.
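To illustrate how pseudo-glosses can be derived automatically from spoken language sentences, the snippet below is a minimal sketch that lemmatizes a sentence with spaCy and keeps only content words as an unordered pseudo-gloss set. The German model name and the POS filter are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative pseudo-gloss extraction (assumed filtering rules, not the paper's exact method).
# Requires: python -m spacy download de_core_news_sm
import spacy

# German model, since RWTH-PHOENIX-Weather 2014T sentences are in German.
nlp = spacy.load("de_core_news_sm")

# Content-word POS tags kept as pseudo-glosses (assumed filter).
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM"}

def extract_pseudo_glosses(sentence: str) -> list[str]:
    """Return lemmatized content words as an unordered list of pseudo-glosses."""
    doc = nlp(sentence)
    return [tok.lemma_.lower() for tok in doc if tok.pos_ in CONTENT_POS]

print(extract_pseudo_glosses("Am Montag wird es im Norden stark regnen."))
# e.g. ['montag', 'norden', 'stark', 'regnen']
```

Because the resulting pseudo-glosses carry no ordering or alignment information, the sign encoder can be pretrained to localize them in the video without any manual gloss annotation.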
**Method:**
Sign2GPT consists of a pretraining stage and a downstream translation stage. The pretraining stage uses pseudo-glosses to learn visual-linguistic sign representations, while the downstream translation stage conditions a frozen GPT-style language model on the learned sign features. The spatial backbone is a DINOv2 Vision Transformer, the sign encoder models spatio-temporal sign representations, and the language decoder is the multilingual XGLM model, adapted with zero-gated multi-head cross-attention and LoRA for parameter-efficient adaptation.
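To make the adaptation mechanism concrete, below is a minimal PyTorch sketch of a zero-gated cross-attention block of the kind described above: queries come from the frozen language model's hidden states, keys and values come from the sign encoder, and the output is scaled by a learnable gate initialized to zero so the pretrained decoder is left unchanged at the start of training. Module names, dimensions, and the tanh gating are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a zero-gated cross-attention adapter (assumed design details).
import torch
import torch.nn as nn

class ZeroGatedCrossAttention(nn.Module):
    """Cross-attention whose output is scaled by a learnable gate initialized to
    zero, so the frozen language model starts out unmodified and gradually learns
    to attend to sign features."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate

    def forward(self, text_hidden, sign_features):
        # Queries from the (frozen) LM hidden states; keys/values from the sign encoder.
        attn_out, _ = self.attn(self.norm(text_hidden), sign_features, sign_features)
        return text_hidden + torch.tanh(self.gate) * attn_out

# Usage: text hidden states from a frozen decoder layer, sign features from the encoder.
text_hidden = torch.randn(2, 12, 1024)   # (batch, text_len, d_model)
sign_feats = torch.randn(2, 64, 1024)    # (batch, sign_len, d_model)
layer = ZeroGatedCrossAttention(d_model=1024, n_heads=16)
out = layer(text_hidden, sign_feats)     # equals text_hidden at initialization
print(out.shape)  # torch.Size([2, 12, 1024])
```

In the full model, a block like this would sit inside each frozen decoder layer alongside LoRA adapters on the attention projections, so only the lightweight adapters and gates are trained while the large pretrained weights stay fixed.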
**Results:**
Sign2GPT is evaluated on two datasets: RWTH-PHOENIX-Weather 2014T and CSL-Daily. The model demonstrates significant improvements in BLEU-4 scores compared to previous state-of-the-art methods, particularly in gloss-free translation. Qualitative results show effective translation and pseudo-gloss localization.
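For context on the reported metric, BLEU-4 on SLT benchmarks is commonly computed with sacrebleu; the snippet below shows a typical corpus-level call. The example strings and default tokenization settings are assumptions, not the paper's evaluation script.

```python
# Illustrative BLEU-4 computation with sacrebleu; the default uses up to 4-gram
# precision, matching the BLEU-4 figures reported on SLT benchmarks.
import sacrebleu

hypotheses = ["am montag regnet es im norden"]           # model outputs (assumed examples)
references = [["am montag wird es im norden regnen"]]    # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU-4: {bleu.score:.2f}")
```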
**Conclusion:**
Sign2GPT addresses the challenge of gloss-free sign language translation by leveraging large-scale pre-trained models and a novel pretraining strategy. The approach significantly improves translation performance and offers a promising direction for fusing visual features with spoken language models.