SIGN2GPT: LEVERAGING LARGE LANGUAGE MODELS FOR GLOSS-FREE SIGN LANGUAGE TRANSLATION

2024 | Ryan Wong, Necati Cihan Camgoz, Richard Bowden
Sign2GPT is a framework for gloss-free sign language translation that couples large-scale pretrained vision and language models through lightweight adapters, addressing the limited size of sign language datasets and the computational cost of training on long sign videos. Its central contribution is a pretraining strategy in which the sign encoder learns representations from pseudo-glosses extracted automatically from the spoken-language sentences, requiring neither gloss order information nor manual gloss annotations. Evaluated on the RWTH-PHOENIX-Weather 2014T and CSL-Daily benchmarks, the approach yields substantial gains in gloss-free translation performance.

The architecture consists of a spatial backbone based on the DINOv2 Vision Transformer, which extracts features from individual video frames, a spatio-temporal transformer that aggregates these features into sign representations, and a language decoder built on the pretrained XGLM model, modified to condition on the sign features and generate spoken-language text. The pretrained components are kept frozen and adapted with low-rank adapters, which keeps training efficient and reduces memory requirements.

During pretraining, pseudo-glosses generated from the spoken-language sentences serve as weak, order-free supervision for the sign encoder.

Sign2GPT outperforms previous gloss-free translation methods on both benchmarks. Ablation studies and qualitative results further show that the learned representations capture semantic content and improve translation accuracy.
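To make two of the components above more concrete, the sketches below illustrate how low-rank adapters can be attached to frozen pretrained layers and how unordered pseudo-glosses can be derived from spoken-language sentences. Both are minimal illustrations written under stated assumptions, not the authors' implementation: the adapter rank and scaling, the choice of layers to wrap, the part-of-speech tag set, and the spaCy model are all assumptions.

```python
# Minimal sketch of low-rank adaptation (LoRA) for frozen pretrained layers.
# Rank, scaling and which layers to wrap are illustrative assumptions only.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update x @ A @ B."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # pretrained weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.empty(base.in_features, rank))
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        nn.init.normal_(self.lora_a, std=0.02)
        # B starts at zero, so the adapter initially leaves outputs unchanged.
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale


def add_lora(module: nn.Module, rank: int = 4) -> nn.Module:
    """Recursively wrap every nn.Linear in `module` with a LoRA adapter."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            add_lora(child, rank=rank)
    return module


if __name__ == "__main__":
    # Toy feed-forward block standing in for one layer of a frozen decoder.
    block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    add_lora(block, rank=4)
    out = block(torch.randn(2, 10, 64))       # (batch, time, features)
    trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
    total = sum(p.numel() for p in block.parameters())
    print(out.shape, f"trainable {trainable} of {total} parameters")
```

The second sketch covers the pseudo-gloss side, assuming a lemmatize-and-filter-by-part-of-speech approach over German sentences such as those in PHOENIX-2014T; the exact word classes kept and the tagging tool are assumptions.

```python
# Hedged sketch: deriving unordered pseudo-glosses from a spoken-language
# sentence by keeping lemmatized content words. The POS tag set and the spaCy
# model are illustrative assumptions, not the authors' exact pipeline.
import spacy

# German model for PHOENIX-2014T sentences; install with:
#   python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

# Content-word classes kept as pseudo-glosses (an assumption).
KEEP_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM"}


def pseudo_glosses(sentence: str) -> set[str]:
    """Return the set of lemmatized content words, with no order information."""
    doc = nlp(sentence)
    return {tok.lemma_.lower() for tok in doc if tok.pos_ in KEEP_POS}


if __name__ == "__main__":
    print(pseudo_glosses("am morgen regnet es im norden kräftig"))
    # e.g. {"morgen", "regnen", "norden", "kräftig"}: used only as weak,
    # order-free targets when pretraining the sign encoder.
```

In both cases the pretrained weights stay untouched: only the small adapter matrices are trained, and the gloss extraction itself has no trainable parameters, which is what keeps the approach lightweight relative to full fine-tuning.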
Overall, the work contributes to sign language translation a method that brings large language models and visual-linguistic sign features together for effective gloss-free translation.