14 May 2024 | Jeong Hun Yeo*, Seunghee Han*, Minsu Kim, Yong Man Ro†
This paper introduces a novel framework called VSP-LLM, which integrates Large Language Models (LLMs) to enhance visual speech processing. Visual speech processing involves recognizing and translating lip movements into text, a task complicated by homophenes: words that look identical on the lips yet produce different sounds. The VSP-LLM framework leverages the strong context modeling capabilities of LLMs to disambiguate such cases and improve the accuracy of visual speech recognition (VSR) and visual speech translation (VST).
The framework employs a self-supervised visual speech model to embed input video into the latent space of an LLM. To reduce computational load, a deduplication method is introduced that compresses redundant visual features while preserving contextual information. The method relies on visual speech units, discrete indices obtained by clustering features from a self-supervised speech model, to represent lip movements: consecutive frames that map to the same unit are merged, and the compressed features are then projected into the LLM's input space, enabling efficient and context-aware processing.
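To make the deduplication idea concrete, here is a minimal sketch of what such a step could look like, assuming each video frame has already been assigned a discrete visual speech unit (e.g., a k-means cluster index over self-supervised features). Runs of frames sharing the same unit are collapsed by averaging their feature vectors. The function name, tensor shapes, and pooling choice are illustrative assumptions, not the paper's released implementation.

```python
import torch


def deduplicate_visual_features(features: torch.Tensor,
                                units: torch.Tensor) -> torch.Tensor:
    """Collapse consecutive frames that map to the same visual speech unit.

    features: (T, D) frame-level visual features from the self-supervised encoder.
    units:    (T,)   discrete unit index assigned to each frame (e.g., a cluster id).
    Returns a (T', D) tensor with T' <= T, where each run of identical units
    is replaced by the average of its feature vectors.
    """
    assert features.shape[0] == units.shape[0]
    pooled = []
    start = 0
    for t in range(1, units.shape[0] + 1):
        # Close the current run when the unit changes or the sequence ends.
        if t == units.shape[0] or units[t] != units[start]:
            pooled.append(features[start:t].mean(dim=0))
            start = t
    return torch.stack(pooled, dim=0)


# Example: 6 frames with units [3, 3, 7, 7, 7, 1] collapse to 3 pooled vectors.
feats = torch.randn(6, 4)
ids = torch.tensor([3, 3, 7, 7, 7, 1])
print(deduplicate_visual_features(feats, ids).shape)  # torch.Size([3, 4])
```

Averaging within a run keeps one representative vector per unit while shortening the sequence the LLM has to attend over, which is where the computational savings come from.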
The VSP-LLM framework is trained on a combination of the LRS3 dataset (433 hours of English audio-visual speech) and MuAViC (a multilingual audio-visual corpus). It demonstrates strong performance on both VSR and VST, even with limited training data: a VSP-LLM model trained on just 30 hours of labeled data outperforms a recent translation model trained on the full 433 hours. The framework achieves state-of-the-art results on the MuAViC benchmark, showing that integrating LLMs significantly improves both the accuracy and the efficiency of visual speech processing.
The key contributions of this work include the first integration of visual speech modeling with LLMs, the development of a unified model for VSR and VST, a novel visual speech deduplication method that improves computational efficiency, and a demonstration that LLMs remain effective when labeled training data is limited. The framework also shows that the context modeling ability of LLMs can effectively resolve the ambiguity of homophenes in visual speech processing.
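As an illustration of how a single set of weights can serve both VSR and VST, the sketch below shows one plausible way to condition an LLM on the task: deduplicated visual features are projected into the LLM embedding space and prepended to an embedded task instruction, so only the instruction text changes between recognition and translation. The projector design, hidden sizes, and instruction strings are assumptions made for this example, not the paper's exact interface.

```python
import torch
import torch.nn as nn


class VisualToLLMProjector(nn.Module):
    """Maps deduplicated visual features into the LLM embedding space.

    A single linear layer and these hidden sizes are illustrative choices.
    """
    def __init__(self, visual_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_feats)  # (T', visual_dim) -> (T', llm_dim)


def build_llm_inputs(visual_feats, instruction_embeds, projector):
    """Prepend projected visual tokens to the embedded task instruction.

    Swapping the instruction (e.g., "Transcribe the speech." vs.
    "Translate the speech into Spanish.") switches between VSR and VST
    without changing any model weights.
    """
    visual_tokens = projector(visual_feats)
    return torch.cat([visual_tokens, instruction_embeds], dim=0)


# Toy shapes: 3 pooled visual frames, a 5-token instruction, LLM width 4096.
projector = VisualToLLMProjector()
visual = torch.randn(3, 1024)
instruction = torch.randn(5, 4096)  # stands in for tokenizer + embedding lookup
print(build_llm_inputs(visual, instruction, projector).shape)  # torch.Size([8, 4096])
```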