7 Jan 2024 | Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai
This paper proposes AV-wav2vec2, a multichannel multi-modal speech self-supervised learning framework that takes video and multichannel audio data as inputs. The framework processes the multichannel audio streams and the visual stream in parallel, using intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. It additionally incorporates single-channel audio data through multi-task joint training to improve the learned speech representations. The framework is validated on audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR), and audio-visual speaker diarization (AVSD) tasks using a Chinese multichannel multi-modal dataset collected in real scenarios.
The proposed framework consists of a multichannel audio-visual branch and a single-channel audio branch. The multichannel audio-visual branch comprises a visual encoder, an audio encoder, a visual-audio fusion module, and a Transformer encoder. The visual encoder uses a modified ResNet-18 to extract visual features, while the audio encoder processes the multichannel audio data. The visual-audio fusion module concatenates the visual and audio features, and the Transformer encoder learns contextual representations from the fused sequence. The additional single-channel audio data is used for multi-task joint training to improve multi-modal representation learning.
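For illustration, here is a minimal PyTorch-style sketch of the multichannel audio-visual branch as described above (a ResNet-18 visual front-end, a shared per-channel audio feature encoder, concatenation-based fusion, and a Transformer encoder). The module names, dimensions, and layer choices are assumptions made for readability, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm


class AVBranchSketch(nn.Module):
    """Hypothetical sketch of the multichannel audio-visual branch:
    per-channel audio encoding, visual encoding, concatenation fusion,
    and a Transformer encoder for contextual representations."""

    def __init__(self, num_channels=4, feat_dim=512, num_layers=12, num_heads=8):
        super().__init__()
        # Visual front-end: the paper uses a modified ResNet-18; here we simply
        # reuse torchvision's ResNet-18 trunk (output dim 512 = feat_dim assumed).
        resnet = tvm.resnet18(weights=None)
        self.visual_encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Audio front-end: 1-D convolutional feature encoder shared across channels
        # (wav2vec2-style); kernel/stride values are illustrative only.
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2), nn.GELU(),
        )
        fused_dim = feat_dim * (num_channels + 1)  # all audio channels + video
        self.fusion_proj = nn.Linear(fused_dim, feat_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, video, multichannel_audio):
        # video: (B, T, 3, H, W); multichannel_audio: (B, C, samples)
        b, t = video.shape[:2]
        v = self.visual_encoder(video.flatten(0, 1)).flatten(1)      # (B*T, 512)
        v = v.view(b, t, -1)                                         # (B, T, 512)
        chans = []
        for c in range(multichannel_audio.shape[1]):
            a = self.audio_encoder(multichannel_audio[:, c:c + 1])   # (B, D, T_a)
            a = nn.functional.interpolate(a, size=t)                 # align to video rate
            chans.append(a.transpose(1, 2))                          # (B, T, D)
        fused = torch.cat(chans + [v], dim=-1)                       # (B, T, D*(C+1))
        return self.context_encoder(self.fusion_proj(fused))         # (B, T, D)
```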
The self-supervised pre-training task uses both intra- and inter-channel contrastive loss functions to improve pre-training efficiency and better leverage the spatial information across different microphone channels. The total loss combines the intra-channel contrastive loss, the inter-channel contrastive loss, and the single-channel audio loss. The framework is evaluated on AVSR, ASR, VSR, and AVSD tasks, showing consistent performance gains on far-field, mid-field, and near-field data. For AVSD, the pre-trained model is additionally used as a feature extractor and improves performance over baseline models. Overall, the proposed framework outperforms existing methods on AVSR, ASR, VSR, and AVSD, particularly when additional unlabeled data is used, and it remains effective in noisy environments with multiple speakers, TV sounds, and other interference. The model is validated on a large-scale Chinese multichannel multi-modal dataset, demonstrating its effectiveness for multichannel multi-modal speech recognition.
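As a rough illustration of how such training targets might be combined, the sketch below uses wav2vec2-style InfoNCE terms: an intra-channel term (distractors drawn from the same channel), an inter-channel term (distractors drawn from other channels), and a precomputed single-channel audio loss, summed into a total objective. The weighting, negative-sampling strategy, and exact formulation are assumptions and may differ from the paper.

```python
import torch
import torch.nn.functional as F


def info_nce(context, targets, distractors, temperature=0.1):
    """Generic InfoNCE: similarity of the context to the true target versus a
    set of distractors. Shapes: context/targets (B, T, D), distractors (B, T, K, D)."""
    pos = F.cosine_similarity(context, targets, dim=-1)                    # (B, T)
    neg = F.cosine_similarity(context.unsqueeze(2), distractors, dim=-1)   # (B, T, K)
    logits = torch.cat([pos.unsqueeze(2), neg], dim=2) / temperature       # (B, T, 1+K)
    labels = torch.zeros(logits.shape[:2], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())


def total_pretraining_loss(context, targets_per_channel, sample_distractors,
                           single_channel_loss):
    """Hypothetical combination of the three terms described in the summary:
    L_total = L_intra + L_inter + L_single.
    `targets_per_channel` is a list of (B, T, D) quantized targets, one per channel;
    `sample_distractors(channel, same_channel)` returns (B, T, K, D) negatives;
    `single_channel_loss` is the loss of the single-channel audio branch."""
    num_channels = len(targets_per_channel)
    intra = sum(
        info_nce(context, targets_per_channel[c], sample_distractors(c, same_channel=True))
        for c in range(num_channels)) / num_channels
    inter = sum(
        info_nce(context, targets_per_channel[c], sample_distractors(c, same_channel=False))
        for c in range(num_channels)) / num_channels
    return intra + inter + single_channel_loss
```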