The paper introduces AV-wav2vec2, a multichannel multi-modal self-supervised learning framework aimed at improving far-field multichannel speech processing. The framework leverages video and multichannel audio data to enhance speech recognition in noisy environments. Key contributions include:
1. **Model Structure**: The framework consists of a multichannel audio-visual branch and a single-channel audio branch. The multichannel audio-visual branch comprises a visual encoder, an audio encoder, a visual-audio fusion module, and a Transformer encoder: the visual encoder is a modified ResNet-18, the audio encoder stacks eight convolutional layers, the fusion module combines the visual and audio features, and the Transformer encoder learns contextual representations (see the architecture sketch after this summary).
2. **Contrastive Loss Functions**: Intra- and inter-channel contrastive losses exploit the spatiotemporal information in multichannel speech data, using the different channels to provide self-supervised pre-training targets (a loss sketch follows the summary).
3. **Additional Single-Channel Audio Data**: To address the scarcity of labeled multichannel data, the framework additionally uses unlabeled single-channel audio data for multi-task joint training, improving multi-modal representation learning (the joint objective is sketched after the summary).
4. **Downstream Tasks**: The pre-trained model is fine-tuned on several downstream tasks, including audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR), and audio-visual speaker diarization (AVSD). Experimental results show significant performance improvements on far-field, mid-field, and near-field data.
5. **Evaluation**: The effectiveness of the proposed method is validated on the MISP2021-AVSR dataset and the WenetSpeech dataset. The results demonstrate that AV-wav2vec2 outperforms existing methods, particularly in noisy conditions and with limited labeled data.
6. **Conclusion**: The paper highlights the potential of AV-wav2vec2 to leverage visual and audio information for speech recognition in challenging environments, making it a valuable contribution to the field of multichannel speech processing.
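
To make the described model structure concrete, here is a minimal PyTorch sketch of the multichannel audio-visual branch. The microphone count, layer sizes, convolution schedule, Transformer depth, and the truncation-based audio-visual alignment are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the multichannel audio-visual branch (all sizes are assumptions).
import torch
import torch.nn as nn
import torchvision.models as models


class AudioEncoder(nn.Module):
    """Eight 1-D convolutional layers over raw multichannel waveforms, as described in the summary."""
    def __init__(self, mics: int = 6, dim: int = 512):
        super().__init__()
        layers, ch = [], mics
        # Kernel sizes and strides follow a wav2vec2-style schedule; they are assumptions here.
        for kernel, stride in [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2), (2, 2)]:
            layers += [nn.Conv1d(ch, dim, kernel, stride), nn.GELU()]
            ch = dim
        self.conv = nn.Sequential(*layers)

    def forward(self, wav):                      # wav: (batch, mics, samples)
        return self.conv(wav).transpose(1, 2)    # (batch, frames, dim)


class VisualEncoder(nn.Module):
    """Modified ResNet-18 producing one embedding per video frame (simplified to a 2-D backbone)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(512, dim)

    def forward(self, frames):                   # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))       # (batch * time, 512)
        return self.proj(feats).view(b, t, -1)            # (batch, time, dim)


class MultichannelAVBranch(nn.Module):
    """Fuse audio and visual features, then learn contextual representations with a Transformer."""
    def __init__(self, dim: int = 512, layers: int = 12, heads: int = 8):
        super().__init__()
        self.audio_enc = AudioEncoder(dim=dim)
        self.visual_enc = VisualEncoder(dim=dim)
        self.fusion = nn.Linear(2 * dim, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, wav, frames):
        a = self.audio_enc(wav)
        v = self.visual_enc(frames)
        t = min(a.size(1), v.size(1))            # crude alignment by truncation (an assumption)
        fused = self.fusion(torch.cat([a[:, :t], v[:, :t]], dim=-1))
        return self.transformer(fused)           # (batch, t, dim)
```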
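
The intra- and inter-channel contrastive losses can be pictured with the following hedged sketch. The InfoNCE form, the choice of positives (same frame in the same channel vs. the same frame in a different channel), and random negative sampling are assumptions inferred from the summary, not the paper's exact formulation; masking of prediction positions is omitted for brevity.

```python
# Sketch of intra-/inter-channel contrastive losses (InfoNCE-style); pairing scheme is an assumption.
import torch
import torch.nn.functional as F


def info_nce(queries, positives, negatives, temperature: float = 0.1):
    """queries, positives: (N, D); negatives: (N, K, D)."""
    q = F.normalize(queries, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    pos_logits = (q * pos).sum(-1, keepdim=True)          # (N, 1)
    neg_logits = torch.einsum("nd,nkd->nk", q, neg)       # (N, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                # the positive sits at index 0


def channel_contrastive_losses(context, targets, num_negatives: int = 10):
    """context, targets: (channels, frames, dim) for one utterance.
    Intra-channel: the positive is the target of the *same* channel at the same frame.
    Inter-channel: the positive is the target of a *different* channel at the same frame."""
    C, T, D = context.shape
    q = context.reshape(-1, D)
    flat_targets = targets.reshape(-1, D)
    # Randomly sampled distractor targets shared by both losses (a simplification).
    neg = flat_targets[torch.randint(0, C * T, (C * T, num_negatives))]

    loss_intra = info_nce(q, flat_targets, neg)
    loss_inter = info_nce(q, targets.roll(shifts=1, dims=0).reshape(-1, D), neg)
    return loss_intra, loss_inter
```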
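
Finally, a minimal sketch of how the extra single-channel audio data could enter pre-training through multi-task joint training: the two branch losses are simply combined, with the interpolation weight `alpha` as a hypothetical hyperparameter rather than a value taken from the paper.

```python
# Sketch of the multi-task joint objective; alpha is a hypothetical weight, not taken from the paper.
import torch


def joint_pretraining_loss(loss_multichannel_av: torch.Tensor,
                           loss_single_channel_audio: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Sum the multichannel audio-visual branch loss with the single-channel audio branch loss."""
    return loss_multichannel_av + alpha * loss_single_channel_audio
```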