26 Feb 2024 | Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang
This paper proposes VQScore, a self-supervised speech quality estimation method trained solely on clean speech. The method is based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE) trained on clean speech: because the VQ-VAE's codebook acts as a high-level representation of clean speech, the quantization error of an input signal correlates strongly with human quality ratings. Concretely, VQScore is defined as the average cosine similarity between the encoder embeddings and their quantized counterparts in the code space.

The study also explores self-supervised speech enhancement within the same VQ-VAE framework, training the encoder and decoder to handle noisy speech. To improve the encoder's robustness, the authors combine a self-distillation mechanism with adversarial training. Experimental results show that both VQScore and the enhancement model are competitive with supervised baselines, with particularly large gains under mismatched conditions, where the self-supervised models generalize more robustly than supervised ones. These findings highlight the potential of self-supervised learning for speech processing tasks where labeled data is scarce or unavailable. The code will be released after publication.
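To make the scoring step concrete, below is a minimal sketch of how VQScore could be computed from a trained model, assuming frame-level encoder embeddings and a learned codebook are already available. The tensor shapes, the 512-entry codebook, and the `vqscore` helper are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the VQScore computation, assuming a VQ-VAE encoder and
# codebook trained on clean speech. Shapes and sizes here are placeholders.
import torch
import torch.nn.functional as F


def vqscore(embeddings: torch.Tensor, codebook: torch.Tensor) -> float:
    """Average cosine similarity between encoder embeddings and their
    nearest codebook entries (higher = closer to clean-speech codes).

    embeddings: (T, D) frame-level encoder outputs for one utterance
    codebook:   (K, D) learned VQ-VAE codebook
    """
    # Nearest-neighbor quantization: assign each frame embedding to the
    # codebook vector with the smallest Euclidean distance.
    dists = torch.cdist(embeddings, codebook)   # (T, K) pairwise distances
    codes = codebook[dists.argmin(dim=1)]       # (T, D) quantized embeddings
    # VQScore: mean cosine similarity in the code space.
    return F.cosine_similarity(embeddings, codes, dim=1).mean().item()


# Toy usage with random tensors standing in for a trained model's outputs.
torch.manual_seed(0)
emb = torch.randn(100, 64)    # 100 frames of 64-dim encoder embeddings
book = torch.randn(512, 64)   # hypothetical 512-entry codebook
print(f"VQScore: {vqscore(emb, book):.4f}")
```

Since the codebook is fit only to clean speech, noisy or distorted inputs land farther from their nearest codes, so this similarity drops as perceived quality degrades, which is what lets the score act as a reference-free quality estimate.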