26 Feb 2024 | Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang
This paper presents a novel self-supervised method for speech quality estimation and enhancement using only clean speech. The proposed method, named VQScore, leverages the quantization error of a vector-quantized variational autoencoder (VQ-VAE) to estimate speech quality. The VQ-VAE is trained on clean speech, and its quantization error is used as a metric to evaluate speech quality. The authors found that the quantization error in the code space (z) provides a higher correlation with human perception than the quantization error in the signal space (x). Additionally, the VQ-VAE can be used for self-supervised speech enhancement (SE) by incorporating a novel self-distillation mechanism combined with adversarial training. The proposed method is evaluated on various datasets and compared with supervised baselines, showing competitive performance in both objective and subjective evaluations. The results demonstrate that the proposed VQScore and SE model are effective and robust, achieving comparable or better performance than supervised methods without requiring labeled data during training.
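To illustrate the core idea of scoring quality via code-space quantization error, below is a minimal sketch, not the authors' implementation: it assumes access to a VQ-VAE encoder output `z` and its learned `codebook`, and uses cosine similarity between each frame's continuous code and its nearest codebook entry as the per-frame score (the exact similarity measure and tensor names here are illustrative assumptions).

```python
import torch
import torch.nn.functional as F


def code_space_quality_score(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Hypothetical utterance-level quality score from code-space quantization error.

    z:        encoder outputs per frame, shape (num_frames, dim)
    codebook: learned VQ-VAE codebook, shape (num_codes, dim)

    A higher average similarity between z and its quantized version means a
    smaller quantization error, i.e. the input lies closer to the clean-speech
    manifold the codebook was trained on.
    """
    # Distance from every frame's code to every codebook entry.
    dists = torch.cdist(z, codebook)            # (num_frames, num_codes)
    nearest = dists.argmin(dim=1)               # index of the closest code per frame
    z_q = codebook[nearest]                     # quantized codes, (num_frames, dim)

    # Frame-wise cosine similarity in the code space, averaged over the utterance.
    return F.cosine_similarity(z, z_q, dim=1).mean()


# Usage sketch: score a batch of encoder outputs from a pretrained (clean-speech) VQ-VAE.
if __name__ == "__main__":
    torch.manual_seed(0)
    dummy_z = torch.randn(100, 64)        # 100 frames, 64-dim codes (illustrative sizes)
    dummy_codebook = torch.randn(512, 64) # 512 codebook entries
    print(code_space_quality_score(dummy_z, dummy_codebook).item())
```

Because the codebook is fit only on clean speech, noisy or distorted inputs tend to map farther from their nearest codes, so the score drops without any labeled quality ratings being needed at training time.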