Language-Codec is a discrete acoustic codec model designed to bridge the gap between discrete codec representations and speech language models. It introduces a Masked Channel Residual Vector Quantization (MCRVQ) mechanism, along with improved Fourier transform structures, larger training datasets, refined discriminator design, and optimized hyperparameter selection to address the challenges of existing discrete codecs. The model achieves excellent audio reconstruction quality with only four quantizer channels, enhancing compatibility with downstream models. Language-Codec outperforms competing audio compression algorithms across various metrics and test datasets. It is open-sourced with pre-trained models available at https://github.com/jishengpeng/languagecodec.
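To make the quantizer design concrete, the sketch below shows the plain residual vector quantization (RVQ) that MCRVQ builds on, with four stacked codebooks matching the four channels mentioned above. The channel-masking step that distinguishes MCRVQ is deliberately omitted here, since its exact formulation is not given in this summary; the class name, codebook size, and feature dimension are illustrative assumptions, not values from the paper.

```python
import torch


class ResidualVQ(torch.nn.Module):
    """Minimal RVQ sketch: stage k quantizes the residual left over by
    stages 1..k-1, so the four code streams together reconstruct the
    encoder output. Hyperparameters are illustrative, not the paper's."""

    def __init__(self, num_quantizers: int = 4,
                 codebook_size: int = 1024, dim: int = 128):
        super().__init__()
        self.codebooks = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(codebook_size, dim))
             for _ in range(num_quantizers)]
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, dim) continuous encoder features.
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for codebook in self.codebooks:
            # Nearest-codeword lookup against the current residual.
            dists = torch.cdist(
                residual, codebook.unsqueeze(0).expand(x.shape[0], -1, -1)
            )
            idx = dists.argmin(dim=-1)   # (batch, frames) discrete codes
            chosen = codebook[idx]       # (batch, frames, dim)
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        # codes: (batch, frames, num_quantizers) token streams for a
        # downstream speech language model; quantized: reconstructed features.
        return quantized, torch.stack(codes, dim=-1)
```

In MCRVQ, per the paper's description, a masking constraint on the early quantizer channels changes how information is distributed across these stages; the residual cascade itself is as above.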
The model is trained on a comprehensive 50,000-hour speech dataset, including LibriLight, DNS Challenge 4, Common Voice, LibriTTS, and internal Chinese data. It is evaluated on the LibriTTS test set, demonstrating superior performance in audio reconstruction quality, speaker similarity, and generalization across clean and noisy environments. Language-Codec also performs well in zero-shot text-to-speech tasks, showing improved speaker similarity and audio quality compared to other models. Ablation experiments confirm the effectiveness of the MCRVQ mechanism in enhancing audio reconstruction quality and reducing the difficulty of text generation for downstream tasks. The model is a state-of-the-art foundational codec for future research in speech generation.