[slides and audio] SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

SemantiCodec is a novel audio codec that compresses audio into fewer than 100 tokens per second across diverse audio types, including speech, general audio, and music, without substantially compromising quality. It uses a dual-encoder architecture: a semantic encoder built on the self-supervised AudioMAE model, whose features are discretized with k-means clustering, and an acoustic encoder that captures the remaining details. A diffusion-model-based decoder reconstructs the audio from the outputs of both encoders. SemantiCodec has three variants with token rates of 25, 50, and 100 per second, supporting ultra-low bitrates between 0.31 kbps and 1.43 kbps. The code and demos are available at https://haoheliu.github.io/SemantiCodec/.

The approach combines strong generative models with the rich features learned by self-supervised models to drive semantic audio encoding and reconstruction. Experimental results show that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality, and that it achieves strong reconstruction performance across general audio types at exceptionally low token rates, surpassing counterparts that run at higher token rates. Evaluation on audio classification benchmarks shows that SemantiCodec's tokens carry significantly richer semantic information than those of other codecs, even at lower bitrates, which indicates strong potential for future audio language modelling.

The paper also discusses related work on neural audio codecs, semantic audio representation learning, and conditional audio generation. Its system overview describes the architecture of SemantiCodec, including semantic clustering, the encoder, and the latent diffusion model used for reconstruction. The results indicate that SemantiCodec has potential for efficient audio transmission and storage, as well as for audio-based language modelling, because it provides shorter discrete representations of audio without substantially compromising reconstruction quality.
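
Two ideas from the summary lend themselves to a small illustration: turning continuous encoder features into discrete tokens with a k-means codebook, and relating a token rate to a bitrate. The following is a minimal sketch in plain Python, not SemantiCodec's implementation. The function names, the toy 2-D centroids and frames, and the codebook size of 4096 are all hypothetical values chosen for illustration; the sketch also assumes a single codebook, whereas SemantiCodec combines a semantic and an acoustic encoder.

```python
import math


def bitrate_kbps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate of a token stream in which each token indexes one codebook entry."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000.0


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(frames, centroids):
    """Map each feature frame to the index of its nearest k-means centroid."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Hypothetical toy codebook (as if produced by k-means) and feature frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0], [3.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [2.8, 0.2]]

tokens = tokenize(frames, centroids)
print(tokens)

# Illustrative only: 50 tokens/s with an assumed 4096-entry codebook.
print(bitrate_kbps(50, 4096))
```

Because the bitrate grows only with the logarithm of the codebook size but linearly with the token rate, lowering the number of tokens per second is the more direct route to the sub-1.5 kbps range described above.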

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

30 Apr 2024 | Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley