1 Feb 2024 | Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang
The paper introduces PAM (Prompting Audio-Language Models), a novel reference-free metric for assessing audio quality across various tasks, including text-to-audio (TTA), text-to-music (TTM), text-to-speech (TTS), and deep noise suppression (DNS). PAM leverages Audio-Language Models (ALMs) trained on audio-text pairs to calculate a similarity score between an audio input and a text prompt related to quality. Unlike other reference-free metrics, PAM does not require computing embeddings on a reference dataset or training task-specific models on human listening scores. The evaluation of PAM against established metrics and human listening scores on four tasks shows that it correlates well with existing metrics and human perception. The paper also discusses the limitations of PAM, such as its performance in speech generation tasks and the need for fine-tuning ALMs. Overall, PAM demonstrates the potential of ALMs for computing a general-purpose audio quality metric.
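To make the core idea concrete, the following is a minimal sketch of how a prompt-based quality score could be computed from ALM embeddings. The summary above does not state which prompts are used or how the similarity is normalized, so the sketch assumes a contrast between two opposing quality prompts (one describing clean audio, one describing degraded audio) combined with a two-way softmax. The function names, the `scale` parameter, and the toy embedding vectors are all hypothetical; in practice the embeddings would come from the audio and text encoders of an actual ALM.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quality_score(audio_emb, good_prompt_emb, bad_prompt_emb, scale=1.0):
    """Probability-like score in (0, 1); higher means closer to the 'good' prompt.

    Assumption: the score is a two-way softmax over the scaled cosine
    similarities to a 'good quality' and a 'bad quality' text prompt.
    """
    sim_good = scale * cosine_similarity(audio_emb, good_prompt_emb)
    sim_bad = scale * cosine_similarity(audio_emb, bad_prompt_emb)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(sim_good, sim_bad)
    e_good = math.exp(sim_good - m)
    e_bad = math.exp(sim_bad - m)
    return e_good / (e_good + e_bad)


# Toy vectors standing in for ALM embeddings (hypothetical values).
audio = [0.9, 0.1, 0.2]  # embedding of the audio clip under test
good = [1.0, 0.0, 0.0]   # embedding of a prompt describing clean audio
bad = [0.0, 1.0, 0.0]    # embedding of a prompt describing degraded audio

score = quality_score(audio, good, bad)
print(f"quality score: {score:.3f}")
```

Because the score depends only on the audio clip and the two text prompts, this formulation needs neither a reference dataset nor a model trained on listening scores, which matches the reference-free property described above.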