PAM: Prompting Audio-Language Models for Audio Quality Assessment

1 Feb 2024 | Soham Deshmukh¹, Dareen Alharthi¹, Benjamin Elizalde², Hannes Gamper², Mahmoud Al Ismail², Rita Singh¹, Bhiksha Raj¹, Huaming Wang²
PAM is a reference-free audio quality assessment metric that uses Audio-Language Models (ALMs) to evaluate audio quality across tasks. ALMs are pre-trained on audio-text pairs, and those pairs may carry information about audio quality, artifacts, or noise. PAM uses two opposing prompts to derive a score, which correlates well with human listening scores and with existing metrics. Because the method is zero-shot, it needs no reference dataset, task-specific training, or fine-tuning, which makes it efficient and applicable to a range of audio generation tasks. PAM was evaluated on four tasks: text-to-audio generation (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS).
The results show that PAM detects audio quality and distortions well, particularly in tasks where human listening scores are available for comparison. It is weaker on speech generation tasks, where task-specific metrics are more accurate. Overall, PAM is a promising general-purpose approach to audio quality assessment: it can be used across different audio tasks and distributions, and it is especially useful in scenarios where human listening scores are not available.
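The two-prompt scoring described above can be illustrated with a minimal sketch. It assumes that the ALM exposes one embedding for the audio clip and one for each text prompt, that the audio embedding is compared to each prompt embedding by cosine similarity, and that a two-way softmax turns the two similarities into a score between 0 and 1. The prompt wording is not given in this summary, so the "good" and "bad" prompts are placeholders. The embedding vectors are hand-made toy values rather than real model outputs, and the function names and the `scale` parameter are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def two_prompt_score(
    audio_emb: Sequence[float],
    good_prompt_emb: Sequence[float],
    bad_prompt_emb: Sequence[float],
    scale: float = 1.0,  # hypothetical temperature-like factor, not from the source
) -> float:
    """Softmax probability assigned to the 'good quality' prompt.

    Returns a value in (0, 1); higher means the audio embedding lies
    closer to the 'good' prompt than to the 'bad' prompt.
    """
    sim_good = scale * cosine_similarity(audio_emb, good_prompt_emb)
    sim_bad = scale * cosine_similarity(audio_emb, bad_prompt_emb)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(sim_good, sim_bad)
    e_good = math.exp(sim_good - m)
    e_bad = math.exp(sim_bad - m)
    return e_good / (e_good + e_bad)


# Toy embeddings standing in for ALM outputs (assumed, not real model data).
good_prompt = [1.0, 0.0, 0.0]   # placeholder for a "clean audio" prompt embedding
bad_prompt = [0.0, 1.0, 0.0]    # placeholder for a "noisy audio" prompt embedding
clean_clip = [0.9, 0.1, 0.0]    # lies closer to the "good" prompt
noisy_clip = [0.2, 0.8, 0.1]    # lies closer to the "bad" prompt

print("clean clip score:", two_prompt_score(clean_clip, good_prompt, bad_prompt))
print("noisy clip score:", two_prompt_score(noisy_clip, good_prompt, bad_prompt))
```

In this sketch a score above 0.5 means the clip sits closer to the "good" prompt, and no reference recording is involved at any point, which is what makes the approach reference-free. With a real ALM, the toy vectors would be replaced by the model's audio and text embeddings.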