25 Jun 2024 | Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen
AudioBench is a comprehensive benchmark designed to evaluate audio large language models (AudioLLMs). It comprises 8 tasks and 26 datasets covering speech understanding, voice interpretation, and audio scene understanding, and it addresses the lack of comprehensive evaluations for AudioLLMs by supplying relevant datasets and metrics. Four models were evaluated, and no single model excels across all tasks. AudioBench releases open-source code, data, and a leaderboard to support future model development.

The benchmark incorporates diverse prompt templates and varies input lengths to assess performance on longer audio sequences. It also explores evaluation metrics for open-ended generation using model-as-judge methods. The results confirm that no model consistently outperforms the others, highlighting opportunities for future improvement.

AudioBench emphasizes evaluation across tasks and modalities, including speech, environmental sounds, and paralinguistic features. Its datasets span speech understanding, audio scene understanding, and voice understanding, with a focus on English audio, while also noting the need for multilingual capabilities and code-switching. A cascade model, Whisper+Llama3, is included and performs well on speech-intensive tasks; a sketch of such a cascade follows below. Overall, AudioBench aims to provide a robust testbed for future AudioLLM development, emphasizing comprehensive evaluation and continued improvement in audio processing and understanding.
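To make the cascade idea concrete, here is a minimal sketch of a Whisper+Llama3-style pipeline: the audio is first transcribed, then a text-only LLM answers the question about the transcript. The model identifiers, prompt wording, and generation settings are illustrative assumptions, not AudioBench's exact configuration, and the chat-style pipeline call assumes a recent version of Hugging Face transformers.

```python
# Sketch of a cascade AudioLLM baseline: speech -> text (Whisper),
# then text -> answer (Llama-3-style chat model).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def cascade_answer(audio_path: str, question: str) -> str:
    transcript = asr(audio_path)["text"]  # transcribe the audio clip
    messages = [
        {"role": "system", "content": "Answer questions about the given transcript."},
        {"role": "user", "content": f"Transcript: {transcript}\n\nQuestion: {question}"},
    ]
    out = llm(messages, max_new_tokens=256)  # chat-format generation
    return out[0]["generated_text"][-1]["content"]  # last message is the reply

print(cascade_answer("sample.wav", "What is the speaker's main request?"))
```

Because the LLM only ever sees the transcript, such a cascade is strong on speech-intensive tasks but cannot use paralinguistic or environmental-sound cues, which is consistent with the mixed results reported above.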
The benchmark highlights the importance of robustness to diverse queries and the need for better evaluation metrics for open-ended generation; the model-as-judge approach sketched after this summary is one such direction. The study also discusses future research directions, including long-audio processing, multi-round query handling, multilingual capabilities, and speech generation, with the aim of advancing research and improving AudioLLM capabilities through comprehensive evaluation and benchmarking.
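As an illustration of the model-as-judge metric mentioned above, the sketch below prompts a text LLM to grade a candidate answer against a reference on a 0-5 scale. The judge model, rubric wording, and score parsing are illustrative assumptions rather than AudioBench's exact judging prompt.

```python
# Sketch of a model-as-judge metric for open-ended answers:
# an LLM grades the candidate response against a reference answer.
import re
from transformers import pipeline

judge = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def judge_score(question: str, reference: str, candidate: str) -> int:
    prompt = [
        {"role": "system",
         "content": "You are a strict grader. Reply with a single integer from 0 to 5."},
        {"role": "user",
         "content": (f"Question: {question}\n"
                     f"Reference answer: {reference}\n"
                     f"Model answer: {candidate}\n"
                     "Score the model answer for correctness and relevance (0-5):")},
    ]
    reply = judge(prompt, max_new_tokens=8)[0]["generated_text"][-1]["content"]
    match = re.search(r"\d", reply)  # take the first digit as the score
    return int(match.group()) if match else 0

# Averaging judge_score over a dataset yields a benchmark-level number
# for open-ended generation quality.
```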