AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

26 Jul 2024 | Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou
AIR-Bench is a benchmark designed to evaluate the ability of large audio-language models (LALMs) to understand diverse audio signals and interact with humans in text. It comprises two dimensions: a foundation benchmark and a chat benchmark. The foundation benchmark covers 19 tasks with approximately 19k single-choice questions and assesses the basic single-task abilities of LALMs. The chat benchmark contains 2k instances of open-ended question-and-answer data and directly assesses a model's comprehension of complex audio and its capacity to follow instructions. Both benchmarks require the model to generate its hypotheses directly.

AIR-Bench introduces a unified evaluation framework that leverages advanced language models, such as GPT-4, to score the generated hypotheses given the textual meta-information of the audio. Experimental results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation. By revealing the limitations of existing LALMs, the evaluation results can help guide future research. The dataset and evaluation code are available at https://github.com/OFA-Sys/AIR-Bench.
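To make this judging setup concrete, here is a minimal sketch of how an LLM-as-judge scorer over textual meta-information could be wired up. It is not the authors' actual prompt or scoring procedure: the prompt wording, the 1-10 scale, and the judge_hypothesis helper are illustrative assumptions; only the standard OpenAI chat-completions call is taken from the real API.

```python
# Minimal sketch of LLM-as-judge scoring (not the paper's exact prompt or scale).
# Assumes the openai Python package (>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an audio-language model.
Audio meta-information: {meta}
Question: {question}
Model hypothesis: {hypothesis}
Rate how well the hypothesis answers the question given the meta-information,
on a scale from 1 (useless) to 10 (perfect). Reply with the number only."""


def judge_hypothesis(meta: str, question: str, hypothesis: str) -> int:
    """Ask a GPT-4 judge for a score, given the audio's textual meta-information."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                meta=meta, question=question, hypothesis=hypothesis
            ),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())


# Example usage (hypothetical meta-information and hypothesis):
# score = judge_hypothesis(
#     meta="A dog barks twice while rain falls in the background.",
#     question="What animal can be heard, and what is the weather like?",
#     hypothesis="A dog is barking and it seems to be raining.",
# )
```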
AIR-Bench is characterized by three primary features: comprehensive coverage of audio signals, a hierarchical benchmark structure, and a unified, objective, and reproducible evaluation framework. Its two ability dimensions, foundation and chat, span audio types such as speech, sound, and music. The foundation dimension comprises 19 distinct leaf abilities, each assessed in a single-choice format, while the chat dimension is assessed through open-ended question answering over diverse audio sources and mixed audio. To increase the complexity of the audio, AIR-Bench proposes a novel audio mixing strategy with loudness control and temporal dislocation, sketched at the end of this summary.

The paper evaluates various LALMs on both the foundation and chat benchmarks using their latest publicly available checkpoints. The results show that existing LALMs have either limited audio understanding or limited instruction-following capabilities, leaving significant room for improvement. The paper also discusses the limitations of AIR-Bench, including the absence of tasks involving comparisons across multiple audio clips and the reliance on GPT-4 for evaluation, and it notes ethical considerations such as the use of publicly available datasets and the potential for data misuse.

The authors conclude that AIR-Bench is the first generative evaluation benchmark for large audio-language models, encompassing a wide array of audio including speech, natural sounds, and music. It proposes a novel audio mixing strategy to simulate real-world audio more faithfully and employs a standardized, objective, and reproducible framework to automatically assess the quality of hypotheses generated by LALMs. The authors also plan to launch and maintain a leaderboard that will serve as a platform for the community to access and compare model performance consistently over time.
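As a rough illustration of the audio mixing strategy mentioned above, the following sketch overlays a secondary clip onto a primary one with a decibel gain (loudness control) and a random start offset (temporal dislocation). The gain range, offset policy, and function name are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch of mixing two clips with loudness control and a random
# temporal offset. Assumes float waveforms in [-1, 1] at the same sample rate.
from typing import Optional

import numpy as np


def mix_with_offset(primary: np.ndarray,
                    secondary: np.ndarray,
                    sample_rate: int,
                    gain_db: float = -6.0,
                    max_offset_s: float = 2.0,
                    rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Overlay `secondary` onto `primary` at reduced loudness and a random start time."""
    rng = rng or np.random.default_rng()

    # Loudness control: attenuate the secondary clip by a gain in decibels.
    gain = 10.0 ** (gain_db / 20.0)
    secondary = secondary * gain

    # Temporal dislocation: shift the secondary clip by a random offset.
    offset = int(rng.uniform(0.0, max_offset_s) * sample_rate)

    mixed = primary.copy()
    end = min(len(mixed), offset + len(secondary))
    if end > offset:
        mixed[offset:end] += secondary[:end - offset]

    # Normalize only if summation pushed the signal past full scale.
    peak = np.max(np.abs(mixed))
    if peak > 1.0:
        mixed = mixed / peak
    return mixed
```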