**AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension**
This paper introduces AIR-Bench, a benchmark designed to evaluate how well large audio-language models (LALMs) understand and interact with different types of audio, including human speech, natural sounds, and music. AIR-Bench consists of two components: a foundation benchmark and a chat benchmark. The foundation benchmark comprises 19 tasks with approximately 19,000 single-choice questions and assesses the basic single-task capabilities of LALMs. The chat benchmark contains over 2,000 open-ended question-and-answer instances and directly evaluates a model's comprehension of complex audio and its ability to follow instructions. Both benchmarks require models to generate their answers (hypotheses) directly as free-form text, which are then scored by a unified framework that uses an advanced language model, GPT-4, as the judge.
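To make the judging step concrete, here is a minimal sketch of what a GPT-4-as-judge loop for free-form hypotheses could look like. It assumes the `openai` Python client (v1 interface); the prompt wording, the 1-to-10 scale, and the helper name `judge` are illustrative assumptions rather than the paper's exact implementation.

```python
# Hypothetical sketch of AIR-Bench-style generative evaluation: the LALM
# produces a free-form answer ("hypothesis"), and GPT-4 acts as the judge.
# The prompt and scoring scale below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating an audio-language model.
Question: {question}
Reference answer: {reference}
Model answer: {hypothesis}
Rate the model answer from 1 to 10 for correctness and relevance.
Reply with a single integer."""


def judge(question: str, reference: str, hypothesis: str) -> int:
    """Score a generated hypothesis against the reference answer with GPT-4."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, hypothesis=hypothesis
            ),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())


# Example with a placeholder hypothesis produced by some LALM under test.
score = judge(
    question="What emotion does the speaker convey?",
    reference="The speaker sounds excited and happy.",
    hypothesis="The speaker seems cheerful and enthusiastic.",
)
print(score)
```

Because both benchmarks are scored from generated text rather than forced-choice logits, the same judging interface can cover single-choice and open-ended items alike.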
Through these evaluation results, the paper highlights the limitations of existing LALMs and offers insights into future research directions. The evaluation framework is designed to be standardized, objective, and reproducible, enabling consistent and fair comparisons across models. The authors evaluate nine prominent open-source LALMs and show that existing models remain limited in audio understanding and instruction following.
Key contributions of the paper include:
- AIR-Bench, the first generative evaluation benchmark for LALMs, covering a wide range of audio types.
- A novel audio mixing strategy with loudness control and temporal dislocation to increase the complexity of audio signals (a minimal sketch of this kind of mixing appears after this list).
- A unified, objective, and reproducible evaluation framework using GPT-4 to assess the quality of generated hypotheses.
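The sketch below illustrates the kind of mixing the second contribution describes: one clip is scaled relative to another (loudness control) and shifted in time (temporal dislocation) before the two are summed. The gain range, offset logic, and function name `mix_with_offset` are assumptions for illustration, not the authors' exact recipe.

```python
# Illustrative sketch of mixing two audio clips with loudness control and a
# temporal offset, assuming mono float waveforms at the same sample rate.
import numpy as np


def mix_with_offset(primary: np.ndarray,
                    secondary: np.ndarray,
                    gain_db: float = -6.0,
                    offset_ratio: float = 0.5) -> np.ndarray:
    """Mix `secondary` into `primary` at a relative loudness and time offset.

    gain_db: loudness of `secondary` relative to `primary` (negative = quieter).
    offset_ratio: where `secondary` starts, as a fraction of `primary`'s length.
    """
    def rms(x: np.ndarray) -> float:
        return float(np.sqrt(np.mean(x ** 2) + 1e-12))

    # Loudness control: match RMS levels, then apply the requested relative gain.
    secondary = secondary * (rms(primary) / rms(secondary))
    secondary = secondary * (10.0 ** (gain_db / 20.0))

    # Temporal dislocation: start the secondary clip part-way through the primary.
    offset = int(offset_ratio * len(primary))
    mixed = primary.copy()
    end = min(len(primary), offset + len(secondary))
    mixed[offset:end] += secondary[: end - offset]

    # Normalize only if the sum would clip.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

Controlling both the relative loudness and the start offset lets a benchmark builder ask questions that hinge on which sound is louder or which event happens first, which is what makes the mixed audio harder than either clip alone.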
The paper also discusses related work, experimental results, and ethical considerations, emphasizing the importance of addressing biases and data misuse in automated evaluation methods.