GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

17 Jun 2024 | Sreyan Ghosh*, Sonal Kumar*, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
GAMA is a novel General-purpose Large Audio-Language Model (LALM) with advanced audio understanding and complex reasoning abilities. The model integrates an LLM with multiple types of audio representations, including features from a custom Audio Q-Former and from a multi-layer aggregator that combines features drawn from multiple layers of an Audio Spectrogram Transformer (AST) encoder. GAMA is first fine-tuned on a large-scale audio-language dataset to strengthen its audio understanding capabilities.

To endow GAMA with complex reasoning abilities, the authors propose CompA-R, a synthetically generated instruction-tuning (IT) dataset whose instructions require the model to perform complex reasoning over the input audio. During instruction tuning on CompA-R, a soft prompt carrying high-level semantic evidence, derived from event tags of the input audio, is added to the model input. The authors also introduce CompA-R-test, a human-labeled evaluation dataset for assessing LALMs on open-ended audio question-answering that requires complex reasoning.

Through automated and expert human evaluations, GAMA outperforms all other LALMs in the literature on diverse audio understanding tasks by margins of 1%-84%, and GAMA instruction-tuned on CompA-R proves superior in its complex reasoning and instruction-following capabilities. Trained on a mixture of open-source datasets, GAMA surpasses prior audio-language models on 16 datasets spanning 4 tasks, evaluated on benchmarks covering classification, captioning, and open-ended audio question-answering, with performance assessed through both human annotations and automated methods.
The results demonstrate that GAMA has superior audio understanding and complex reasoning capabilities compared to other LALMs.
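The summary above does not specify how GAMA's multi-layer aggregator is implemented. As a minimal sketch of the general idea, one common way to combine features from several encoder layers is a learned, softmax-normalized weighted sum across layers; all function and variable names below are hypothetical, not taken from the GAMA codebase:

```python
import numpy as np

def aggregate_layers(layer_features, layer_logits):
    """Combine per-layer encoder features into a single representation.

    layer_features: list of (time, dim) arrays, one per encoder layer.
    layer_logits:   one scalar per layer; softmax turns them into weights.
    """
    # softmax over the per-layer logits (numerically stable form)
    w = np.exp(layer_logits - np.max(layer_logits))
    w = w / w.sum()
    stacked = np.stack(layer_features)       # (layers, time, dim)
    # contract the layer axis: weighted sum over layers -> (time, dim)
    return np.tensordot(w, stacked, axes=1)

# toy example: 3 layers, 4 time frames, 8-dim features
feats = [np.full((4, 8), float(i)) for i in range(3)]
agg = aggregate_layers(feats, np.zeros(3))   # equal weights -> layer mean
```

With zero logits the weights are uniform, so the result is simply the mean of the layer features; in practice the logits would be trained jointly with the rest of the model so it can emphasize whichever encoder layers carry the most useful acoustic information.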