GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

2024-06-17 | Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
GAMA is a novel General-purpose Large Audio-Language Model (LALM) designed to enhance audio understanding and complex reasoning abilities. The model integrates an LLM with multiple types of audio representations, including features from a custom Audio Q-Former and an Audio Spectrogram Transformer (AST), which are aggregated by a multi-layer aggregator. GAMA is fine-tuned on a large-scale audio-language dataset to improve its audio understanding capabilities. To further enhance complex reasoning, the authors propose CompA-R, a synthetically generated instruction-tuning dataset with complex reasoning tasks. GAMA is then instruction-tuned on CompA-R, and its performance is evaluated on a human-labeled dataset, CompA-R-test, which assesses open-ended audio question-answering. GAMA outperforms other LALMs on diverse audio understanding tasks, demonstrating significant improvements in complex reasoning and instruction-following capabilities. The paper also discusses limitations and future work, including the potential for extending GAMA to music understanding and using larger LLMs.
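To make the aggregation idea concrete, here is a minimal, illustrative sketch of fusing feature vectors drawn from multiple encoder layers into a single representation. This is an assumption-based toy in plain Python, not GAMA's actual aggregator; the function name, uniform weighting, and vector shapes are all hypothetical.

```python
def aggregate_features(layer_features, weights=None):
    """Fuse same-dimensional feature vectors from multiple layers
    into one vector via a weighted sum (uniform by default).

    Hypothetical sketch: GAMA's multi-layer aggregator is more
    sophisticated; this only illustrates the general idea of
    combining representations from several layers/encoders.
    """
    n = len(layer_features)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weighting by default
    dim = len(layer_features[0])
    fused = [0.0] * dim
    for w, feats in zip(weights, layer_features):
        for i in range(dim):
            fused[i] += w * feats[i]
    return fused

# Toy example: three "layers", each yielding a 4-dimensional feature vector.
layers = [[1.0, 2.0, 3.0, 4.0],
          [2.0, 2.0, 2.0, 2.0],
          [3.0, 0.0, 1.0, 0.0]]
fused = aggregate_features(layers)
print(fused)  # uniform average of the three vectors
```

In a real LALM, such fused audio embeddings would be projected into the LLM's token-embedding space so the language model can attend over them alongside text tokens.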