This paper introduces BAT, a large language model (LLM) designed to reason about spatial sounds in a 3D environment. BAT combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of an LLM. To address the lack of datasets of in-the-wild spatial sounds, the authors synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. They also developed SPATIALSOUNDQA, a spatial sound-based question-answering dataset, to train BAT on various aspects of spatial sound perception and reasoning.
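As a rough illustration of how such a QA dataset can be assembled, the sketch below pairs the metadata of a rendered binaural clip (class label, direction, distance) with a templated question-answer pair. The function name `make_qa_example` and the templates are illustrative assumptions, not the authors' actual pipeline.

```python
import random

# Illustrative question templates; the real dataset covers sound detection,
# direction/distance perception, and reasoning over pairs of sounds.
TEMPLATES = [
    ("What sound do you hear?", lambda m: m["label"]),
    ("Where is the sound coming from, and how far away is it?",
     lambda m: f'{m["direction"]}, roughly {m["distance_m"]:.1f} m away'),
]

def make_qa_example(metadata: dict) -> dict:
    """Pair a rendered binaural clip with one sampled question/answer (hypothetical)."""
    question, answer_fn = random.choice(TEMPLATES)
    return {
        "audio_path": metadata["audio_path"],
        "question": question,
        "answer": answer_fn(metadata),
    }

example = make_qa_example({
    "audio_path": "clip_0001.wav",  # binaural render of an AudioSet clip
    "label": "dog barking",         # AudioSet class label
    "direction": "front left",      # coarse direction derived from source azimuth
    "distance_m": 2.5,              # source-to-listener distance in meters
})
print(example["question"], "->", example["answer"])
```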
BAT's acoustic front end is a novel spatial audio encoder, the Spatial Audio Spectrogram Transformer (SPATIAL-AST), which achieves strong performance in sound event detection, spatial localization, and distance estimation. By integrating SPATIAL-AST with the LLaMA-2 7B model, BAT goes beyond standard Sound Event Localization and Detection (SELD), reasoning about the relationships between the sounds in its environment. The authors demonstrate BAT's strong performance on both spatial sound perception and reasoning tasks, showcasing the potential of LLMs for navigating and interpreting complex spatial audio environments.
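To make the integration concrete, here is a minimal sketch of the common pattern for connecting an audio encoder to an LLM: project the encoder's output tokens into the LLM's embedding space and prepend them to the embedded question. The two-layer MLP projector and the dimensions (768 for the encoder output, 4096 for LLaMA-2 7B's hidden size) are plausible assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AudioToLLMProjector(nn.Module):
    """Minimal sketch: map spatial audio encoder features into the LLM's
    token-embedding space so they can be prepended to text embeddings."""
    def __init__(self, audio_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector; an assumption, not the paper's exact module.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, n_audio_tokens, audio_dim)
        return self.proj(audio_feats)  # (batch, n_audio_tokens, llm_dim)

# The projected audio tokens and the embedded question form one prefix
# sequence that is fed to the LLM, which then generates the answer text.
projector = AudioToLLMProjector()
audio_feats = torch.randn(2, 32, 768)   # stand-in for SPATIAL-AST output
text_embeds = torch.randn(2, 20, 4096)  # stand-in for embedded question tokens
llm_inputs = torch.cat([projector(audio_feats), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 52, 4096])
```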
The key contributions of this work include SPATIALSOUNDQA, the first spatial audio-based question-answering dataset; SPATIAL-AST, a spatial audio encoder; and BAT, the first spatial audio-based LLM. The authors also present a comprehensive evaluation of BAT across sound event detection, direction and distance estimation, and spatial reasoning. The results show strong performance on all tasks: a mean Average Precision (mAP) of 50.03% for audio event classification, a Mean Angular Error (MAE) of 17.94° for direction-of-arrival estimation, and a Distance Error Rate (DER) of 32.54% for distance estimation.
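For readers unfamiliar with the localization metrics, the sketch below computes them under common definitions: MAE as the great-circle angle between predicted and true direction vectors, and DER as the fraction of predictions whose distance deviates beyond a tolerance. The 0.5 m tolerance is an assumption; the paper's exact threshold may differ.

```python
import numpy as np

def angular_error_deg(pred: np.ndarray, true: np.ndarray) -> np.ndarray:
    """Great-circle angle (degrees) between predicted and true direction vectors."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    true = true / np.linalg.norm(true, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def distance_error_rate(pred_m: np.ndarray, true_m: np.ndarray,
                        tol_m: float = 0.5) -> float:
    """Fraction of predictions off by more than tol_m meters (tolerance assumed)."""
    return float(np.mean(np.abs(pred_m - true_m) > tol_m))

preds = np.array([[1.0, 0.1, 0.0]])
trues = np.array([[1.0, 0.0, 0.0]])
print(angular_error_deg(preds, trues))                          # ~5.7 degrees
print(distance_error_rate(np.array([2.0]), np.array([2.8])))    # 1.0 (off by 0.8 m)
```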
The authors also discuss the limitations of their approach, including the need for more extensive datasets and the potential for future work in expanding the model to handle more complex spatial audio scenarios. They conclude that BAT represents a significant advancement in the field of spatial audio perception and reasoning, with the potential to impact various applications, including virtual reality, gaming, and audio engineering.