BAT: Learning to Reason about Spatial Sounds with Large Language Models

25 May 2024 | Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath
The paper introduces BAT, a large language model (LLM) designed to reason about spatial sounds in a 3D environment. BAT combines the spatial sound perception of a binaural acoustic scene analysis model with the natural language reasoning of an LLM. Because no large-scale dataset of in-the-wild spatial sounds exists, the authors synthesize a binaural audio dataset by spatializing AudioSet clips with SoundSpaces 2.0. On top of this audio, they build SPATIAL-SOUNDQA, a spatial sound question-answering dataset whose tasks range from sound event detection to spatial reasoning, and use it to train and evaluate BAT's perception and reasoning abilities.

BAT's acoustic front-end encoder, SPATIAL-AST, is a novel binaural spatial audio encoder that achieves strong performance on sound event detection, spatial localization (direction-of-arrival estimation), and distance estimation. Integrating SPATIAL-AST with the LLaMA-2 7B model lets BAT reason about the relationships between multiple sounds in its environment, going beyond standard Sound Event Localization and Detection (SELD) tasks. Experiments demonstrate BAT's strong performance on both spatial sound perception and reasoning, showcasing the potential of LLMs for navigating and interpreting complex spatial audio environments. A sketch of this encoder-plus-LLM architecture is given below.

The authors also review related work, detail the generation of the spatial audio data, and present the architectures and training objectives of SPATIAL-AST and BAT. They conclude by discussing limitations and future directions, emphasizing the potential impact on spatial audio perception, multimodal LLMs, and embodied AI systems.
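To make the architecture concrete, here is a minimal PyTorch sketch of the overall design the summary describes: a binaural (two-channel) spectrogram is encoded by a Spatial-AST-style Transformer, the resulting audio tokens are linearly projected into the LLM's embedding space, and they are prepended to the embedded question tokens before being fed to the decoder. All module names, layer counts, and dimensions below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpatialAudioEncoder(nn.Module):
    """Stand-in for a SPATIAL-AST-style encoder: patchifies a two-channel
    (binaural) mel spectrogram and runs the patches through a small
    Transformer. Hyperparameters are illustrative."""
    def __init__(self, patch: int = 16, dim: int = 768):
        super().__init__()
        # 2 input channels = left/right ears; non-overlapping patches.
        self.patchify = nn.Conv2d(2, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 2, n_mels, n_frames)
        x = self.patchify(spec)              # (batch, dim, H', W')
        x = x.flatten(2).transpose(1, 2)     # (batch, num_patches, dim)
        return self.encoder(x)               # (batch, num_patches, dim)

class BATStyleModel(nn.Module):
    """Audio tokens are projected into the LLM embedding space and
    prepended to the embedded question tokens, so the decoder attends
    over [audio tokens; question tokens]."""
    def __init__(self, llm_embed: nn.Embedding, llm: nn.Module, audio_dim: int = 768):
        super().__init__()
        self.audio_encoder = SpatialAudioEncoder(dim=audio_dim)
        self.proj = nn.Linear(audio_dim, llm_embed.embedding_dim)  # audio -> LLM space
        self.llm_embed, self.llm = llm_embed, llm

    def forward(self, spec: torch.Tensor, question_ids: torch.Tensor) -> torch.Tensor:
        audio_tok = self.proj(self.audio_encoder(spec))  # (batch, Na, llm_dim)
        text_tok = self.llm_embed(question_ids)          # (batch, Nt, llm_dim)
        return self.llm(torch.cat([audio_tok, text_tok], dim=1))

# Toy stand-ins for the LLM pieces; in the paper this role is played by
# LLaMA-2 7B (vocab/width values below are assumptions for that model).
llm_embed = nn.Embedding(32000, 4096)
llm = nn.Identity()  # placeholder for the decoder stack
model = BATStyleModel(llm_embed, llm)
out = model(torch.randn(1, 2, 128, 1024),           # 1 binaural spectrogram
            torch.randint(0, 32000, (1, 12)))       # 12 question token ids
```

The key design point this sketch captures is that the LLM itself is unchanged: spatial perception lives entirely in the audio encoder, and a single learned projection bridges the two token spaces.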