27 Mar 2024 | Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard
The paper introduces Real Acoustic Fields (RAF), a new audio-visual room acoustics dataset and benchmark of real-world room impulse response (RIR) recordings paired with multiple modalities. RAF provides high-quality, densely sampled RIRs together with multi-view images and precise 6DoF pose tracking for sound emitters and listeners. It is the first dataset to offer densely captured real-world room acoustic data, making it well suited for research on audio and audio-visual neural acoustic field modeling. The data was collected with a custom-built microphone tower and a robotic loudspeaker stand, which also enables capturing sound source directivity. Visual data was captured following the VR-NeRF approach, supporting high-fidelity visual reconstruction and novel-view synthesis.
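For context on what an RIR represents: convolving a dry (anechoic) source signal with a measured RIR reproduces what a listener would hear at the corresponding receiver position in the room. The sketch below illustrates this rendering step; it uses NumPy/SciPy and is not code from the paper or its dataset tooling.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_at_listener(dry_signal: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry (anechoic) source signal with a room impulse
    response to simulate what a listener hears at the RIR's receiver
    position. Both arrays are mono and share the same sample rate."""
    wet = fftconvolve(dry_signal, rir, mode="full")
    # Normalize to avoid clipping when writing the result to an audio file.
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Hypothetical usage with 48 kHz recordings (file names are made up):
# dry, sr = soundfile.read("dry_speech.wav")
# rir, _  = soundfile.read("measured_rir.wav")
# wet = render_at_listener(dry, rir)
```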
The authors evaluated existing audio and audio-visual models against multiple criteria and proposed settings that improve their performance on real-world data. They also ran experiments to measure the impact of incorporating visual information into neural acoustic field models. Additionally, they demonstrated the effectiveness of a simple sim2real approach, in which a model is pre-trained on simulated data and fine-tuned on sparse real-world data, yielding significant improvements in few-shot learning.
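A minimal sketch of this pretrain-then-finetune recipe follows. It is an illustration rather than the authors' training code: the model interface `model(emitter, listener) -> predicted RIR`, the plain MSE loss, and the loader names are all assumptions made for the example.

```python
import torch
import torch.nn as nn

def sim2real_finetune(model: nn.Module,
                      sim_loader, real_loader,
                      pretrain_epochs: int = 50,
                      finetune_epochs: int = 10,
                      finetune_lr: float = 1e-4) -> nn.Module:
    """Pretrain on plentiful simulated RIRs, then adapt to a handful
    of real measurements. Loaders yield (emitter, listener, rir)
    batches; the loss is a stand-in (spectral losses are common)."""
    loss_fn = nn.MSELoss()

    # Stage 1: pretrain on simulated impulse responses.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(pretrain_epochs):
        for emitter, listener, rir in sim_loader:
            opt.zero_grad()
            loss_fn(model(emitter, listener), rir).backward()
            opt.step()

    # Stage 2: fine-tune on sparse real data with a smaller learning
    # rate, so the simulated prior is adjusted rather than overwritten.
    opt = torch.optim.Adam(model.parameters(), lr=finetune_lr)
    for _ in range(finetune_epochs):
        for emitter, listener, rir in real_loader:
            opt.zero_grad()
            loss_fn(model(emitter, listener), rir).backward()
            opt.step()
    return model
```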
The paper also benchmarks several neural acoustic field models on RAF, including NAF, INRAS, NACF, and AV-NeRF, along with improved variants NAF++ and INRAS++. INRAS++ performed best on most metrics while retaining a lightweight architecture and fast inference speed. The sim2real approach significantly improved performance in few-shot scenarios, even with very limited real training data.
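To make the modeling task concrete, the toy module below shows the basic interface such a neural acoustic field learns: a mapping from emitter and listener positions to an impulse response. It is deliberately simplified and is not any of the benchmarked models; NAF, INRAS, and related architectures condition on richer inputs (e.g., time/frequency queries, orientation, scene geometry or visual features) and use more capable decoders.

```python
import torch
import torch.nn as nn

class ToyAcousticField(nn.Module):
    """Illustrative only: maps (emitter xyz, listener xyz) to a
    fixed-length impulse response with a small MLP."""

    def __init__(self, rir_len: int = 4096, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, rir_len),
        )

    def forward(self, emitter_xyz: torch.Tensor,
                listener_xyz: torch.Tensor) -> torch.Tensor:
        pose = torch.cat([emitter_xyz, listener_xyz], dim=-1)  # (B, 6)
        return self.net(pose)  # (B, rir_len) predicted RIR samples

# Training would regress predicted RIRs against measured ones,
# typically with an STFT-based spectral loss rather than raw L2.
```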
The experiments demonstrated that RAF is a valuable resource for evaluating and benchmarking novel-view acoustic synthesis models and impulse response generation techniques, filling a gap left by the lack of real-world data for such evaluations. The authors also discussed the dataset's limitations, including the cost and time required to collect real-world room impulse response data, as well as the potential for deceptive or misleading media if RIR data is used to produce audio that mimics real recordings from a specific room.