November 3-7, 2014, Orlando, Florida, USA | Justin Salamon, Christopher Jacoby, Juan Pablo Bello
This paper presents a dataset and taxonomy for urban sound research. The authors identify two main challenges in urban sound classification: the lack of a common taxonomy and the scarcity of large, real-world annotated data. To address these issues, they propose a taxonomy of urban sounds and introduce the UrbanSound dataset, which contains 27 hours of audio with 18.5 hours of annotated sound events across 10 classes. The dataset is based on field recordings and includes thousands of labeled sound source occurrences. The authors also present UrbanSound8K, a subset of the dataset designed for training sound classification algorithms.
The taxonomy is organized under four top-level categories: human, nature, mechanical, and music. Its lower levels are deliberately detailed, going down to specific sound sources such as car horns, jackhammers, and sirens. The taxonomy draws on noise complaints filed through New York City's 311 service, so that it reflects the sound categories and sources that people complain about most often.
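A hierarchy like this maps naturally onto a tree. The following is a minimal sketch in Python, assuming a nested-dict representation. Only the four top-level categories and the three example sources come from the text. The intermediate groups (`construction`, `motorized transport`, `signals`), the placement of the sources under `mechanical`, and the helper `find_path` are illustrative assumptions, not the paper's actual tree.

```python
# Hypothetical fragment of the urban sound taxonomy as a nested dict.
# Top-level categories follow the text; intermediate groups and leaf
# placement are illustrative assumptions, not the paper's actual tree.
TAXONOMY = {
    "human": {},       # sub-levels omitted in this sketch
    "nature": {},      # sub-levels omitted in this sketch
    "mechanical": {
        "construction": ["jackhammer"],
        "motorized transport": ["car horn"],
        "signals": ["siren"],
    },
    "music": {},       # sub-levels omitted in this sketch
}


def find_path(tree, source, prefix=None):
    """Return the node names from a top-level category down to `source`,
    or None if the source does not appear in the tree."""
    prefix = prefix or []
    for name, child in tree.items():
        if isinstance(child, list):
            # Leaf level: a list of concrete sound sources.
            if source in child:
                return prefix + [name, source]
        else:
            # Inner node: recurse into the sub-dictionary.
            found = find_path(child, source, prefix + [name])
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    print(find_path(TAXONOMY, "jackhammer"))
    # ['mechanical', 'construction', 'jackhammer']
```

Returning the full path, rather than just the top-level category, lets a classifier's fine-grained label be evaluated at any level of the hierarchy.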
The UrbanSound dataset was collected from Freesound, an online repository of user-uploaded recordings. The authors manually annotated the data, resulting in 3075 labeled occurrences amounting to 18.5 hours of labeled audio. The UrbanSound8K subset contains 8732 labeled slices (8.75 hours) with a maximum slice duration of 4 seconds, based on experiments showing that 4 seconds is sufficient for accurate classification.
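Because UrbanSound8K caps each slice at 4 seconds, longer labeled occurrences have to be cut into several slices. The following is a minimal sketch of that step, assuming occurrences are given as start and end times in seconds. The function name `slice_occurrence`, the optional `hop` parameter, and the choice to keep a shorter final slice are assumptions, not the procedure documented in the paper.

```python
def slice_occurrence(start, end, max_dur=4.0, hop=None):
    """Split a labeled occurrence [start, end) in seconds into slices of at
    most `max_dur` seconds. `hop` is the step between slice starts and
    defaults to `max_dur` (non-overlapping slices). The final slice may be
    shorter than `max_dur`."""
    if end <= start:
        raise ValueError("end must be greater than start")
    hop = max_dur if hop is None else hop
    if max_dur <= 0 or hop <= 0:
        raise ValueError("max_dur and hop must be positive")
    slices = []
    t = start
    while True:
        slices.append((t, min(t + max_dur, end)))
        if t + max_dur >= end:
            break  # this slice already reaches the end of the occurrence
        t += hop
    return slices


if __name__ == "__main__":
    print(slice_occurrence(0.0, 10.0))
    # [(0.0, 4.0), (4.0, 8.0), (8.0, 10.0)]
    print(slice_occurrence(1.0, 3.5))
    # [(1.0, 3.5)]
```

Passing a `hop` smaller than `max_dur` produces overlapping slices. This yields more training examples from the same recording, but the resulting samples are correlated.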
The authors ran a series of classification experiments with a baseline approach based on MFCC features, comparing several standard classifiers. They found that slice duration significantly affects classification accuracy and that a maximum of 4 seconds is sufficient, which motivated the slice length used in UrbanSound8K. They also observed that some sound classes are more sensitive to slice duration than others, which suggests that these classes should be analyzed at longer temporal scales.
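A slice-duration experiment like this one needs two data-side steps: cutting each slice down to a maximum duration, and collapsing a variable number of feature frames into one fixed-length vector that a standard classifier can accept. The following is a minimal sketch of both steps, assuming per-frame feature vectors (for example MFCCs) have already been computed at a known frame rate. The function names, the frame rate in the usage line, and the particular statistics (min, max, mean, population variance) are assumptions, not the paper's exact configuration.

```python
import statistics


def truncate_frames(frames, max_dur, frames_per_second):
    """Keep only the frames that fall within the first `max_dur` seconds."""
    # Small epsilon guards against float round-off in the product.
    n_keep = int(max_dur * frames_per_second + 1e-9)
    return frames[:n_keep]


def summarize(frames):
    """Collapse an (n_frames x n_coeffs) list of feature vectors into one
    fixed-length vector: for each coefficient, append its min, max, mean,
    and population variance across frames."""
    if not frames:
        raise ValueError("need at least one frame")
    features = []
    for column in zip(*frames):  # iterate coefficient by coefficient
        features.extend([
            float(min(column)),
            float(max(column)),
            statistics.fmean(column),
            float(statistics.pvariance(column)),
        ])
    return features


if __name__ == "__main__":
    # Hypothetical 2-coefficient features for a 200-frame slice.
    frames = [[0.1 * i, 0.2 * i] for i in range(200)]
    vec = summarize(truncate_frames(frames, 4.0, 43))
    print(len(vec))  # 4 statistics x 2 coefficients
    # 8
```

Repeating the evaluation while varying `max_dur` gives accuracy as a function of slice duration. Per-class accuracies from the same runs show which classes benefit from longer context.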
The study highlights the main challenges of urban sound classification: sensitivity to temporal scale, confusion between classes with similar timbre, and sensitivity to background interference. The authors expect the dataset to open new avenues for research in sound and multimedia applications, particularly those focused on urban environments and urban informatics.