The DCASE 2024 Task 4 challenge focuses on improving sound event detection (SED) systems by leveraging diverse training data with varying annotation granularity and missing labels. The challenge aims to develop robust SED systems that can generalize across different scenarios, even when annotations are inconsistent or missing. Participants are encouraged to use both strong and weak labels, as well as data from different domains, to enhance system performance. The challenge includes two datasets: DESED and MAESTRO. DESED contains real-world and synthetic audio clips with annotated sound events, while MAESTRO provides long-form real-world recordings with soft labels.
The challenge introduces a baseline system that incorporates a convolutional recurrent neural network (CRNN) with self-supervised features from the BEATs pre-trained model. The system uses a dual-phase hyperparameter tuning approach to optimize performance. Key improvements include the use of a multi-class median filter and a dropstep regularization strategy. The system also addresses the issue of missing labels by cross-mapping sound event classes between datasets.
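The multi-class median filter mentioned above smooths frame-level posteriors independently per class, typically with a class-specific window length. A minimal sketch of this idea (the window lengths and function name here are illustrative assumptions, not the baseline's actual configuration):

```python
import numpy as np
from scipy.ndimage import median_filter

def multiclass_median_filter(posteriors, win_lengths):
    """Smooth frame-level posteriors with a per-class median filter.

    posteriors: (n_frames, n_classes) array of frame-wise scores.
    win_lengths: odd window sizes, one per class. In practice these
    are tuned per class on a development set; short windows suit
    impulsive events, longer ones suit sustained events.
    """
    smoothed = np.empty_like(posteriors)
    for c, win in enumerate(win_lengths):
        # mode="nearest" pads with edge values so clip boundaries
        # are not artificially pulled toward zero.
        smoothed[:, c] = median_filter(posteriors[:, c], size=win, mode="nearest")
    return smoothed
```

Because the median suppresses isolated outliers, a single-frame false positive shorter than half the window is removed, while event onsets and offsets shift less than they would under mean smoothing.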
The evaluation metrics include the polyphonic sound detection score (PSDS) and segment-based mean (macro-averaged) partial area under ROC curve (segMPAUC). The challenge also emphasizes energy efficiency, requiring participants to report energy consumption during training and testing. Results indicate that using diverse training data with missing labels can lead to stronger SED systems compared to training on individual datasets. The challenge encourages the development of novel methods to handle inconsistent annotations and missing labels, aiming to advance SED research in domestic environments.
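The segMPAUC metric can be sketched as a macro average, over classes, of the partial area under each class's segment-level ROC curve. The sketch below uses scikit-learn's standardized partial AUC; the `max_fpr` value and the binarization of labels are assumptions for illustration, not necessarily the challenge's exact protocol:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def segment_macro_pauc(y_true, y_score, max_fpr=0.1):
    """Macro-averaged partial AUC-ROC over classes on segment-level labels.

    y_true: (n_segments, n_classes) binary segment labels (soft labels,
    as in MAESTRO, would first be binarized against a threshold).
    y_score: (n_segments, n_classes) predicted segment scores.
    max_fpr: upper false-positive-rate bound of the partial area
    (0.1 is an assumed value, not the official setting).
    """
    paucs = []
    for c in range(y_true.shape[1]):
        # A class ROC is undefined without both positive and negative
        # segments, so skip degenerate classes.
        if 0 < y_true[:, c].sum() < len(y_true):
            paucs.append(roc_auc_score(y_true[:, c], y_score[:, c], max_fpr=max_fpr))
    return float(np.mean(paucs))
```

Note that `roc_auc_score` with `max_fpr` returns the McClish-standardized partial AUC, rescaled so a perfect classifier scores 1.0 and a random one 0.5; an official implementation may normalize differently.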