The DCASE 2024 Task 4 challenge focuses on improving sound event detection (SED) systems by leveraging diverse training data with varying annotation granularity and missing labels. The challenge aims to develop robust SED systems that can generalize across different scenarios, even when annotations are inconsistent or missing. Participants are encouraged to use both strong and weak labels, as well as data from different domains, to enhance system performance. The challenge includes two datasets: DESED and MAESTRO. DESED contains real-world and synthetic audio clips with annotated sound events, while MAESTRO provides long-form real-world recordings with soft labels.
The challenge introduces a baseline system that incorporates a convolutional recurrent neural network (CRNN) with self-supervised features from the BEATs pre-trained model. The system uses a dual-phase hyperparameter tuning approach to optimize performance. Key improvements include the use of a multi-class median filter and a dropstep regularization strategy. The system also addresses the issue of missing labels by cross-mapping sound event classes between datasets.
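The multi-class median filter mentioned above smooths frame-level posteriors independently per class, typically with a class-specific window length. A minimal sketch of this idea (the window lengths and function name here are illustrative assumptions, not the baseline's actual configuration):

```python
import numpy as np
from scipy.ndimage import median_filter

def multiclass_median_filter(posteriors, win_lengths):
    """Smooth frame-level posteriors with a per-class median filter.

    posteriors: (n_frames, n_classes) array of frame-wise scores.
    win_lengths: odd window sizes, one per class. In practice these
    are tuned per class on a development set; short windows suit
    impulsive events, longer ones suit sustained events.
    """
    smoothed = np.empty_like(posteriors)
    for c, win in enumerate(win_lengths):
        # mode="nearest" pads with edge values so clip boundaries
        # are not artificially pulled toward zero.
        smoothed[:, c] = median_filter(posteriors[:, c], size=win, mode="nearest")
    return smoothed
```

Because the median suppresses isolated outliers, a single-frame false positive shorter than half the window is removed, while event onsets and offsets shift less than they would under mean smoothing.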
The evaluation metrics include the polyphonic sound detection score (PSDS) and segment-based mean (macro-averaged) partial area under ROC curve (segMPAUC). The challenge also emphasizes energy efficiency, requiring participants to report energy consumption during training and testing. Results indicate that using diverse training data with missing labels can lead to stronger SED systems compared to training on individual datasets. The challenge encourages the development of novel methods to handle inconsistent annotations and missing labels, aiming to advance SED research in domestic environments.
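The segMPAUC metric can be sketched as a macro average, over classes, of the partial area under each class's segment-level ROC curve. The sketch below uses scikit-learn's standardized partial AUC; the `max_fpr` value and the binarization of labels are assumptions for illustration, not necessarily the challenge's exact protocol:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def segment_macro_pauc(y_true, y_score, max_fpr=0.1):
    """Macro-averaged partial AUC-ROC over classes on segment-level labels.

    y_true: (n_segments, n_classes) binary segment labels (soft labels,
    as in MAESTRO, would first be binarized against a threshold).
    y_score: (n_segments, n_classes) predicted segment scores.
    max_fpr: upper false-positive-rate bound of the partial area
    (0.1 is an assumed value, not the official setting).
    """
    paucs = []
    for c in range(y_true.shape[1]):
        # A class ROC is undefined without both positive and negative
        # segments, so skip degenerate classes.
        if 0 < y_true[:, c].sum() < len(y_true):
            paucs.append(roc_auc_score(y_true[:, c], y_score[:, c], max_fpr=max_fpr))
    return float(np.mean(paucs))
```

Note that `roc_auc_score` with `max_fpr` returns the McClish-standardized partial AUC, rescaled so a perfect classifier scores 1.0 and a random one 0.5; an official implementation may normalize differently.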