Sound Event Bounding Boxes

6 Jun 2024 | Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux
The paper introduces Sound Event Bounding Boxes (SEBBs) as a novel output format for sound event detection (SED) that improves the accuracy and interpretability of event extent predictions. Traditional SED systems derive event boundaries by thresholding frame-level presence confidences, which couples the prediction of an event's extent to the prediction of its confidence and can lead to suboptimal performance. SEBBs decouple these aspects by representing each sound event as a tuple of class type, temporal extent, and overall confidence, so that the presence confidence is independent of the event's temporal extent.

The paper proposes a change-detection-based algorithm to convert existing frame-level outputs into SEBBs. On the DCASE 2023 Challenge, this significantly improves performance, boosting the state of the art in PSDS1 from 0.644 to 0.686 and achieving a new state-of-the-art F1 score of 0.706. The SEBB format is also shown to produce monotonically increasing ROC curves, which is crucial for reliable performance evaluation.

Three post-processing methods for converting frame-level outputs into SEBBs are presented: threshold-based, change-detection-based, and hybrid. Evaluated on DCASE 2023 Challenge Task 4a, all three significantly outperform traditional frame-level thresholding, with SEBBs providing more accurate event detection particularly in scenarios where event boundaries are critical for performance evaluation. The study highlights the importance of decoupling event extent and confidence prediction in SED systems to achieve better performance and more interpretable results.
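As an illustration of the SEBB idea, the following minimal sketch converts frame-level presence scores for one class into SEBB tuples using the simplest (threshold-based) of the three post-processing variants. This is not the paper's exact algorithm; the class name, threshold value, and frame hop below are assumed purely for illustration. The key point it demonstrates is the decoupling: the extent comes from thresholding, while each event's overall confidence (here, the maximum score within its extent) is reported separately.

```python
import numpy as np

def scores_to_sebbs(scores, event_class, extent_threshold=0.3, frame_hop=0.064):
    """Convert frame-level presence scores for one class into SEBBs.

    Illustrative threshold-based variant: frames above `extent_threshold`
    define each event's extent, while the event's overall confidence is
    taken as the maximum score within that extent, independent of how
    long the event is.

    Returns a list of (class, onset_sec, offset_sec, confidence) tuples.
    """
    scores = np.asarray(scores, dtype=float)
    active = scores > extent_threshold
    # Pad with False on both sides so edge detection also catches events
    # that start at the first frame or end at the last frame.
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    onsets, offsets = edges[::2], edges[1::2]  # rising / falling edges
    sebbs = []
    for on, off in zip(onsets, offsets):
        confidence = float(scores[on:off].max())
        sebbs.append((event_class, on * frame_hop, off * frame_hop, confidence))
    return sebbs
```

With this representation, a single operating-point threshold can then be applied to the per-event confidences, rather than to every frame, which is what yields the well-behaved ROC curves described above.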