6 Jun 2024 | Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux
This paper introduces Sound Event Bounding Boxes (SEBBs) as a new output format for sound event detection (SED) systems. Traditional SED systems derive event boundaries by thresholding frame-level presence confidence scores, which couples the prediction of an event's temporal extent with the prediction of its confidence and can degrade performance. SEBBs decouple these aspects by representing each sound event as a tuple of class type, onset/offset times, and a single overall confidence. This format allows for more accurate and interpretable event detection, as the event's presence confidence is no longer tied to its temporal extent.
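The tuple structure described above can be sketched as a simple data class. This is an illustrative sketch only; the field names are assumptions, not the authors' reference implementation.

```python
from dataclasses import dataclass

@dataclass
class SoundEventBoundingBox:
    """One detected sound event in the SEBB format described in the paper:
    class type, temporal extent, and a single overall confidence."""
    event_class: str   # sound event class type, e.g. "dog_bark"
    onset: float       # event start time in seconds
    offset: float      # event end time in seconds
    confidence: float  # overall presence confidence, decoupled from extent

# A detection's confidence no longer depends on how long the event lasts:
sebb = SoundEventBoundingBox("dog_bark", onset=1.2, offset=2.7, confidence=0.91)
```

With this representation, a final operating point can be chosen by thresholding the per-event confidence, without re-deriving the event's boundaries.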
The paper proposes a change-detection-based algorithm to convert frame-level outputs into SEBBs, which significantly improves performance on the DCASE 2023 Challenge. The algorithm outperforms existing methods, boosting the state-of-the-art performance from 0.644 to 0.686 on the PSDS1 metric. The method is also effective for other evaluation metrics, such as the F1-score, achieving a new state-of-the-art of 0.706.
The paper also presents three post-processing approaches to generate SEBBs: threshold-based, change-detection-based, and hybrid. These methods allow for more accurate event detection by considering both the presence confidence and the temporal extent of events. The results show that SEBBs significantly improve performance across a range of systems, with the best-performing system achieving a PSDS1 score of 0.703 and an F1-score of 0.734.
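To make the change-detection idea concrete, here is a minimal toy sketch of converting frame-level scores into SEBBs: boundaries are placed where the score jumps sharply, and each segment's confidence is the mean score between its boundaries. The function name, the `delta` parameter, and the frame duration are illustrative assumptions, not the paper's actual algorithm or parameter values.

```python
import numpy as np

def frames_to_sebbs(scores, frame_dur=0.02, delta=0.3):
    """Toy change-detection conversion (illustrative, not the paper's method).

    An onset is declared where the frame-level score rises by more than
    `delta` between consecutive frames, an offset where it falls by more
    than `delta`. Each resulting segment becomes one SEBB whose confidence
    is the mean score over the segment.
    """
    scores = np.asarray(scores, dtype=float)
    diff = np.diff(scores, prepend=scores[0])   # frame-to-frame score change
    onsets = np.flatnonzero(diff > delta)
    offsets = np.flatnonzero(diff < -delta)
    sebbs = []
    last_off = -1
    for on in onsets:
        if on < last_off:
            continue  # onset falls inside an already-emitted segment
        later = offsets[offsets > on]
        off = int(later[0]) if later.size else len(scores)
        sebbs.append((on * frame_dur, off * frame_dur,
                      float(scores[on:off].mean())))
        last_off = off
    return sebbs
```

Note how the segment boundaries come from score *changes* rather than from an absolute threshold, so the event's extent stays the same even if the overall confidence of the clip is scaled, which is the decoupling the SEBB format is after.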
The paper concludes that SEBBs provide a more accurate and interpretable way to represent sound events, and that the proposed change-detection-based algorithm is an effective method for converting frame-level outputs into SEBBs. The results demonstrate that SEBBs significantly improve performance on the DCASE 2023 Challenge, setting a new state-of-the-art for sound event detection.