Look Once to Hear: Target Speech Hearing with Noisy Examples

May 11–16, 2024 | Bandhav Veluri, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota
This paper introduces "Look Once to Hear," an intelligent hearable system that lets users hear a target speaker in a noisy environment after looking at them for a few seconds. The system uses the noisy binaural audio example captured during this short look to learn the speech traits of the target speaker, and then extracts the target speech from interfering speakers and background noise. It achieves a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process 8 ms audio chunks in 6.24 ms on an embedded CPU. The system generalizes to real-world static and mobile speakers in previously unseen indoor and outdoor multipath environments.
The enrollment interface for noisy examples does not cause performance degradation compared to clean examples, while being convenient and user-friendly. The system is designed to work with existing binaural hearable hardware architectures and requires only the two microphones typical of today's hearables. Evaluated in real-world scenarios, it shows improved performance compared to raw unprocessed input. The code and datasets are open-sourced to support further research in HCI and machine learning.
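
To put the reported timing in context, the short sketch below works out what "8 ms chunks processed in 6.24 ms" implies for real-time operation. Only the two timing values come from the summary; the function names and the real-time-factor framing are illustrative assumptions, not part of the authors' released code.

```python
def real_time_factor(chunk_ms: float, processing_ms: float) -> float:
    """Ratio of compute time to audio duration; below 1.0 means processing keeps up."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return processing_ms / chunk_ms


def headroom_ms(chunk_ms: float, processing_ms: float) -> float:
    """Time left over per chunk after processing finishes."""
    return chunk_ms - processing_ms


# Values quoted in the summary (embedded CPU).
CHUNK_MS = 8.0
PROCESSING_MS = 6.24

rtf = real_time_factor(CHUNK_MS, PROCESSING_MS)
slack = headroom_ms(CHUNK_MS, PROCESSING_MS)

print(f"real-time factor: {rtf:.2f}")
print(f"headroom per chunk: {slack:.2f} ms")
```

A real-time factor below 1.0 means each chunk is finished before the next one arrives, which is the condition a streaming hearable has to meet.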