16 Jan 2024 | Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka
The NOTSOFAR-1 Challenge introduces a new dataset and baseline system for distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios. The challenge focuses on single-channel and known-geometry multi-channel tracks, aiming to advance research in distant conversational speech recognition. Key contributions include:
1. **Dataset**: A benchmarking dataset of 315 meetings, each averaging 6 minutes, recorded in 30 conference rooms with 4-8 attendees and 35 unique speakers. The dataset captures a broad spectrum of real-world acoustic conditions and conversational dynamics.
2. **Simulated Training Dataset**: A 1000-hour simulated training set, synthesized with enhanced authenticity for real-world generalization and incorporating 15,000 real acoustic transfer functions (ATFs); see the simulation sketch after this list.
3. **Baseline System**: An open-source Python system comprising continuous speech separation (CSS), automatic speech recognition (ASR), and speaker diarization modules; see the pipeline sketch after this list.
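To make the synthesis mechanism in item 2 concrete, below is a minimal sketch of far-field data simulation: clean speech is convolved with a measured ATF and mixed with noise at a target signal-to-noise ratio. The function name, toy signals, and the exponential stand-in for an ATF are illustrative assumptions; the actual NOTSOFAR synthesis pipeline is considerably more elaborate.

```python
import numpy as np

def simulate_far_field(clean: np.ndarray, atf: np.ndarray,
                       noise: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Sketch of far-field simulation: convolve clean speech with an
    acoustic transfer function, then add noise at a target SNR.
    Illustration only, not the actual NOTSOFAR synthesis code."""
    reverberant = np.convolve(clean, atf)[: len(clean)]
    # Scale the noise to hit the requested SNR relative to the reverberant speech.
    speech_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise[: len(reverberant)] ** 2) + 1e-12
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + scale * noise[: len(reverberant)]

# Toy usage with random signals standing in for real recordings and ATFs.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # 1 s of "speech" at 16 kHz
atf = np.exp(-np.arange(800) / 100.0)        # decaying stand-in for a measured ATF
noise = rng.standard_normal(16000)
mix = simulate_far_field(clean, atf, noise)
```

For a multi-channel simulation, the same convolution would be repeated per microphone with that microphone's ATF, yielding one geometry-consistent signal per channel.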
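The sketch below illustrates one plausible way to orchestrate such a CSS, ASR, and diarization baseline. The module bodies are stubs and the function names (`css_separate`, `transcribe`, `diarize`) are hypothetical, not the actual baseline API; the point is the data flow from a far-field mixture to a time-ordered, speaker-attributed transcript.

```python
import numpy as np

def css_separate(mixture: np.ndarray, num_streams: int = 2) -> list[np.ndarray]:
    """CSS stub: a real module runs a neural separator over sliding
    windows and stitches overlap-free output streams; here each
    stream is simply a copy of the mixture."""
    return [mixture.copy() for _ in range(num_streams)]

def transcribe(stream: np.ndarray, sample_rate: int) -> list[tuple]:
    """ASR stub: a real module returns timestamped transcripts."""
    duration = len(stream) / sample_rate
    return [(0.0, duration, "placeholder transcript")]

def diarize(stream: np.ndarray, segments: list[tuple]) -> list[tuple]:
    """Diarization stub: a real module clusters speaker embeddings;
    here every segment is attributed to 'spk0'."""
    return [(start, end, text, "spk0") for start, end, text in segments]

def pipeline(mixture: np.ndarray, sample_rate: int = 16000) -> list[tuple]:
    """CSS -> ASR -> diarization, merged across separated streams
    into a time-ordered, speaker-attributed transcript."""
    attributed = []
    for stream in css_separate(mixture):
        segments = transcribe(stream, sample_rate)
        attributed.extend(diarize(stream, segments))
    return sorted(attributed)

if __name__ == "__main__":
    audio = np.zeros(16000 * 5, dtype=np.float32)  # 5 s of silence
    for start, end, text, spk in pipeline(audio):
        print(f"[{start:6.2f}-{end:6.2f}] {spk}: {text}")
```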
The challenge aims to address fundamental questions in distant conversational speech recognition, such as whether multi-channel and geometry-specific algorithms hold an advantage over single-channel ones. The dataset and simulated training set are designed to bridge the gap between training and testing conditions, fostering innovation and practical solutions. The baseline system provides a starting point for participants, and the challenge features two main tracks, single-channel and known-geometry multi-channel, evaluated with both speaker-attributed and speaker-agnostic metrics.
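To make the speaker-agnostic side of the evaluation concrete, here is a minimal word error rate (WER) computation via word-level edit distance. This is an illustration only; the challenge's official scoring is more involved, also handling time alignment and speaker attribution for the speaker-attributed metrics.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Speaker-agnostic WER sketch: word-level edit distance divided
    by reference length. Illustrative only, not the official scorer."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution plus one deletion against a 4-word reference: WER = 0.5.
print(word_error_rate("the meeting starts now", "a meeting starts"))
```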