EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
This paper introduces EAT, an efficient audio self-supervised learning model that improves both the effectiveness and the efficiency of audio representation learning. Inspired by data2vec 2.0 and Audio-MAE, EAT adopts a bootstrapped self-supervised training paradigm and a novel Utterance-Frame Objective (UFO) to strengthen acoustic event modeling. The model uses an inverse block multi-mask strategy, which keeps unmasked patches in contiguous block units and thereby raises the difficulty of inferring audio semantics from limited visible context. EAT achieves state-of-the-art performance on audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, with a pre-training speedup of up to 15x over existing models.
EAT's architecture combines a complex (full-capacity) Transformer encoder with a lightweight CNN decoder, enabling efficient feature decoding. The model uses a high mask ratio (80%) during pre-training, which both speeds up training and makes the masked-prediction task harder. The inverse block masking technique keeps unmasked patches in contiguous block units, so the visible patch embeddings retain larger regions of local context. This method significantly improves pre-training efficiency and performance.
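To make the masking mechanics concrete, here is a minimal sketch of inverse block masking, assuming a 2D grid of spectrogram patches; the function name, block size, and grid shape are illustrative placeholders, not EAT's actual implementation.

    import torch

    def inverse_block_mask(grid_h, grid_w, mask_ratio=0.8, block=5):
        # Sketch of inverse block masking: instead of masking contiguous
        # blocks, contiguous blocks are *kept* unmasked and everything else
        # is masked, so the visible patches form larger local regions.
        keep_target = int(round(grid_h * grid_w * (1.0 - mask_ratio)))
        keep = torch.zeros(grid_h, grid_w, dtype=torch.bool)
        while keep.sum() < keep_target:
            top = torch.randint(0, grid_h - block + 1, (1,)).item()
            left = torch.randint(0, grid_w - block + 1, (1,)).item()
            keep[top:top + block, left:left + block] = True
        return ~keep.flatten()  # True = masked patch

    mask = inverse_block_mask(grid_h=8, grid_w=64)
    print(mask.float().mean())  # roughly 0.8 of the patches are masked

Because the unmasked patches are kept in blocks, the visible context is spatially coherent yet covers only a small part of the spectrogram, which is what makes the prediction task more challenging.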
The model's pre-training follows a bootstrapping framework in which the student model is continuously updated to predict target features produced by a teacher model. The teacher is updated via an exponential moving average (EMA) of the student's weights, similar to MoCo. EAT applies masked modeling, in the spirit of masked language modeling (MLM), with an 80% masking ratio to patch embeddings derived from downsampled audio spectrograms.
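A minimal sketch of the EMA teacher update in such a bootstrapping loop is shown below; the decay value and the tiny linear model are placeholders for illustration, not EAT's configuration or schedule.

    import copy
    import torch

    def ema_update(teacher, student, decay=0.999):
        # The teacher is never trained by gradients; after each student
        # optimizer step its weights are moved toward the student's by a
        # small step, as in MoCo-style momentum encoders.
        with torch.no_grad():
            for t, s in zip(teacher.parameters(), student.parameters()):
                t.mul_(decay).add_(s, alpha=1.0 - decay)

    student = torch.nn.Linear(16, 16)        # stand-in for the Transformer encoder
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    ema_update(teacher, student)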
EAT's UFO objective combines a global utterance-level loss with a local frame-level loss, strengthening the model's ability to capture both clip-wide and fine-grained audio information. Performance is evaluated on several audio-related tasks, covering audio classification (AudioSet, ESC-50) and speech command classification (SPC-2). EAT achieves state-of-the-art results on these tasks, demonstrating strong generalization and learning efficiency in the audio domain.
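A rough sketch of how such a combined utterance-frame objective can be written is given below; the MSE criterion, the weighting term lam, and the tensor shapes are assumptions made for illustration rather than EAT's exact loss.

    import torch
    import torch.nn.functional as F

    def ufo_loss(student_frames, teacher_frames, student_utt, teacher_utt,
                 mask, lam=1.0):
        # Frame-level term: regress teacher features at masked patch positions.
        frame_loss = F.mse_loss(student_frames[mask], teacher_frames[mask])
        # Utterance-level term: match a global summary (e.g. mean-pooled or
        # CLS-style) of the whole clip.
        utt_loss = F.mse_loss(student_utt, teacher_utt)
        return frame_loss + lam * utt_loss

    B, N, D = 4, 512, 768                    # batch, patches, feature dim
    mask = torch.rand(B, N) < 0.8            # True = masked patch
    loss = ufo_loss(torch.randn(B, N, D), torch.randn(B, N, D),
                    torch.randn(B, D), torch.randn(B, D), mask)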
The model's efficiency gains stem from its high mask ratio and lightweight CNN decoder. EAT pre-trains significantly faster than previous models, reducing total pre-training time by a factor of 15.65 relative to BEATs and 10.02 relative to Audio-MAE. Performance is further enhanced by its multi-mask strategy, which creates multiple independently masked clones of the same spectrogram's patch embeddings and processes them in parallel, amplifying data utilization.
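The multi-mask idea can be sketched as follows; the clone count and the independent uniform mask sampling are simplified placeholders (EAT combines cloning with the inverse block masks described above).

    import torch

    def multi_mask_clones(patch_emb, num_clones=16, mask_ratio=0.8):
        # Each spectrogram's patch embeddings are cloned several times, each
        # clone gets its own random mask, and all clones are stacked along
        # the batch dimension for a single parallel forward pass.
        B, N, D = patch_emb.shape
        clones = patch_emb.repeat_interleave(num_clones, dim=0)  # (B*num_clones, N, D)
        masks = torch.rand(B * num_clones, N) < mask_ratio
        return clones, masks

    emb = torch.randn(2, 512, 768)           # 2 spectrograms, 512 patches each
    clones, masks = multi_mask_clones(emb)
    print(clones.shape, masks.shape)         # (32, 512, 768) and (32, 512)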
EAT's contributions include the introduction of the UFO objective, the adoption of the inverse block multi-mask method inspired by data2vec 2.0, and state-of-the-art results on several popular audio-related datasets. The code and pre-trained models are open-sourced to support further work in the community.