EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
This paper introduces EAT, an efficient audio self-supervised learning model that improves both the effectiveness and the efficiency of audio representation learning. Inspired by data2vec 2.0 and Audio-MAE, EAT adopts a bootstrapped self-supervised training paradigm and a novel Utterance-Frame Objective (UFO) to strengthen acoustic event modeling. The model uses an inverse block multi-mask strategy, which keeps unmasked patches in contiguous block units and thereby raises the difficulty of inferring audio semantics from limited visible context. EAT achieves state-of-the-art performance on audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, with a pre-training speedup of up to 15x over existing models.
EAT's architecture combines a complex (full-capacity) Transformer encoder with a lightweight CNN decoder, enabling efficient feature decoding. The model uses a high mask ratio (80%) during pre-training, which both speeds up training and makes the masked-prediction task harder. The inverse block masking technique keeps unmasked patches in contiguous block units, so the visible patch embeddings retain larger regions of local context. This method significantly improves pre-training efficiency and performance.
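To make the masking mechanics concrete, here is a minimal sketch of inverse block masking, assuming a 2D grid of spectrogram patches; the function name, block size, and grid shape are illustrative placeholders, not EAT's actual implementation.

    import torch

    def inverse_block_mask(grid_h, grid_w, mask_ratio=0.8, block=5):
        # Sketch of inverse block masking: instead of masking contiguous
        # blocks, contiguous blocks are *kept* unmasked and everything else
        # is masked, so the visible patches form larger local regions.
        keep_target = int(round(grid_h * grid_w * (1.0 - mask_ratio)))
        keep = torch.zeros(grid_h, grid_w, dtype=torch.bool)
        while keep.sum() < keep_target:
            top = torch.randint(0, grid_h - block + 1, (1,)).item()
            left = torch.randint(0, grid_w - block + 1, (1,)).item()
            keep[top:top + block, left:left + block] = True
        return ~keep.flatten()  # True = masked patch

    mask = inverse_block_mask(grid_h=8, grid_w=64)
    print(mask.float().mean())  # roughly 0.8 of the patches are masked

Because the unmasked patches are kept in blocks, the visible context is spatially coherent yet covers only a small part of the spectrogram, which is what makes the prediction task more challenging.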
The model's pre-training follows a bootstrapping framework in which the student model is continuously updated to predict target features produced by a teacher model. The teacher is updated via an exponential moving average (EMA) of the student's weights, similar to MoCo. EAT applies masked modeling, in the spirit of masked language modeling (MLM), with an 80% masking ratio to patch embeddings derived from downsampled audio spectrograms.
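A minimal sketch of the EMA teacher update in such a bootstrapping loop is shown below; the decay value and the tiny linear model are placeholders for illustration, not EAT's configuration or schedule.

    import copy
    import torch

    def ema_update(teacher, student, decay=0.999):
        # The teacher is never trained by gradients; after each student
        # optimizer step its weights are moved toward the student's by a
        # small step, as in MoCo-style momentum encoders.
        with torch.no_grad():
            for t, s in zip(teacher.parameters(), student.parameters()):
                t.mul_(decay).add_(s, alpha=1.0 - decay)

    student = torch.nn.Linear(16, 16)        # stand-in for the Transformer encoder
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    ema_update(teacher, student)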
EAT's UFO objective combines a global utterance-level loss with a local frame-level loss, strengthening the model's ability to capture both clip-wide and fine-grained audio information. Performance is evaluated on several audio-related tasks, covering audio classification (AudioSet, ESC-50) and speech command classification (SPC-2). EAT achieves state-of-the-art results on these tasks, demonstrating strong generalization and learning efficiency in the audio domain.
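A rough sketch of how such a combined utterance-frame objective can be written is given below; the MSE criterion, the weighting term lam, and the tensor shapes are assumptions made for illustration rather than EAT's exact loss.

    import torch
    import torch.nn.functional as F

    def ufo_loss(student_frames, teacher_frames, student_utt, teacher_utt,
                 mask, lam=1.0):
        # Frame-level term: regress teacher features at masked patch positions.
        frame_loss = F.mse_loss(student_frames[mask], teacher_frames[mask])
        # Utterance-level term: match a global summary (e.g. mean-pooled or
        # CLS-style) of the whole clip.
        utt_loss = F.mse_loss(student_utt, teacher_utt)
        return frame_loss + lam * utt_loss

    B, N, D = 4, 512, 768                    # batch, patches, feature dim
    mask = torch.rand(B, N) < 0.8            # True = masked patch
    loss = ufo_loss(torch.randn(B, N, D), torch.randn(B, N, D),
                    torch.randn(B, D), torch.randn(B, D), mask)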
The model's efficiency gains stem from its high mask ratio and lightweight CNN decoder. EAT pre-trains significantly faster than previous models, reducing total pre-training time by a factor of 15.65 relative to BEATs and 10.02 relative to Audio-MAE. Performance is further enhanced by its multi-mask strategy, which creates multiple independently masked clones of the same spectrogram's patch embeddings and processes them in parallel, amplifying data utilization.
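The multi-mask idea can be sketched as follows; the clone count and the independent uniform mask sampling are simplified placeholders (EAT combines cloning with the inverse block masks described above).

    import torch

    def multi_mask_clones(patch_emb, num_clones=16, mask_ratio=0.8):
        # Each spectrogram's patch embeddings are cloned several times, each
        # clone gets its own random mask, and all clones are stacked along
        # the batch dimension for a single parallel forward pass.
        B, N, D = patch_emb.shape
        clones = patch_emb.repeat_interleave(num_clones, dim=0)  # (B*num_clones, N, D)
        masks = torch.rand(B * num_clones, N) < mask_ratio
        return clones, masks

    emb = torch.randn(2, 512, 768)           # 2 spectrograms, 512 patches each
    clones, masks = multi_mask_clones(emb)
    print(clones.shape, masks.shape)         # (32, 512, 768) and (32, 512)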
EAT's contributions include the introduction of the UFO objective, the adoption of the inverse block multi-mask method inspired by data2vec 2.0, and state-of-the-art results on several popular audio-related datasets. The code and pre-trained models are open-sourced to support further work in the community.