MASKED AUDIO GENERATION USING A SINGLE NON-AUTOREGRESSIVE TRANSFORMER

5 Mar 2024 | Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi
**MAGNeT (Masked Audio Generation using Non-autoregressive Transformers)** is a novel method for generating audio sequences from text inputs. Unlike previous autoregressive models, MAGNeT uses a single-stage, non-autoregressive transformer to predict spans of masked tokens during training and gradually construct the output sequence during inference. To enhance the quality of generated audio, MAGNeT introduces a rescoring method that leverages an external pre-trained model to rank predictions. Additionally, a hybrid version of MAGNeT combines autoregressive and non-autoregressive models to generate the initial part of the sequence autoregressively while decoding the rest in parallel.

The method is evaluated on text-to-music and text-to-audio generation tasks, showing comparable performance to autoregressive baselines while being significantly faster (7 times faster than the autoregressive baseline). The paper also includes extensive ablation studies and analysis of the trade-offs between autoregressive and non-autoregressive modeling in terms of latency, throughput, and generation quality.
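The iterative decoding loop described above (start fully masked, predict all masked positions, commit the most confident, re-mask the rest) can be sketched minimally as follows. This is a hedged, token-wise illustration of the general masked-generation scheme, not MAGNeT's actual implementation: `predict` is a hypothetical stand-in for the transformer (returning a token and a confidence per position), and MAGNeT additionally masks contiguous *spans* rather than individual tokens and applies external-model rescoring, both omitted here.

```python
import math

MASK = -1  # sentinel for a masked position

def iterative_masked_decode(predict, seq_len, n_steps=10):
    """Sketch of non-autoregressive iterative decoding.

    Each step, `predict` (a hypothetical model call) proposes a
    (token, confidence) pair for every position. We commit the most
    confident proposals and re-mask the rest, shrinking the masked set
    with a cosine schedule until nothing is masked.
    """
    tokens = [MASK] * seq_len
    for step in range(n_steps):
        # Fraction of the sequence that should remain masked after this step;
        # reaches 0 at the final step, so the output is fully decoded.
        ratio = math.cos(math.pi / 2 * (step + 1) / n_steps)
        n_keep_masked = int(ratio * seq_len)

        proposals = predict(tokens)  # list of (token, confidence), one per position
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Least confident positions first: these stay masked.
        masked.sort(key=lambda i: proposals[i][1])
        for rank, i in enumerate(masked):
            if rank >= n_keep_masked:
                tokens[i] = proposals[i][0]
    return tokens
```

With `n_steps` forward passes instead of one pass per token, this loop is what gives the non-autoregressive approach its latency advantage over autoregressive decoding for long audio token sequences.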