2024 | Alon Ziv¹,³, Itai Gat¹, Gael Le Lan¹, Tal Remez¹, Felix Kreuk¹, Alexandre Défossez², Jade Copet⁴, Gabriel Synnaeve¹, Yossi Adi¹,³
MAGNET is a masked generative sequence modeling method for audio. Unlike previous approaches, it operates directly on multiple streams of audio tokens using a single-stage, non-autoregressive transformer. During training, it predicts spans of masked tokens sampled by a masking scheduler; during inference, it builds the output sequence gradually over several decoding steps. To further improve audio quality, MAGNET introduces a novel rescoring method in which an external pre-trained model rescores and ranks candidate predictions. A hybrid variant combines autoregressive and non-autoregressive modeling: the first few seconds are generated autoregressively, while the remainder of the sequence is decoded in parallel. Evaluated on text-to-music and text-to-audio generation, MAGNET matches the quality of autoregressive baselines while being roughly seven times faster, generates long sequences (30 seconds) with a single model, and is flexible enough for real-time audio generation with minimal quality degradation. Ablation studies and analysis highlight the importance of each component and the trade-offs between autoregressive and non-autoregressive modeling in terms of latency, throughput, and generation quality. Samples are available on the demo page.
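The gradual, multi-step decoding described above can be sketched as an iterative mask-predict loop: start from a fully masked sequence, commit the most confident predictions at each step, and re-mask the rest according to a schedule. A minimal sketch follows; the `model(tokens, cond)` interface, the cosine schedule, and all names are illustrative assumptions, not the paper's actual API.

```python
import math
import numpy as np

def masked_decode(model, cond, seq_len, vocab_size, n_steps=10, mask_id=-1):
    """Iterative masked decoding sketch (mask-predict style).

    Assumed interface (illustrative, not the paper's actual API):
    `model(tokens, cond)` returns logits of shape (seq_len, vocab_size).
    """
    tokens = np.full(seq_len, mask_id, dtype=np.int64)
    for step in range(n_steps):
        logits = model(tokens, cond)
        # Per-position softmax confidences and greedy predictions.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        # Tokens committed in earlier steps are never re-masked.
        conf[tokens != mask_id] = np.inf
        tokens = np.where(tokens == mask_id, pred, tokens)
        # Assumed cosine schedule: fraction still masked after this step.
        n_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / n_steps))
        if n_mask > 0:
            # Re-mask the least confident positions for the next pass.
            tokens[np.argsort(conf)[:n_mask]] = mask_id
    return tokens
```

By the final step the schedule reaches zero, so every position holds a committed token; fewer steps trade quality for latency, which is the knob behind the reported speedups.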
The method is evaluated using objective metrics and human studies, showing its potential for real-time audio generation.
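The rescoring step mentioned in the abstract amounts to generating several candidate sequences and keeping the ones an external pre-trained model prefers. A minimal sketch under assumed names: `score_fn` stands in for the external model's scoring (e.g. average log-likelihood, higher is better) and is not the paper's actual API.

```python
import numpy as np

def rescore(candidates, score_fn, top_k=1):
    """Rank candidate token sequences with an external scorer.

    `score_fn(seq)` is an assumed callable returning a scalar score
    for one candidate sequence; higher means better.
    """
    scores = np.array([score_fn(seq) for seq in candidates])
    # Best-first ordering; keep the top_k candidates.
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order[:top_k]]
```

Because scoring is a single forward pass per candidate, this adds little latency relative to the decoding loop itself.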