16 Jun 2024 | Or Tal, Alon Ziv, Itai Gat, Felix Kreuk, Yossi Adi
JASCO is a temporally controlled text-to-music generation model that integrates both symbolic and audio-based conditions. It combines the Flow Matching modeling paradigm with a novel conditioning method to generate high-quality music samples conditioned on a global text description and fine-grained local controls: chord progressions, melodies, and drum tracks. Information bottleneck layers and temporal blurring extract only the control-relevant information from each local conditioning signal.

The conditioning signals themselves are obtained with pre-trained models, via source separation, F0 saliency detection, and chord progression extraction, so training requires neither studio-quality recordings nor supervised datasets.

JASCO is evaluated with objective metrics and human studies across melody, chord, audio, and drum conditions. It achieves generation quality comparable to the evaluated baselines while offering significantly better and more versatile controls, with improved melody adherence, chord-progression accuracy, and rhythmic-pattern similarity. In the human studies, JASCO matches MusicGen's generation quality while showing better text alignment and tighter control over the generated music. Overall, the approach supports a wide range of controls, spanning text, melody, rhythm, and musical structure, and produces high-fidelity samples that align with the given controls.

Compared with diffusion-based alternatives, the Flow Matching formulation achieves superior FAD, KL, and CLAP scores, and a proposed loss-weighting modification further improves generation quality. Extensive experiments and analysis demonstrate the model's effectiveness at generating high-quality, temporally controlled music samples.
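To make the Flow Matching objective concrete, here is a minimal training-step sketch assuming a common linear-path (rectified-flow-style) formulation over continuous latent sequences. The names `model`, `x1`, and `cond` are illustrative, not JASCO's actual API, and the paper's loss-weighting modification is omitted (uniform weighting is used below).

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, cond):
    """One training step: regress the velocity field along a straight noise-to-data path.

    x1:   (B, T, D) clean latent sequence (e.g. compression-model latents)
    cond: conditioning inputs (text embedding plus temporal controls)
    """
    b = x1.size(0)
    x0 = torch.randn_like(x1)                  # Gaussian noise endpoint
    t = torch.rand(b, 1, 1, device=x1.device)  # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1               # point on the linear path
    v_target = x1 - x0                         # target velocity along that path
    v_pred = model(xt, t.view(b), cond)        # model predicts the velocity at (xt, t)
    return F.mse_loss(v_pred, v_target)        # uniform weighting; JASCO modifies this
```

At inference time, a sample is produced by integrating the learned velocity field from noise at t=0 to data at t=1 with an ODE solver.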
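The information bottleneck and temporal blurring ideas can be sketched as follows. Dimensions, module names, and the exact pooling scheme are assumptions for illustration, not the paper's implementation: the bottleneck squeezes a conditioning signal through a low-dimensional projection so only coarse, control-relevant information survives, and temporal blurring average-pools an audio-derived condition over coarse windows so fine detail is removed while local structure (e.g. rhythm) remains.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckCondition(nn.Module):
    """Low-dimensional linear bottleneck applied to a conditioning stream."""
    def __init__(self, in_dim: int, bottleneck_dim: int, model_dim: int):
        super().__init__()
        self.down = nn.Linear(in_dim, bottleneck_dim)   # squeeze: keep only coarse info
        self.up = nn.Linear(bottleneck_dim, model_dim)  # re-expand to model width

    def forward(self, c: torch.Tensor) -> torch.Tensor:  # c: (B, T, in_dim)
        return self.up(self.down(c))

def temporal_blur(c: torch.Tensor, win: int) -> torch.Tensor:
    """Average-pool a (B, T, D) condition over windows of `win` frames and
    broadcast each pooled value back, removing detail finer than the window."""
    b, t, d = c.shape
    pad = (-t) % win
    c = F.pad(c, (0, 0, 0, pad))                          # pad time axis to a multiple of win
    c = c.view(b, -1, win, d).mean(dim=2, keepdim=True)   # pool each window
    c = c.expand(-1, -1, win, -1).reshape(b, -1, d)       # repeat pooled value over the window
    return c[:, :t]
```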
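For the symbolic conditions, one plausible way to feed an extracted chord progression into the model is to rasterize it into a frame-level one-hot tensor aligned with the latent sequence. The frame rate and the 25-symbol chord vocabulary (12 roots in major and minor plus a "no chord" symbol) below are assumptions for illustration.

```python
import torch

def chords_to_frames(chords, num_frames, frame_rate=50.0, vocab=25):
    """Rasterize (start_time_seconds, chord_index) pairs, sorted by start time,
    into a (num_frames, vocab) one-hot conditioning tensor."""
    out = torch.zeros(num_frames, vocab)
    for i, (start, idx) in enumerate(chords):
        a = int(round(start * frame_rate))
        b = num_frames if i + 1 == len(chords) else int(round(chords[i + 1][0] * frame_rate))
        out[a:min(b, num_frames), idx] = 1.0  # hold each chord until the next change
    return out
```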
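Of the reported metrics, the CLAP score is the easiest to sketch: it measures text alignment as the cosine similarity between generated-audio and text-description embeddings in a joint text-audio space. The embeddings below are assumed to come from a pre-trained CLAP-style model; no specific library API is implied.

```python
import torch
import torch.nn.functional as F

def clap_style_score(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """audio_emb, text_emb: (N, D) embeddings from a shared text-audio space.
    Returns the mean cosine similarity over the N audio/description pairs."""
    a = F.normalize(audio_emb, dim=-1)  # unit-normalize each audio embedding
    t = F.normalize(text_emb, dim=-1)   # unit-normalize each text embedding
    return (a * t).sum(dim=-1).mean()   # cosine similarity per pair, averaged
```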