16 Jun 2024 | Or Tal*¹,², Alon Ziv*¹, Itai Gat², Felix Kreuk², Yossi Adi¹,²
JASCO is a novel text-to-music generation model that integrates both symbolic and audio-based conditions to produce high-quality music samples. The model is based on the Flow Matching paradigm and introduces a novel conditioning method that allows for both global textual descriptions and fine-grained local controls, such as chords, melodies, and audio prompts. JASCO uses information bottleneck layers and temporal blurring to extract relevant information from symbolic and audio conditions, enabling the model to generate music that aligns with specific controls while maintaining high-quality audio and text adherence. Experimental results show that JASCO performs comparably to existing baselines in terms of generation quality while offering significantly better control over the generated music. The model is evaluated using both objective metrics and human studies, demonstrating its effectiveness in generating music that meets the specified conditions.
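To make the temporal-blurring idea concrete, here is a minimal sketch of one plausible realization: average-pool a condition's feature sequence over non-overlapping windows along the time axis, then broadcast the coarse values back to the original resolution, so fine temporal detail is discarded while coarse structure survives. The function name, window size, and non-overlapping windowing scheme are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def temporal_blur(x: np.ndarray, window: int = 8) -> np.ndarray:
    """Illustrative temporal blurring (assumed scheme, not JASCO's exact code).

    x: condition features of shape (batch, channels, time).
    Averages each non-overlapping window of `window` frames, then repeats
    the averaged value across the window, acting as a temporal information
    bottleneck on the conditioning signal.
    """
    b, c, t = x.shape
    assert t % window == 0, "time axis must be divisible by the window size"
    # (b, c, t) -> (b, c, t // window, window) -> mean over each window
    pooled = x.reshape(b, c, t // window, window).mean(axis=-1)
    # broadcast the coarse summary back to the original temporal length
    return np.repeat(pooled, window, axis=-1)
```

With `window=4`, a length-16 feature sequence is reduced to 4 window averages, each held constant over its 4 frames, which is the blurred control the generator would condition on in this sketch.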