Bass Accompaniment Generation Via Latent Diffusion

This paper presents a controllable system for generating basslines that accompany arbitrary input music tracks. The system uses audio autoencoders to compress audio waveforms into invertible latent representations, and a conditional latent diffusion model to generate the corresponding bass stems. The diffusion model is trained on pairs of mixes and matching bass stems, so the generated basslines follow the input mix in style and rhythm. To give control over timbre, the system introduces a style grounding technique that grounds the latent space to a user-provided reference style during diffusion sampling. It also adapts classifier-free guidance to avoid distortions at high guidance strengths when generating in an unbounded latent space. The model is trained on a dataset of 20,000 songs with available stems, with 1,500 tracks used as a test set. Evaluation uses a contrastive model that assigns high scores to matching (mix, bass stem) pairs and low scores to non-matching ones. Quantitative experiments show that the generated basslines musically match the input mix and can be grounded to user-specified timbres. One limitation is that the system does not offer user control over the exact notes of the generated accompaniment. Future work involves training the model to generate instruments other than bass. The work is supported by UKRI [grant EP/S022694/1].
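
For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of the standard (unadapted) combination rule, not the adaptation proposed in the paper, whose details are not given in this summary. The function name `classifier_free_guidance`, the guidance weight `w`, and the toy prediction vectors are all assumptions made for illustration. The sketch shows that the guided prediction moves further from the unconditional one as `w` grows, which is the kind of behaviour that may lead to distortions when the latent space is unbounded.

```python
from typing import List


def classifier_free_guidance(cond: List[float], uncond: List[float], w: float) -> List[float]:
    """Standard classifier-free guidance (illustrative, hypothetical helper).

    cond   : model prediction given the conditioning (here, the input mix)
    uncond : model prediction with the conditioning dropped
    w      : guidance strength; w = 0 gives uncond, w = 1 gives cond,
             w > 1 extrapolates beyond the conditional prediction
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    # Extrapolate from the unconditional prediction toward the conditional one.
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


# Toy values, chosen only to make the arithmetic easy to follow.
cond = [0.5, -1.0, 2.0]
uncond = [0.0, -0.5, 1.0]

for w in (0.0, 1.0, 4.0):
    print(w, classifier_free_guidance(cond, uncond, w))
```

With these toy values, `w = 4.0` yields `[2.0, -2.5, 5.0]`, noticeably larger in magnitude than either input prediction.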

2 Feb 2024 | Marco Pasini 1,2, Maarten Grachten 1, Stefan Lattner 1