REVISITING DEEP AUDIO-TEXT RETRIEVAL THROUGH THE LENS OF TRANSPORTATION

16 May 2024 | Manh Luong, Khai Nguyen, Nhat Ho, Dinh Phung, Gholamreza Haffari, Lizhen Qu
The paper introduces the mini-batch Learning-to-Match (m-LTM) framework for audio-text retrieval, which addresses the scalability limits of the conventional Learning-to-Match (LTM) framework. m-LTM combines mini-batch subsampling with a Mahalanobis-enhanced ground metric to learn a rich and expressive joint embedding space for the audio and text modalities. To handle misaligned training data, the authors also propose a variant of m-LTM based on partial optimal transport (POT). Extensive experiments on three datasets (AudioCaps, Clotho, and ESC-50) show that the proposed method achieves state-of-the-art performance on audio-text retrieval tasks. m-LTM is also more noise-tolerant than triplet and contrastive losses across varying noise ratios in the training data. The code for the m-LTM framework is available at https://github.com/v-manhl3/m-LTM-Audio-Text-Retrieve.
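To make the idea of matching a mini-batch under a Mahalanobis ground metric more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the entropic regularization with Sinkhorn iterations, the uniform marginals, the random embeddings, and every name in it (`mahalanobis_cost`, `sinkhorn_plan`, `eps`, `n_iters`) are assumptions made for illustration. It only shows how a transport plan between one mini-batch of audio embeddings and one mini-batch of text embeddings could be computed.

```python
import numpy as np


def mahalanobis_cost(audio, text, M):
    """Pairwise Mahalanobis distances between two embedding batches.

    audio: (n, d), text: (m, d), M: (d, d) positive semi-definite matrix.
    """
    diff = audio[:, None, :] - text[None, :, :]          # (n, m, d)
    sq = np.einsum("ijd,de,ije->ij", diff, M, diff)      # diff^T M diff
    return np.sqrt(np.maximum(sq, 0.0))                  # guard tiny negatives


def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed setup)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform mass on audio items
    b = np.full(m, 1.0 / m)      # uniform mass on text items
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows
        v = b / (K.T @ u)        # rescale columns
    return u[:, None] * K * v[None, :]


# Hypothetical mini-batch: text embeddings are noisy copies of audio embeddings.
rng = np.random.default_rng(0)
batch, dim = 8, 4
audio = rng.normal(size=(batch, dim))
text = audio + 0.05 * rng.normal(size=(batch, dim))

# In a learned setting M would be a trainable parameter; here it is a fixed
# random positive definite matrix used only for illustration.
L = rng.normal(size=(dim, dim))
M = L @ L.T + np.eye(dim)

cost = mahalanobis_cost(audio, text, M)
plan = sinkhorn_plan(cost)       # (batch, batch) soft matching
```

Each entry `plan[i, j]` can be read as the amount of mass matched between audio item `i` and text item `j` within the batch. A partial optimal transport variant would relax the requirement that all mass be transported, which is the property the summary associates with handling misaligned pairs; that relaxation is not shown here.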