Understanding Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation

This paper proposes a mini-batch learning-to-match (m-LTM) framework for audio-text retrieval that addresses two challenges in cross-modal matching: scalability and misaligned training data. The framework combines mini-batch subsampling with a Mahalanobis-enhanced ground metric to learn a rich and expressive joint embedding space for the audio and text modalities, and it incorporates Partial Optimal Transport (POT) to mitigate the negative impact of misaligned training pairs. The framework is evaluated on three datasets: AudioCaps, Clotho, and ESC-50. Results show that m-LTM achieves state-of-the-art performance in audio-text retrieval, outperforming triplet and contrastive loss approaches. It also demonstrates better noise tolerance than contrastive loss under varying noise ratios on the AudioCaps dataset. The learned embedding space bridges the modality gap between audio and text embeddings, which enhances transferability to downstream tasks and enables effective cross-modal matching and zero-shot sound event detection. Extensive experiments, including ablation studies, show that the flexible ground metric is effective and that the method is robust to noisy correspondence in the training data.
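To make the ingredients named above more concrete, the following is a minimal NumPy sketch of one mini-batch matching step, not the authors' implementation. It assumes a Mahalanobis ground metric parameterized as M = LᵀL, a balanced entropic optimal transport plan computed with log-domain Sinkhorn iterations, and a loss equal to the KL divergence between the observed pairing (audio i matched with text i) and that plan. The function names (`mahalanobis_cost`, `sinkhorn_plan`, `matching_loss`), the matrix `L`, and the values of `eps` and `n_iters` are illustrative assumptions. The partial-transport variant used for noisy pairs is not covered.

```python
import numpy as np


def mahalanobis_cost(audio, text, L):
    """Pairwise cost (a - t)^T M (a - t) with M = L^T L (assumed parameterization)."""
    a = audio @ L.T  # projected audio embeddings, shape (n, k)
    t = text @ L.T   # projected text embeddings, shape (m, k)
    diff = a[:, None, :] - t[None, :, :]
    return np.sum(diff ** 2, axis=-1)  # shape (n, m)


def _logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    z_max = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(z_max, axis=axis) + np.log(np.sum(np.exp(z - z_max), axis=axis))


def sinkhorn_plan(cost, eps=1.0, n_iters=200):
    """Entropic OT plan between uniform marginals, via log-domain Sinkhorn."""
    n, m = cost.shape
    log_a = np.full(n, -np.log(n))  # uniform mass on the audio items
    log_b = np.full(m, -np.log(m))  # uniform mass on the text items
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):
        # Alternately rescale the dual potentials to match row and column marginals.
        f = eps * (log_a - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - cost) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / eps)


def matching_loss(plan):
    """KL divergence from the observed pairing (mass 1/n on the diagonal) to the plan."""
    n = plan.shape[0]
    return float(np.mean(np.log(1.0 / n) - np.log(np.diag(plan))))


# Toy mini-batch: each text embedding is a slightly perturbed copy of its audio embedding.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 16))
text = audio + 0.01 * rng.normal(size=(6, 16))
L = np.eye(16)  # identity L reduces the metric to squared Euclidean distance

plan = sinkhorn_plan(mahalanobis_cost(audio, text, L), eps=1.0)
print("matching loss:", matching_loss(plan))
```

In a trainable version, the encoders and `L` would be updated by gradient descent on this loss using an automatic-differentiation framework. The sketch only evaluates the loss for fixed embeddings.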

Revisiting Deep Audio-Text Retrieval through the Lens of Transportation

2024 | Manh Luong¹, Khai Nguyen², Nhat Ho², Dinh Phung¹, Gholamreza Haffari¹, Lizhen Qu¹