19 Mar 2024 | Yuanhuiyi Lyu, Xu Zheng, Jizhou Zhou, Lin Wang
UniBind is a multi-modal learning approach that learns a unified and balanced representation space for seven modalities: image, text, audio, point cloud, thermal, video, and event data. Unlike existing methods that treat image as the central modality, UniBind makes the alignment centers modality-agnostic, leveraging large language models (LLMs) and multi-modal LLMs to balance the space across modalities.

The method proceeds in three steps: constructing a knowledge base of text embeddings from descriptions generated by LLMs and multi-modal LLMs, building LLM-augmented class-wise embedding centers from that knowledge base, and aligning the embeddings of every modality to these centers via contrastive learning. Because the centers are built from semantically rich descriptions rather than a single class name, they are more reliable alignment targets and yield more distinct category boundaries in the representation space.

UniBind delivers strong zero-shot recognition gains and, in the multi-modal fine-tuning setting, achieves a 6.75% improvement on ImageNet while reducing the learnable parameters by 90%. It is compatible with CLIP-style multi-modal learning models, provides consistent boosts across benchmarks, outperforms existing methods on cross-modal retrieval, and sets new state-of-the-art results on multiple modalities, making it a flexible and efficient choice for a wide range of multi-modal learning applications.
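To make the pipeline concrete, below is a minimal PyTorch sketch (not the authors' code) of the two core ideas: pooling LLM-generated description embeddings into class-wise centers, and an InfoNCE-style contrastive loss that pulls any modality's embeddings toward their class center. The function names, the simple mean pooling, and the exact loss form are illustrative assumptions; the paper's actual center construction and training objective may differ in detail.

```python
import torch
import torch.nn.functional as F

def build_class_centers(description_embeddings):
    """Pool pre-computed LLM/multi-modal-LLM description embeddings of each
    class into a single L2-normalized class-wise embedding center.

    description_embeddings: dict mapping class name -> tensor [n_desc, dim]
    returns: tensor [num_classes, dim] of class-wise centers
    """
    centers = []
    for _, emb in description_embeddings.items():
        emb = F.normalize(emb, dim=-1)      # normalize each description embedding
        centers.append(emb.mean(dim=0))     # mean-pool descriptions into one center
    return F.normalize(torch.stack(centers), dim=-1)

def center_contrastive_loss(modality_emb, labels, centers, temperature=0.07):
    """InfoNCE-style loss: pull each modality embedding toward its class
    center and push it away from the centers of all other classes.

    modality_emb: tensor [batch, dim] from any modality encoder (image, audio, ...)
    labels:       tensor [batch] of class indices into `centers`
    centers:      tensor [num_classes, dim] from build_class_centers
    """
    modality_emb = F.normalize(modality_emb, dim=-1)
    logits = modality_emb @ centers.t() / temperature  # similarity to every center
    return F.cross_entropy(logits, labels)

# Toy usage: 3 classes with a handful of (random, placeholder) description
# embeddings each, and a batch of 4 modality embeddings to align.
dim = 512
knowledge_base = {c: torch.randn(5, dim) for c in ["cat", "dog", "car"]}
centers = build_class_centers(knowledge_base)
features = torch.randn(4, dim, requires_grad=True)
labels = torch.tensor([0, 2, 1, 0])
loss = center_contrastive_loss(features, labels, centers)
loss.backward()
```

Under this formulation, zero-shot recognition reduces to assigning a sample to its nearest class center, which is why richer, more distinct centers translate directly into sharper category boundaries.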