19 Mar 2024 | Yuanhuiyi Lyu, Xu Zheng, Jizhou Zhou, Lin Wang
UniBind is a multi-modal learning approach that learns a unified and balanced representation space for seven modalities: image, text, audio, point cloud, thermal, video, and event data. Unlike existing methods that treat image as the central modality, UniBind makes the alignment centers modality-agnostic, leveraging large language models (LLMs) and multi-modal LLMs to balance the space across modalities.

The method proceeds in three steps: constructing a knowledge base of text embeddings from descriptions generated by LLMs and multi-modal LLMs, building LLM-augmented class-wise embedding centers from that knowledge base, and aligning the embeddings of every modality to these centers via contrastive learning. Because the centers are built from semantically rich descriptions rather than a single class name, they are more reliable alignment targets and yield more distinct category boundaries in the representation space.

UniBind delivers strong zero-shot recognition gains and, in the multi-modal fine-tuning setting, achieves a 6.75% improvement on ImageNet while reducing the learnable parameters by 90%. It is compatible with CLIP-style multi-modal learning models, provides consistent boosts across benchmarks, outperforms existing methods on cross-modal retrieval, and sets new state-of-the-art results on multiple modalities, making it a flexible and efficient choice for a wide range of multi-modal learning applications.
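To make the pipeline concrete, below is a minimal PyTorch sketch (not the authors' code) of the two core ideas: pooling LLM-generated description embeddings into class-wise centers, and an InfoNCE-style contrastive loss that pulls any modality's embeddings toward their class center. The function names, the simple mean pooling, and the exact loss form are illustrative assumptions; the paper's actual center construction and training objective may differ in detail.

```python
import torch
import torch.nn.functional as F

def build_class_centers(description_embeddings):
    """Pool pre-computed LLM/multi-modal-LLM description embeddings of each
    class into a single L2-normalized class-wise embedding center.

    description_embeddings: dict mapping class name -> tensor [n_desc, dim]
    returns: tensor [num_classes, dim] of class-wise centers
    """
    centers = []
    for _, emb in description_embeddings.items():
        emb = F.normalize(emb, dim=-1)      # normalize each description embedding
        centers.append(emb.mean(dim=0))     # mean-pool descriptions into one center
    return F.normalize(torch.stack(centers), dim=-1)

def center_contrastive_loss(modality_emb, labels, centers, temperature=0.07):
    """InfoNCE-style loss: pull each modality embedding toward its class
    center and push it away from the centers of all other classes.

    modality_emb: tensor [batch, dim] from any modality encoder (image, audio, ...)
    labels:       tensor [batch] of class indices into `centers`
    centers:      tensor [num_classes, dim] from build_class_centers
    """
    modality_emb = F.normalize(modality_emb, dim=-1)
    logits = modality_emb @ centers.t() / temperature  # similarity to every center
    return F.cross_entropy(logits, labels)

# Toy usage: 3 classes with a handful of (random, placeholder) description
# embeddings each, and a batch of 4 modality embeddings to align.
dim = 512
knowledge_base = {c: torch.randn(5, dim) for c in ["cat", "dog", "car"]}
centers = build_class_centers(knowledge_base)
features = torch.randn(4, dim, requires_grad=True)
labels = torch.tensor([0, 2, 1, 0])
loss = center_contrastive_loss(features, labels, centers)
loss.backward()
```

Under this formulation, zero-shot recognition reduces to assigning a sample to its nearest class center, which is why richer, more distinct centers translate directly into sharper category boundaries.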