X-VILA: Cross-Modality Alignment for Large Language Model

29 May 2024 | Hanrong Ye¹², De-An Huang¹, Yao Lu¹, Zhiding Yu¹, Wei Ping¹, Andrew Tao¹, Jan Kautz¹, Song Han¹³, Dan Xu¹, Pavlo Molchanov¹, Hongxu Yin¹
X-VILA is a cross-modality foundation model that extends large language models (LLMs) to image, video, and audio modalities. It enables cross-modality understanding, reasoning, and generation by aligning modality-specific encoders with the LLM's inputs and diffusion decoders with the LLM's outputs. To facilitate this alignment, the authors curate an interleaved any-to-any modality instruction-following dataset. A key problem with current cross-modality alignment is the loss of visual information, which X-VILA addresses with a visual alignment mechanism built around a visual embedding highway (VEH) module. A resource-efficient training recipe is also proposed, and the resulting model is proficient in any-to-any modality conversations. X-VILA exhibits emergent cross-modality abilities even without corresponding training data, and the model is released as open source.

The model handles video, image, and audio at both the input and output stages. It relies on a two-phase alignment mechanism: textual alignment, which maps input and output representations into the LLM's textual embedding space, and visual alignment, which compensates for the limitations of textual alignment in preserving visual features.
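To make the textual-alignment idea concrete, here is a minimal PyTorch-style sketch of projecting frozen encoder features into the LLM's embedding space and interleaving them with text tokens. The module names, layer sizes, and helper function are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Illustrative sketch (not the paper's code): map features from a frozen
    image/video/audio encoder into the LLM's textual embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modality_tokens, enc_dim)
        # returns soft tokens of shape (batch, num_modality_tokens, llm_dim)
        return self.proj(feats)

def build_interleaved_sequence(text_embeds, modality_embeds_list):
    """Hypothetical helper: concatenate text-token embeddings with projected
    modality tokens along the sequence dimension to form the interleaved LLM input."""
    # text_embeds: (batch, T, llm_dim); each modality entry: (batch, M_i, llm_dim)
    return torch.cat([text_embeds, *modality_embeds_list], dim=1)
```

A symmetric output projector (not shown) would map the LLM's output embeddings to the conditioning space of each diffusion decoder, which is the other half of the textual-alignment step described above.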
The VEH module provides direct guidance to the visual decoders by routing visual features around the LLM, improving visual consistency between input and output; a sketch of this idea follows below. X-VILA is trained in three phases: (i) data-effective alignment, (ii) interleaved multi-modality pre-training, and (iii) X-to-X cross-modality instruction tuning.

The model is evaluated on several datasets, including ActivityNet Captions and WebVid, and shows strong cross-modality alignment, outperforming existing models such as NExT-GPT in visual consistency and cross-modality generation. X-VILA also exhibits emergent abilities, such as long-context cross-modality generation and new cross-modality tasks like image-to-audio and audio-to-image generation. It can produce multi-modality outputs that align with the given context, demonstrating its effectiveness in cross-modality understanding and generation.
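Below is a minimal sketch of how a VEH-style bypass could be wired, assuming PyTorch-like modules: visual encoder features skip the LLM and are fused with LLM-derived conditioning before reaching the diffusion decoder. The class name, shapes, and gated fusion are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class VisualEmbeddingHighwaySketch(nn.Module):
    """Illustrative sketch of the VEH idea (assumed design, not the paper's code):
    carry visual encoder features around the LLM and blend them with the
    LLM-derived conditioning that drives the visual diffusion decoder."""
    def __init__(self, vis_dim: int, cond_dim: int):
        super().__init__()
        self.to_cond = nn.Linear(vis_dim, cond_dim)  # project encoder features to conditioning space
        self.gate = nn.Parameter(torch.zeros(1))     # learned blend; starts as "LLM conditioning only"

    def forward(self, llm_cond: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # llm_cond:  (batch, N, cond_dim) conditioning decoded from the LLM's output tokens
        # vis_feats: (batch, N, vis_dim) features taken directly from the visual encoder
        # (equal token counts are assumed here purely to keep the sketch simple)
        highway = self.to_cond(vis_feats)
        return llm_cond + torch.tanh(self.gate) * highway  # fused conditioning for the diffusion decoder
```

The intuition is that fine-grained visual detail, which would otherwise be compressed through the LLM's textual bottleneck, reaches the decoder directly, consistent with the improved input-output visual consistency described above.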