7 Mar 2024 | Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
AnyGPT is a unified multimodal large language model that uses discrete representations to process speech, text, images, and music. It can be trained stably without any changes to the current large language model (LLM) architecture or training paradigms, relying instead on data-level preprocessing to integrate new modalities into the LLM, much as one would add a new language.

Architecturally, modality-specific tokenizers compress raw multimodal data into discrete tokens, which the LLM models autoregressively alongside text; de-tokenizers then convert the generated discrete representations back into the original modalities.

On the data side, a text-centric multimodal dataset was built for pre-training, and generative models were used to synthesize AnyInstruct-108k, a large-scale any-to-any multimodal instruction dataset of 108k multi-turn conversations that intricately interweave the different modalities.

Experiments cover image understanding and generation, speech recognition and synthesis, and music understanding and generation. AnyGPT achieves performance comparable to specialized models across all modalities, and extensive case studies show it handling arbitrary combinations of multimodal inputs and outputs in any-to-any multimodal dialogue, supporting the claim that discrete representations can effectively and conveniently unify multiple modalities within a single language model. The stated contributions are the AnyGPT model itself, a token-based any-to-any multimodal language model, and the AnyInstruct-108k dataset of 108k multi-turn dialogues with interleaved multimodal elements.

Limitations include the longer context required for multimodal content and the difficulty of handling such diverse data while optimizing performance. Future work includes strengthening the underlying LLM, improving the tokenizers, and extending the context length for multimodal content.
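To make the token-based pipeline concrete, here is a minimal, self-contained sketch of the core idea: each modality's discrete codes are shifted into their own ID range appended after the text vocabulary, so the LLM sees a single unified token stream, and generated IDs are routed back to the matching de-tokenizer. All names, vocabulary sizes, and the flat text-then-image-then-speech ordering are illustrative assumptions, not AnyGPT's actual API (real sequences interleave modalities within a dialogue).

```python
# Sketch of a token-based any-to-any pipeline in the spirit of AnyGPT.
# Sizes and names below are assumptions for illustration only.
from dataclasses import dataclass
from typing import List

TEXT_VOCAB_SIZE = 32_000
IMAGE_CODEBOOK_SIZE = 8_192   # assumed size of an image tokenizer codebook
SPEECH_CODEBOOK_SIZE = 1_024  # assumed size of a speech codec codebook

IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPEECH_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE


@dataclass
class MultimodalSample:
    text_ids: List[int]
    image_codes: List[int]   # discrete codes from the image tokenizer
    speech_codes: List[int]  # discrete codes from the speech tokenizer


def to_unified_sequence(sample: MultimodalSample) -> List[int]:
    """Flatten a multimodal sample into one token sequence for the LLM.

    Each modality's codes are shifted into a disjoint ID range, so the
    language model operates over a single discrete vocabulary.
    """
    seq = list(sample.text_ids)
    seq += [IMAGE_OFFSET + c for c in sample.image_codes]
    seq += [SPEECH_OFFSET + c for c in sample.speech_codes]
    return seq


def split_generated_sequence(seq: List[int]) -> MultimodalSample:
    """Route generated IDs back to the de-tokenizer of the matching modality."""
    text_ids, image_codes, speech_codes = [], [], []
    for tok in seq:
        if tok < IMAGE_OFFSET:
            text_ids.append(tok)
        elif tok < SPEECH_OFFSET:
            image_codes.append(tok - IMAGE_OFFSET)
        else:
            speech_codes.append(tok - SPEECH_OFFSET)
    return MultimodalSample(text_ids, image_codes, speech_codes)


if __name__ == "__main__":
    sample = MultimodalSample(text_ids=[5, 17, 901],
                              image_codes=[12, 4095],
                              speech_codes=[3, 800])
    unified = to_unified_sequence(sample)
    assert split_generated_sequence(unified) == sample
    print(unified)
```

The point of the round trip is that, once every modality lives in one shared ID space, pre-training and instruction tuning reduce to ordinary next-token prediction over mixed sequences.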
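The interleaved multi-turn dialogues of AnyInstruct-108k can likewise be pictured as ordinary LM training text once each modality segment is replaced by its discrete codes wrapped in boundary markers. The sketch below is a guess at what such serialization might look like; the field names and the boundary markers (`<soi>`/`<eoi>`, `<sosp>`/`<eosp>`) are illustrative assumptions, not the dataset's published schema.

```python
# Illustrative serialization of an interleaved multi-turn dialogue
# (in the style of AnyInstruct-108k). Schema and markers are assumed.
dialogue = [
    {"role": "user",      "text": "Describe this photo and read it aloud.",
     "image": "photo_0042.jpg"},
    {"role": "assistant", "text": "A lighthouse at sunset.",
     "speech": "reply_0042.wav"},
]


def serialize(dialogue, tokenize_image, tokenize_speech):
    """Flatten one dialogue into a string where every modality appears as
    discrete codes wrapped in boundary markers, ready for LM training."""
    parts = []
    for turn in dialogue:
        parts.append(f"[{turn['role']}] {turn.get('text', '')}")
        if "image" in turn:
            codes = tokenize_image(turn["image"])      # discrete image codes
            parts.append("<soi>" + " ".join(map(str, codes)) + "<eoi>")
        if "speech" in turn:
            codes = tokenize_speech(turn["speech"])    # discrete speech codes
            parts.append("<sosp>" + " ".join(map(str, codes)) + "<eosp>")
    return "\n".join(parts)


# Stub tokenizers so the sketch runs without real models or media files.
print(serialize(dialogue,
                tokenize_image=lambda path: list(range(8)),
                tokenize_speech=lambda path: list(range(4))))
```

Serialized this way, a multi-turn, multi-modality conversation is just a longer token sequence, which is why the paper's approach needs no architectural change to the LLM, only more (and longer) context.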