Chameleon is a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. The models are trained using a stable training approach, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. Chameleon is evaluated on a range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. It demonstrates broad capabilities, achieving state-of-the-art performance in image captioning, outperforming Llama-2 on text-only tasks, and performing non-trivial image generation. It also matches or exceeds the performance of larger models such as Gemini Pro and GPT-4V in human evaluations.
Chameleon uses a unified architecture with fully token-based representations for both image and text modalities. By quantizing images into discrete tokens, it applies the same transformer architecture to sequences of both image and text tokens, without separate encoders. This early-fusion approach allows seamless reasoning and generation across modalities but presents technical challenges in optimization stability and scaling. The paper introduces architectural innovations and training techniques to address these challenges, including query-key normalization and revised layer norms. These techniques enable the training of Chameleon-34B on 5x more tokens than Llama-2, achieving state-of-the-art performance on various benchmarks.
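To make the query-key normalization idea concrete, here is a minimal sketch, assuming PyTorch: queries and keys are layer-normalized per head before the attention logits are computed, which bounds logit growth and helps stabilize training. The module and parameter names (QKNormAttention, n_heads, and so on) are illustrative assumptions, not the released Chameleon implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with per-head layer norm on queries and keys (QK-norm sketch)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # LayerNorm applied per head to queries and keys before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Normalize queries and keys so the attention logits stay bounded.
        q, k = self.q_norm(q), self.k_norm(k)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        return self.out(attn)
```

Because images are quantized into discrete tokens, a module like this can be applied uniformly to an interleaved sequence of image and text tokens, with no modality-specific encoder.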
Chameleon is evaluated on a diverse set of tasks, including visual question answering, image captioning, text generation, and image generation. It outperforms models like Flamingo, IDEFICS, and Llava-1.5 in image captioning and matches models like Mixtral 8x7B and Gemini Pro in text-only tasks. In a human evaluation, Chameleon-34B outperforms Gemini Pro and GPT-4V in mixed-modal generation, achieving a 60.4% preference rate against Gemini Pro and a 51.6% preference rate against GPT-4V.
Chameleon is trained on a large-scale dataset consisting of text-only, text-image, and interleaved text-image data. Pre-training proceeds in two stages: the first uses a mixture of text-only and text-image data, and the second incorporates higher-quality datasets. The models are trained with a combination of architectural innovations and training techniques, including query-key normalization, dropout, and normalization reordering, to ensure stable training, as sketched below.
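As an illustration of the normalization reordering, the sketch below (again assuming PyTorch) applies layer norm to each sub-layer's output before the residual addition, combined with dropout, in the spirit the summary describes. The block structure and names (ReorderedNormBlock, and so on) are hypothetical rather than the paper's exact implementation, and the causal mask is omitted for brevity.

```python
import torch
import torch.nn as nn

class ReorderedNormBlock(nn.Module):
    """Transformer block with norm applied to sub-layer outputs (reordering sketch)."""

    def __init__(self, dim: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.attn_norm = nn.LayerNorm(dim)
        self.ffn_norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize *after* the attention sub-layer, then add the residual.
        a, _ = self.attn(x, x, x, need_weights=False)
        x = x + self.dropout(self.attn_norm(a))
        # Same pattern for the feed-forward sub-layer.
        x = x + self.dropout(self.ffn_norm(self.ffn(x)))
        return x
```

Normalizing the sub-layer output rather than its input keeps the residual stream from growing uncontrollably across layers, which is one way the training instabilities of large early-fusion models can be mitigated.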
Chameleon is evaluated on a variety of tasks, including text-only reasoning, math problems, and world knowledge. It outperforms Llama-2 on several tasks; Chameleon-34B even surpasses Llama-2 70B on 5 of 8 tasks. It also performs well on image captioning and visual question-answering tasks, outperforming models like Flamingo, IDEFICS, and Llava-1.5.