Chameleon is a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. The models are trained using a stable training approach, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. Chameleon is evaluated on a range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. It demonstrates broad capabilities, achieving state-of-the-art performance in image captioning, outperforming Llama-2 on text-only tasks, and performing non-trivial image generation. It also matches or exceeds the performance of larger models such as Gemini Pro and GPT-4V in human evaluations.
Chameleon uses a unified architecture with fully token-based representations for both image and text modalities. By quantizing images into discrete tokens, it applies the same transformer architecture to sequences of both image and text tokens, without separate encoders. This early-fusion approach allows seamless reasoning and generation across modalities but presents technical challenges in optimization stability and scaling. The paper introduces architectural innovations and training techniques to address these challenges, including query-key normalization and revised layer norms. These techniques enable the training of Chameleon-34B on 5x more tokens than Llama-2, achieving state-of-the-art performance on various benchmarks.
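To make the query-key normalization idea concrete, here is a minimal sketch, assuming PyTorch: queries and keys are layer-normalized per head before the attention logits are computed, which bounds logit growth and helps stabilize training. The module and parameter names (QKNormAttention, n_heads, and so on) are illustrative assumptions, not the released Chameleon implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with per-head layer norm on queries and keys (QK-norm sketch)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # LayerNorm applied per head to queries and keys before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Normalize queries and keys so the attention logits stay bounded.
        q, k = self.q_norm(q), self.k_norm(k)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        return self.out(attn)
```

Because images are quantized into discrete tokens, a module like this can be applied uniformly to an interleaved sequence of image and text tokens, with no modality-specific encoder.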
Chameleon is evaluated on a diverse set of tasks, including visual question answering, image captioning, text generation, and image generation. It outperforms models like Flamingo, IDEFICS, and Llava-1.5 in image captioning and matches models like Mixtral 8x7B and Gemini Pro in text-only tasks. In a human evaluation, Chameleon-34B outperforms Gemini Pro and GPT-4V in mixed-modal generation, achieving a 60.4% preference rate against Gemini Pro and a 51.6% preference rate against GPT-4V.
Chameleon is trained on a large-scale dataset consisting of text-only, text-image, and interleaved text-image data. Pre-training proceeds in two stages: the first uses a mixture of text-only and text-image data, and the second incorporates higher-quality datasets. The models are trained with a combination of architectural innovations and training techniques, including query-key normalization, dropout, and normalization reordering, to ensure stable training, as sketched below.
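As an illustration of the normalization reordering, the sketch below (again assuming PyTorch) applies layer norm to each sub-layer's output before the residual addition, combined with dropout, in the spirit the summary describes. The block structure and names (ReorderedNormBlock, and so on) are hypothetical rather than the paper's exact implementation, and the causal mask is omitted for brevity.

```python
import torch
import torch.nn as nn

class ReorderedNormBlock(nn.Module):
    """Transformer block with norm applied to sub-layer outputs (reordering sketch)."""

    def __init__(self, dim: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.attn_norm = nn.LayerNorm(dim)
        self.ffn_norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize *after* the attention sub-layer, then add the residual.
        a, _ = self.attn(x, x, x, need_weights=False)
        x = x + self.dropout(self.attn_norm(a))
        # Same pattern for the feed-forward sub-layer.
        x = x + self.dropout(self.ffn_norm(self.ffn(x)))
        return x
```

Normalizing the sub-layer output rather than its input keeps the residual stream from growing uncontrollably across layers, which is one way the training instabilities of large early-fusion models can be mitigated.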
Chameleon is evaluated on a variety of tasks, including text-only reasoning, math problems, and world knowledge. It outperforms Llama-2 on several tasks; Chameleon-34B even surpasses Llama-2 70B on 5 of 8 tasks. It also performs well on image captioning and visual question-answering tasks, outperforming models like Flamingo, IDEFICS, and Llava-1.5.