SEED-X is a unified and versatile multimodal foundation model that extends the capabilities of SEED-LLaMA with two key features: comprehending images of arbitrary sizes and aspect ratios, and multi-granularity image generation. The model can be fine-tuned for a range of real-world applications, such as interactive design, personal assistance, and content creation, and it supports both high-level image generation and low-level image manipulation, allowing it to handle diverse tasks across different domains. A visual tokenizer unifies image comprehension and generation, while a multi-granularity de-tokenization process enables high-precision image manipulation. In addition, dynamic resolution image encoding lets the model process images at any resolution and aspect ratio. SEED-X is pre-trained on large-scale multimodal data and further instruction-tuned to align with human instructions across various domains. It achieves competitive performance on multimodal benchmarks and state-of-the-art results in image generation. The code, datasets, and models are released to inspire future research on the potential of multimodal foundation models in real-world applications.
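To make the dynamic resolution idea concrete, the sketch below shows one common way such encoding can be implemented: partition an arbitrary-resolution image into a grid of fixed-size sub-images matched to its aspect ratio, plus a downsized global view, so each tile can be fed through a standard fixed-input vision encoder. This is a minimal illustrative sketch; the tile size, grid-selection rule, and function names are assumptions for exposition, not SEED-X's exact implementation.

```python
from PIL import Image

BASE_SIZE = 448  # assumed fixed ViT input resolution; illustrative only

def dynamic_resolution_grid(image: Image.Image, max_tiles: int = 9):
    """Split an arbitrary-resolution image into fixed-size sub-images.

    A minimal sketch of grid-based dynamic resolution encoding: choose the
    tile grid whose aspect ratio best matches the input, resize the image
    to fill that grid, and crop it into BASE_SIZE x BASE_SIZE tiles. A
    downsized global view is appended so the encoder also sees the whole
    image. All names and constants here are illustrative assumptions.
    """
    w, h = image.size
    # Enumerate candidate (cols, rows) grids within the tile budget and
    # pick the one whose aspect ratio is closest to the input's.
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    cols, rows = min(candidates, key=lambda g: abs(g[0] / g[1] - w / h))

    # Resize to fill the chosen grid exactly, then cut into tiles.
    resized = image.resize((cols * BASE_SIZE, rows * BASE_SIZE))
    tiles = [resized.crop((c * BASE_SIZE, r * BASE_SIZE,
                           (c + 1) * BASE_SIZE, (r + 1) * BASE_SIZE))
             for r in range(rows) for c in range(cols)]
    # A global thumbnail preserves overall layout alongside local detail.
    tiles.append(image.resize((BASE_SIZE, BASE_SIZE)))
    return tiles  # each tile is then encoded independently by the ViT
```

Under this scheme the token count grows with image resolution only up to the tile budget, which keeps encoding cost bounded while still exposing full-resolution detail to the model.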