CLAY is a controllable, large-scale generative model for creating high-quality 3D assets, turning human imagination into intricate 3D digital structures. It supports text, image, and 3D-aware controls, and is built on a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT) that together extract rich 3D priors. CLAY uses 1.5 billion parameters to generate 3D geometry and also produces physically-based rendering (PBR) textures, taking assets from conceptual designs to production-ready models.
The model is trained on a large 3D dataset prepared through a standardized pipeline that includes remeshing and annotation with GPT-4V. The architecture supports efficient generation of high-quality 3D geometry and textures under several conditioning modalities, including text, images, voxels, and bounding boxes. Because it generates diverse 3D assets with high fidelity and realism, CLAY suits applications in gaming, film, and virtual simulation. Its training and adaptation processes, including LoRA fine-tuning and multi-view material diffusion, give precise control over 3D asset creation across a wide range of objects with intricate details and textures.
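
To make the LoRA fine-tuning mentioned above concrete, here is a minimal NumPy sketch of the generic low-rank adaptation idea: a pretrained weight matrix stays frozen while a small trainable update `B @ A` is added to it. This illustrates the general technique only and is not CLAY's code; the class name `LoRALinear`, the layer sizes, the rank, and the `alpha` scaling value are all assumptions chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (generic LoRA sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))               # trainable up-projection, zero init
        self.scale = alpha / rank                      # hypothetical scaling choice

    def __call__(self, x):
        base = x @ self.weight.T                       # output of the frozen layer
        update = (x @ self.A.T) @ self.B.T             # low-rank correction
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size


# Illustrative sizes only; these are not taken from CLAY.
rng = np.random.default_rng(1)
W = rng.normal(size=(64, 128))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 128))
y = layer(x)
print(y.shape, layer.trainable_params(), W.size)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the much smaller `A` and `B` matrices would be updated during fine-tuning.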