This paper proposes ViLD, a method for open-vocabulary object detection that leverages knowledge distillation from a pretrained open-vocabulary image classification model. The method distills knowledge from a teacher model (e.g., CLIP or ALIGN) into a student detector, enabling the student to detect objects described by arbitrary text inputs. The key idea is to align the region embeddings of detected boxes with the text and image embeddings inferred by the teacher model. ViLD is built on a two-stage detector: the first stage generates region proposals, and the second stage classifies these proposals using the teacher model's knowledge. The method is evaluated on the LVIS dataset, where it achieves 16.1 mask AP$_r$ with a ResNet-50 backbone, outperforming the supervised counterpart by 3.8. When trained with the stronger teacher ALIGN, ViLD achieves 26.3 AP$_r$, close to the 2020 LVIS Challenge winner. ViLD transfers directly to other datasets without fine-tuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO, and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state of the art by 4.8 on novel AP and 11.4 on overall AP. The method is efficient and generalizable, as it works with off-the-shelf open-vocabulary image classifiers. The code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
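
To make the alignment idea concrete, below is a minimal sketch (in PyTorch, not the paper's released TensorFlow code) of the two losses this summary describes: classifying region embeddings against category text embeddings, and distilling CLIP image embeddings of cropped proposals into those region embeddings. All names (`region_embeds`, `text_embeds`, `clip_image_embeds`, the temperature value) are illustrative assumptions, not identifiers from the official implementation.

```python
# Hypothetical sketch of ViLD-style training losses; tensor names and the
# temperature are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def vild_text_loss(region_embeds, text_embeds, labels, temperature=0.01):
    """Classify detector region embeddings against category text embeddings.

    region_embeds: (N, D) embeddings of proposals from the detector head.
    text_embeds:   (C, D) teacher text embeddings of category names.
    labels:        (N,) ground-truth class indices (base categories).
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine-similarity logits, scaled by a temperature, then cross-entropy.
    logits = region_embeds @ text_embeds.t() / temperature
    return F.cross_entropy(logits, labels)


def vild_image_distill_loss(region_embeds, clip_image_embeds):
    """Pull region embeddings toward the teacher's embeddings of cropped proposals.

    clip_image_embeds: (M, D) teacher image embeddings of cropped region proposals.
    """
    return F.l1_loss(F.normalize(region_embeds, dim=-1),
                     F.normalize(clip_image_embeds, dim=-1))
```

At inference time, the same cosine-similarity scoring can be run against text embeddings of an arbitrary category list, which is what allows the detector to handle novel classes without retraining.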