22 Nov 2021 | Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
The paper introduces Florence, a new computer vision foundation model designed to generalize across a wide range of tasks with minimal task-specific customization. Florence is trained on a large-scale, diverse dataset of 900 million image-text pairs, enabling it to handle tasks such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Its architecture pairs a hierarchical Vision Transformer (CoSwin) image encoder with a language encoder, allowing it to extend representations from coarse to fine, from static images to dynamic video, and from a single modality to multiple modalities. Florence demonstrates strong zero-shot transfer and few-shot performance, achieving state-of-the-art results on benchmarks including ImageNet-1K, COCO, VQA, and Kinetics-600.
The paper also details the model's training infrastructure, which includes techniques to reduce memory consumption and increase training efficiency. Overall, Florence represents a significant advance in computer vision, offering a versatile and powerful foundation for a wide range of vision tasks.
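The zero-shot transfer described above follows the standard dual-encoder recipe: an image embedding is compared against text embeddings of class prompts by cosine similarity, and the most similar prompt gives the prediction. A minimal numpy sketch of that mechanism (the function names, dimensions, and temperature value are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_logits(image_emb, text_embs, temperature=0.07):
    """Cosine similarity between one image embedding and a set of
    class-prompt text embeddings, scaled by a temperature.
    (Temperature value is illustrative, not from the paper.)"""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    return (txt @ img) / temperature

# Toy random vectors standing in for real encoder outputs
# (Florence's CoSwin image encoder and its language encoder).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=256)            # one image embedding
text_embs = rng.normal(size=(3, 256))       # prompts for 3 classes

logits = zero_shot_logits(image_emb, text_embs)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax over classes
predicted_class = int(np.argmax(probs))     # zero-shot prediction
```

The same similarity structure, with a symmetric cross-entropy loss over an in-batch image-text similarity matrix, underlies the contrastive pre-training objective.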