22 Nov 2021 | Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
Florence is a new computer vision foundation model that expands visual representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). It is trained on Web-scale image-text data, a dataset of 900 million image-text pairs, using a unified image-text contrastive learning objective. The model uses a two-tower architecture that pairs a hierarchical Vision Transformer as the image encoder with a language encoder. Beyond this pretrained backbone, Florence is extended to learn object-level visual representations and fine-grained vision-language representations, and it is adapted to video recognition by modifying the image encoder for video processing. As a result, it can be easily adapted to a range of computer vision tasks, including classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Florence performs strongly in transfer learning, outperforming existing models in zero-shot transfer, few-shot learning, linear probing, and fine-tuning on various tasks. It achieves new state-of-the-art results on 44 benchmarks, such as ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600. Florence is designed to be scalable and efficient, with a training infrastructure that reduces memory usage and improves training throughput. Overall, it is a general-purpose vision system that can be adapted to a wide range of vision tasks and applications.
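The unified image-text contrastive objective mentioned above can be illustrated with a small sketch. The summary does not give the exact formulation, so the following is an assumption-laden illustration rather than Florence's implementation: it assumes each image-text pair carries a label, that every pair sharing a label counts as a positive, and that the loss is the batch-averaged sum of an image-to-text and a text-to-image term. The function names, the `labels` argument, and the default `temperature` of 0.07 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def directional_loss(queries, keys, labels, temperature):
    """Contrastive loss in one direction (queries matched against keys).

    Assumption: every key whose label equals the query's label is a positive,
    and the log-probabilities of all positives are averaged.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        log_probs = log_softmax(logits)
        positives = [k for k, y in enumerate(labels) if y == labels[i]]
        total += -sum(log_probs[k] for k in positives) / len(positives)
    return total / len(queries)


def unified_contrastive_loss(image_emb, text_emb, labels, temperature=0.07):
    """Sum of image-to-text and text-to-image terms (hypothetical sketch)."""
    images = [l2_normalize(v) for v in image_emb]
    texts = [l2_normalize(v) for v in text_emb]
    return (directional_loss(images, texts, labels, temperature)
            + directional_loss(texts, images, labels, temperature))


# Toy embeddings standing in for the outputs of the two towers.
aligned = unified_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]],
                                   labels=[0, 1])
swapped = unified_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]],
                                   labels=[0, 1])
print(aligned < swapped)
```

When every pair in the batch has a distinct label, this sketch reduces to the usual symmetric image-text contrastive loss with one positive per row. Shared labels let several texts count as matches for the same image, which is one plausible reading of "unified" here.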