Visual Knowledge in the Big Model Era: Retrospect and Prospect


5 Apr 2024 | Wenguan WANG, Yi YANG and Yunhe PAN
Visual knowledge is a new form of knowledge representation that can encapsulate visual concepts and their relations in a succinct, comprehensive, and interpretable manner, with deep roots in cognitive psychology. As knowledge about the visual world is recognized as an essential component of human cognition and intelligence, visual knowledge is poised to play a pivotal role in establishing machine intelligence. With recent advances in AI techniques, large AI models (or foundation models) have emerged as powerful tools that extract versatile patterns from broad data as implicit knowledge and abstract them into vast numbers of numeric parameters. To create AI machines empowered by visual knowledge in this new wave, we present a timely review that investigates the origins and development of visual knowledge in the pre-big-model era and emphasizes the opportunities and unique role of visual knowledge in the big model era.

Visual knowledge theory posits that next-generation AI must fully express visual concepts and their attributes, and reason about their transformations, compositions, comparisons, predictions, and narrations, through a unified, abstract, and interpretable form of representation. The emergence of large language models such as GPT-3 has transformed natural language processing, while the Segment Anything Model (SAM) has ushered in the era of visual foundation models in computer vision. However, large AI models still suffer from several deficiencies, including opacity, heavy data and computational resource demands, and the potential to generate nonsensical or unfaithful content. Visual knowledge, with its expressive and interpretable representation, manipulation, and reasoning capabilities, can potentially alleviate these weaknesses.

Visual knowledge is defined as stable mental representations of visual objects and of the commonalities in the inherent rules shared among various tasks. It is built from four essential components: visual concepts, visual relations, visual operations, and visual reasoning. Visual concepts are categories of visual objects defined by a prototype and a scope. Visual relations include geometric, temporal, semantic, functional, and causal relations. Visual operations are transformations over visual concepts or objects in space or time, such as composition, decomposition, replacement, combination, deformation, motion, comparison, destruction, restoration, and prediction. Visual reasoning is the process of applying visual concepts, relations, and operations to interpret visual data, solve problems, and make informed decisions.

In the pre-big-model era, visual knowledge was explored in fundamental computer vision tasks such as image classification and segmentation. Recent research has focused on modeling geometric, semantic, temporal, functional, and causal relations, as well as visual operations and reasoning. These studies demonstrate the potential of visual knowledge to enhance AI systems' ability to process and interpret visual information. However, challenges remain in capturing the scope of visual concepts, modeling geometric relations, and learning visual semantic relations.
The development of visual knowledge is crucial for creating more powerful AI systems that can overcome the weaknesses of large AI models.
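To make the four-component structure described above more concrete, the following is a minimal, illustrative Python sketch of how visual concepts (prototype plus scope), typed visual relations, visual operations, and a toy reasoning step might be organized as data structures. All class names, the distance-based scope test, and the query-matching "reasoning" are assumptions introduced here for exposition, not definitions from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List

class RelationType(Enum):
    GEOMETRIC = "geometric"
    TEMPORAL = "temporal"
    SEMANTIC = "semantic"
    FUNCTIONAL = "functional"
    CAUSAL = "causal"

@dataclass
class VisualConcept:
    """A category of visual objects, defined by a prototype and a scope."""
    name: str
    prototype: List[float]   # prototypical feature vector (placeholder representation)
    scope: float             # admissible deviation from the prototype

    def contains(self, features: List[float]) -> bool:
        """Check whether an observed object falls within the concept's scope
        (Euclidean distance to the prototype; purely illustrative)."""
        dist = sum((a - b) ** 2 for a, b in zip(self.prototype, features)) ** 0.5
        return dist <= self.scope

@dataclass
class VisualRelation:
    """A typed relation (geometric, temporal, ...) between two visual concepts."""
    relation_type: RelationType
    subject: VisualConcept
    obj: VisualConcept
    label: str               # e.g. "above", "before", "part-of"

# Visual operations are transformations over concepts or objects (composition,
# decomposition, deformation, prediction, ...); here they are just named callables.
VisualOperation = Callable[[VisualConcept], VisualConcept]

@dataclass
class VisualKnowledgeBase:
    """A toy container bundling the four components described in the text."""
    concepts: Dict[str, VisualConcept] = field(default_factory=dict)
    relations: List[VisualRelation] = field(default_factory=list)
    operations: Dict[str, VisualOperation] = field(default_factory=dict)

    def reason(self, query: str) -> List[VisualRelation]:
        """A minimal stand-in for visual reasoning: retrieve relations whose
        label matches a query over the stored concepts and relations."""
        return [r for r in self.relations if r.label == query]
```

In a real system the prototype would be a learned representation and the reasoning step far richer; the sketch only shows how the four components relate structurally.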