Jia Li†, Member, IEEE, and James Z. Wang†, Member, IEEE
The paper presents a statistical modeling approach to automatic linguistic indexing of pictures, which aims to translate the content of images into linguistic terms automatically. Categorized images are used to train a dictionary of statistical models, each representing one concept. Images are characterized by wavelet-based features, and a two-dimensional multiresolution hidden Markov model (2-D MHMM) is built for each concept. The likelihood of an image under a concept's 2-D MHMM measures how strongly the image is associated with that concept.
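The dictionary-of-models idea can be illustrated with a minimal sketch. It is not the paper's method: a simple diagonal Gaussian stands in for the 2-D MHMM, and the two-dimensional feature vectors, the concept names, and the helpers `fit_concept`, `log_likelihood`, and `rank_concepts` are all hypothetical. What it keeps from the description above is the structure: one model per concept, each fitted only on that concept's images, with concepts ranked by the likelihood they assign to a new image.

```python
import math

def fit_concept(features):
    """Fit a diagonal Gaussian (a stand-in for the 2-D MHMM) to one concept's vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # A small floor keeps every variance strictly positive.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in features) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def log_likelihood(model, x):
    """Log-likelihood of feature vector x under one concept's model."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def rank_concepts(dictionary, x, top_k=2):
    """Return the top_k concepts whose models assign x the highest likelihood."""
    ranked = sorted(dictionary,
                    key=lambda c: log_likelihood(dictionary[c], x),
                    reverse=True)
    return ranked[:top_k]

# Toy, made-up training features; real inputs would be wavelet-based features.
training = {
    "snow":   [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15]],
    "forest": [[0.20, 0.80], [0.30, 0.90], [0.25, 0.70]],
    "sunset": [[0.60, 0.50], [0.50, 0.40], [0.55, 0.60]],
}

# Each concept is fitted independently, so adding or retraining one concept
# never touches the models of the others.
dictionary = {concept: fit_concept(feats) for concept, feats in training.items()}

ranked = rank_concepts(dictionary, [0.82, 0.18])
print(ranked)
```

Because each entry in `dictionary` depends only on its own training images, a concept can be added or retrained by recomputing that single entry.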
The system is evaluated on a database of 600 concepts, each with about 40 training images, and tested on 4,600 images outside the training set. The results indicate good accuracy and show the system's potential for linguistic indexing of photographic images. The approach has several advantages: models for different concepts can be trained and retrained independently, a large number of concepts can be handled, and spatial relations among image pixels are taken into account. Its limitations include the reliance on 2-D images, which carry no sense of object size, and potential bias in the training database. Future work includes improving indexing speed, using rule-based systems to eliminate conflicting semantics, and weighting the assigned words so that descriptions are more appropriate.