1996 | Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton | Content-Based Classification, Search, and Retrieval of Audio (IEEE MultiMedia)
The article presents a content-based classification, search, and retrieval system for audio data. The system reduces each sound to a vector of perceptual and acoustic features, so users can search for or retrieve sounds by feature values, by previously learned classes, or by similarity to a reference sound. The authors argue that conventional systems treat audio as an opaque collection of bytes and that more sophisticated indexing and retrieval methods are needed. They identify several ways users describe and access sounds: simile (resemblance to other sounds), acoustical and perceptual features, subjective features, and onomatopoeia. The system applies statistical techniques to acoustic features such as loudness, pitch, brightness, bandwidth, and harmonicity: classes are trained from example sounds, and new sounds are classified by computing distance measures over their feature statistics.

The authors demonstrate the system's effectiveness on examples including laughter, female speech, and touch-tones. The technology applies to audio databases, file systems, audio editors, and surveillance. Future directions include adding more analytic features, improving phrase-level content-based retrieval, source separation, and sound synthesis.
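To make the described pipeline concrete, here is a minimal sketch in Python (NumPy only) of the general approach: per-frame acoustic features, statistical summaries of each feature trajectory, class training from example sounds, and a variance-normalized distance for classification. All function names here are hypothetical, and the feature set is simplified (pitch and harmonicity are omitted); the published system also reportedly stores autocorrelation statistics of each trajectory and the sound's duration.

```python
import numpy as np

def frame_features(signal, sr, frame_len=1024, hop=512):
    """Per-frame features: loudness (RMS), brightness (spectral
    centroid), and bandwidth (spectral spread)."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame)) + 1e-12  # avoid divide-by-zero
        loudness = np.sqrt(np.mean(frame ** 2))
        centroid = np.sum(freqs * mag) / np.sum(mag)  # brightness
        spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * mag) / np.sum(mag))
        feats.append([loudness, centroid, spread])
    return np.array(feats)

def sound_vector(signal, sr):
    """Reduce a sound to a fixed-length vector: the mean and standard
    deviation of each feature trajectory."""
    f = frame_features(signal, sr)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])

def train_class(examples, sr):
    """A class model is the mean and variance of the example sounds'
    feature vectors."""
    vecs = np.array([sound_vector(x, sr) for x in examples])
    return vecs.mean(axis=0), vecs.var(axis=0) + 1e-9  # floor the variance

def distance(signal, sr, model):
    """Variance-normalized (Mahalanobis-style, diagonal-covariance)
    distance from a new sound to a trained class."""
    mu, var = model
    v = sound_vector(signal, sr)
    return np.sqrt(np.sum((v - mu) ** 2 / var))
```

A toy usage example, training a class on a few sine tones and comparing a similar tone against noise (a near tone should score a much smaller distance):

```python
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tones = [np.sin(2 * np.pi * f * t) for f in (300, 320, 340)]
model = train_class(tones, sr)
print(distance(np.sin(2 * np.pi * 310 * t), sr, model))  # small distance
print(distance(np.random.randn(sr), sr, model))          # large distance
```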