2013 | Florian Eyben, Felix Weninger, Florian Groß, Björn Schuller
The paper presents recent advancements in openSMILE, an open-source multimedia feature extraction toolkit. Version 2.0 integrates feature extraction paradigms from speech, music, and general sound events with basic video features, enabling multi-modal processing. The toolkit supports joint processing of audio and video descriptors, allowing for time synchronization, online incremental processing, and offline batch processing. It also extracts statistical functionals such as moments, peaks, and regression parameters. Postprocessing capabilities include statistical classifiers like support vector machines and file export for toolkits like Weka and HTK. Available low-level descriptors include popular features from speech, music, and video, as well as voice activity detection, pitch tracking, and face detection. openSMILE is implemented in C++ and is fast, cross-platform, and modular, with a focus on real-time and incremental processing. It has been widely used in research, particularly in computational paralinguistics, and has been featured in over 50 accepted research papers. The paper discusses the design principles, functionality, and case studies demonstrating openSMILE's effectiveness in various multimedia recognition tasks, including paralinguistic information extraction, speaker characterization in web videos, and violence detection in movies. Future developments include a joint front-end for audio and video input, online audio enhancement algorithms, and a TCP/IP network interface.
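
To make the LLD-plus-functionals pipeline concrete, the following is a minimal Python sketch of that pattern, not openSMILE's implementation. It assumes two hypothetical frame-level descriptors (RMS energy and zero-crossing rate) and a few of the functionals the paper mentions (moments and a linear-regression slope).

```python
# Illustrative sketch (not openSMILE code): the LLD-plus-functionals pattern,
# with hypothetical frame-level descriptors and a few statistical functionals.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (e.g., 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def low_level_descriptors(frames):
    """Two simple frame-wise LLDs: RMS energy and zero-crossing rate."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return {"rms": rms, "zcr": zcr}

def functionals(contour):
    """Statistical functionals over one LLD contour: moments and regression slope."""
    t = np.arange(len(contour))
    slope, _ = np.polyfit(t, contour, 1)  # linear-regression slope over time
    return {
        "mean": contour.mean(),
        "stddev": contour.std(),
        "skewness": ((contour - contour.mean()) ** 3).mean() / (contour.std() ** 3 + 1e-12),
        "slope": slope,
    }

if __name__ == "__main__":
    x = np.random.randn(16000)  # stand-in for 1 s of 16 kHz audio
    llds = low_level_descriptors(frame_signal(x))
    features = {f"{name}_{fn}": v
                for name, contour in llds.items()
                for fn, v in functionals(contour).items()}
    print(features)  # one fixed-length feature vector per segment
```

In openSMILE itself this pattern is realized by configurable, incrementally running C++ components, and the resulting segment-level feature vectors can be exported to formats consumed by Weka or HTK, as described in the paper.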