Speech emotion recognition via graph-based representations

2024 | Anastasia Pentari, George Kafentzis & Manolis Tsiknakis
This paper presents a novel approach to speech emotion recognition (SER) using graph-based representations. The method leverages graph theory to extract statistical and structural information from speech signals, which is then used as features for emotion classification. Evaluated with a Random Forest classifier under a Leave-One-Speaker-Out Cross-Validation (LOSO-CV) scheme, the approach outperforms existing methods on three public datasets: EMODB (German, acted), AESDD (Greek, acted), and DEMoS (Italian, in-the-wild), with average increases in Unweighted Average Recall (UAR) of 18%, 8%, and 13%, respectively.

The paper first discusses the challenges of SER, including the variability of speech signals across speakers, languages, and cultures, and the limitations of traditional machine learning and deep learning approaches, motivating the need for alternative methods that can handle these challenges.

The proposed method derives graph-based features from two adjacency matrices: one encoding structural information and the other statistical information of the speech signal. These features are combined into a speaker-based emotional motif, a unique signature of each speaker's emotional state.

The method is compared against two state-of-the-art approaches, one based on hand-crafted features and one on deep learning architectures, and achieves higher classification accuracy, particularly on imbalanced datasets. A feature-importance analysis shows that the structural-based density and clustering coefficient are the most informative graph-based features for SER. In sum, the proposed method combines graph theory with machine learning to analyze speech signals.
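The two features the paper singles out, graph density and clustering coefficient, can be computed from any adjacency structure. The paper's exact adjacency-matrix constructions are not reproduced here; as a minimal sketch, a natural visibility graph (a common, generic way to map a sampled time series to a graph, and an assumption on our part, not the authors' method) stands in for the signal-to-graph step:

```python
# Sketch: map a sampled "speech frame" to a graph and compute the two
# features the paper reports as most informative (density, clustering).
# The visibility-graph construction below is a stand-in assumption, not
# the paper's actual structural/statistical adjacency matrices.
import math

def visibility_graph(x):
    """Adjacency sets of the natural visibility graph of samples x."""
    n = len(x)
    adj = {i: set() for i in range(n)}
    for a in range(n):
        for b in range(a + 1, n):
            # a and b are linked if every sample strictly between them
            # lies below the straight line joining (a, x[a]) and (b, x[b]).
            if all(x[c] < x[b] + (x[a] - x[b]) * (b - c) / (b - a)
                   for c in range(a + 1, b)):
                adj[a].add(b)
                adj[b].add(a)
    return adj

def density(adj):
    """Fraction of possible edges that are present: 2E / (N(N-1))."""
    n = len(adj)
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * edges / (n * (n - 1))

def avg_clustering(adj):
    """Mean of per-node clustering coefficients (triangle density)."""
    coeffs = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Toy "frame": a few samples of a decaying sinusoid.
frame = [math.sin(0.7 * t) * math.exp(-0.05 * t) for t in range(32)]
g = visibility_graph(frame)
features = [density(g), avg_clustering(g)]
```

Per-frame feature vectors like `features` would then be aggregated per utterance before classification; the aggregation scheme is likewise not specified here.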
It provides a unique emotional identity for each speaker's emotional state and shows promising classification performance across the three datasets. The paper concludes that the proposed approach is more effective than existing methods and has the potential to improve emotion recognition in a range of real-world applications.
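The evaluation protocol named above (LOSO-CV scored with UAR) can be sketched in a few lines; the classifier itself (a Random Forest in the paper) is omitted here, and the helper names are ours:

```python
# Sketch of the evaluation protocol: Leave-One-Speaker-Out splits plus
# Unweighted Average Recall (UAR). Function names are illustrative.

def loso_splits(speakers):
    """Yield (held_out_speaker, train_idx, test_idx), one fold per speaker."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test

def uar(y_true, y_pred):
    """Mean of per-class recalls: minority classes in an imbalanced
    dataset count as much as majority ones, unlike plain accuracy."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

Because no speaker appears in both the train and test indices of a fold, LOSO-CV measures generalization to unseen speakers, which is exactly the speaker-variability challenge the paper highlights.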