Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

7 June 2024 | Zhenwei Zhang, Shengming Zhang, Dong Ni, Zhaoguo Wei, Kongjun Yang, Shan Jin, Gan Huang, Zhen Liang, Li Zhang, Linling Li, Huijun Ding, Zhiguo Zhang and Jianhong Wang
This paper proposes the Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN), a framework for depression risk detection that integrates audio, video, and text data. The model consists of three branches (an Audio Branch, a Video Branch, and a Text Branch), each responsible for extracting features from its own modality. The branch outputs are then combined by a multimodal fusion (MMF) module, which uses attention and residual mechanisms to emphasize the contribution of each modality while minimizing feature loss. The research also introduces an emotion elicitation paradigm built on two tasks, reading and interviewing, which is used to collect a rich, sensor-based depression risk detection dataset. AVTF-TBN performs best when data from both tasks are used for detection, reaching an F1 score of 0.78, a precision of 0.76, and a recall of 0.81. These results confirm the validity of the paradigm and demonstrate the effectiveness of the model in detecting depression risk.
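The paper does not spell out the MMF module's implementation in this summary, but the described combination of attention and residual mechanisms over three modality feature vectors can be illustrated with a minimal PyTorch sketch. The class name `MMFusion`, the shared feature dimension, and the use of a scalar attention score per modality plus a mean-feature residual path are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class MMFusion(nn.Module):
    """Minimal sketch of an attention-plus-residual multimodal fusion block.

    Assumes the audio, video, and text branches each emit a feature vector of
    the same dimension `dim`; the real AVTF-TBN MMF module may differ.
    """

    def __init__(self, dim: int):
        super().__init__()
        # One scalar attention score per modality, computed from its features.
        self.score = nn.Linear(dim, 1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor,
                text: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([audio, video, text], dim=1)    # (B, 3, dim)
        weights = torch.softmax(self.score(feats), dim=1)   # (B, 3, 1), per-modality weights
        attended = (weights * feats).sum(dim=1)             # attention-weighted sum, (B, dim)
        residual = feats.mean(dim=1)                        # preserve pre-fusion information
        return self.proj(attended) + residual               # attention + residual fusion


# Example: fuse random branch outputs for a batch of 4 samples.
fusion = MMFusion(dim=256)
fused = fusion(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```

The residual path is what keeps each modality's original information available after fusion, matching the summary's point that the MMF module highlights vital information without discarding the pre-fusion features.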
The study also highlights the value of sensor-based data in mental health detection. The separate branches let the model selectively and fully extract effective features from each modality, and the MMF module then fuses the resulting feature vectors in a way that highlights vital information while preserving each modality's pre-fusion features, thereby improving performance. A comparison of the two tasks' ability to elicit emotion shows that the interviewing task provides deeper emotional stimulation than the reading task. Ablation experiments confirm both that fusing the feature vectors through the MMF module yields the best performance and that the multi-head attention (MHA) modules in the individual branches are effective. A decision fusion model that combines data from the two tasks achieves the best detection performance, with an F1 score of 0.78, a precision of 0.76, and a recall of 0.81, and comparisons with other models show that AVTF-TBN performs best across the different task data. Overall, the results indicate that AVTF-TBN effectively detects depression risk by integrating audio, video, and text data.
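The summary states that decisions from the reading-task and interviewing-task data are fused but does not give the fusion rule. The sketch below assumes one common choice, averaging the two task-specific risk probabilities and thresholding at 0.5; the function name and threshold are hypothetical, not taken from the paper.

```python
import torch

def decision_fusion(prob_reading: torch.Tensor, prob_interview: torch.Tensor,
                    threshold: float = 0.5) -> torch.Tensor:
    """Combine per-task depression-risk probabilities into a single decision.

    `prob_reading` and `prob_interview` stand for the outputs of two
    task-specific models; averaging then thresholding is only one plausible
    decision-fusion rule, not necessarily the one used in the paper.
    """
    fused = (prob_reading + prob_interview) / 2.0
    return (fused >= threshold).long()  # 1 = at risk, 0 = not at risk


# Example: three subjects scored by the two task-specific models.
labels = decision_fusion(torch.tensor([0.62, 0.30, 0.85]),
                         torch.tensor([0.71, 0.20, 0.55]))
print(labels)  # tensor([1, 0, 1])
```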