Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data

7 June 2024 | Zhenwei Zhang, Shengming Zhang, Dong Ni, Zhaoguo Wei, Kongjun Yang, Shan Jin, Gan Huang, Zhen Liang, Li Zhang, Linling Li, Huijun Ding, Zhiguo Zhang, Jianhong Wang
This paper introduces the Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN), a framework for detecting depression risk by integrating audio, video, and text data. The model consists of three dedicated branches (Audio Branch, Video Branch, and Text Branch), each responsible for extracting salient features from its respective modality. These features are then combined by a multimodal fusion (MMF) module, which uses attention and residual mechanisms to merge the three modalities while retaining important information from each and minimizing feature loss.

To collect a rich, sensor-based dataset for depression risk detection, the authors designed an emotion elicitation paradigm consisting of reading and interviewing tasks. Data were collected from 1911 subjects, of whom 621 were classified as at risk for depression and 1290 as healthy. The AVTF-TBN model was trained and tested on this dataset and evaluated using Precision, Recall, and F1 Score. When using data from both the reading and interviewing tasks, the model achieves an F1 Score of 0.78, Precision of 0.76, and Recall of 0.81, and it outperforms other multimodal feature-processing frameworks on depression risk detection. The study highlights the effectiveness of the proposed paradigm and the AVTF-TBN model in capturing and integrating multimodal cues for accurate depression risk detection.
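The summary above does not include the authors' code. As a rough illustration of the kind of architecture described (three per-modality branches whose embeddings are fused with attention and a residual connection, followed by a binary risk classifier), here is a minimal PyTorch sketch. All module names (BranchEncoder, MMFusion, AVTFTBNSketch), layer sizes, input feature dimensions, and the specific use of multi-head self-attention are assumptions made for illustration, not the AVTF-TBN implementation itself.

```python
import torch
import torch.nn as nn


class BranchEncoder(nn.Module):
    """Generic per-modality branch: projects raw features into a shared embedding space.
    The real AVTF-TBN branches are modality-specific; this is a stand-in."""
    def __init__(self, in_dim: int, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MMFusion(nn.Module):
    """Hypothetical fusion module: self-attention over the three modality embeddings
    plus a residual connection, loosely following the abstract's description."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, video, text):
        # Treat the three modality embeddings as a length-3 sequence: (batch, 3, dim)
        tokens = torch.stack([audio, video, text], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(tokens + attended)   # residual connection + normalization
        return fused.mean(dim=1)               # pool across modalities -> (batch, dim)


class AVTFTBNSketch(nn.Module):
    """Three branches + fusion + binary classifier (at-risk vs. healthy). Sketch only."""
    def __init__(self, audio_dim=128, video_dim=512, text_dim=768, dim=256):
        super().__init__()
        self.audio_branch = BranchEncoder(audio_dim, dim)
        self.video_branch = BranchEncoder(video_dim, dim)
        self.text_branch = BranchEncoder(text_dim, dim)
        self.fusion = MMFusion(dim)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, audio, video, text):
        a = self.audio_branch(audio)
        v = self.video_branch(video)
        t = self.text_branch(text)
        return self.classifier(self.fusion(a, v, t))


if __name__ == "__main__":
    model = AVTFTBNSketch()
    # Dummy batch of 4 subjects with per-modality feature vectors
    logits = model(torch.randn(4, 128), torch.randn(4, 512), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 2])
```

In practice, a model of this shape would be trained with a standard cross-entropy loss and evaluated with Precision, Recall, and F1 Score, the same metrics reported for AVTF-TBN above.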