4 April 2024 | Nawshad Farruque¹ · Randy Goebel¹ · Sudhakar Sivapalan² · Osmar R. Zaiane¹
This paper presents a semi-supervised learning (SSL) framework for depression symptoms detection (DSD) from social media text. The framework combines two models: a state-of-the-art language model pre-trained on a large corpus of mental health forum text and fine-tuned on a clinician-annotated DSD dataset, and a zero-shot learning model for DSD. Together they harvest depression symptom-related samples from a large self-curated depressive tweets repository (DTR), which is built from the tweets of self-disclosed depressed users. The clinician-annotated dataset is the largest of its kind. The SSL process iteratively retrains the initial DSD model on the harvested data, and the resulting final dataset is also the largest of its kind. The DSD and depression post detection models trained on this dataset achieve significantly better accuracy than their initial versions. The paper discusses the stopping criteria and limitations of the SSL process, elaborates on the underlying constructs that play a vital role in it, and examines the distribution of depression symptoms in the datasets and the impact of data harvesting on model accuracy. It also highlights the importance of linguistic features, such as n-grams, psycholinguistic and sentiment lexicons, and word and sentence embeddings, for detecting depression from social media text. The study concludes that the SSL framework is effective at curating small but distributionally relevant samples, in terms of both sample distribution and bi-gram distribution for all labels.
The paper also discusses the limitations of the study, including the small size of the overall dataset and the lack of continuous human annotation in the iterative harvesting process.
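The iterative harvesting idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the two keyword-based models are hypothetical toy stand-ins for the fine-tuned language model and the zero-shot model, and both the agreement rule (keep a sample only when the two models predict the same label) and the stopping rule (stop when a round harvests nothing) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Labeled = List[Tuple[str, str]]      # (text, symptom label) pairs
Model = Callable[[str], Set[str]]    # text -> set of predicted symptom labels


def train_keyword_model(data: Labeled) -> Model:
    """Toy stand-in for fine-tuning: remember the words seen with each label."""
    vocab: Dict[str, Set[str]] = {}
    for text, label in data:
        vocab.setdefault(label, set()).update(text.lower().split())

    def predict(text: str) -> Set[str]:
        words = set(text.lower().split())
        # Predict a label when at least two of its known words occur in the text.
        return {lab for lab, known in vocab.items() if len(words & known) >= 2}

    return predict


def make_zero_shot(descriptions: Dict[str, str]) -> Model:
    """Toy stand-in for zero-shot learning: match text against label descriptions."""
    def predict(text: str) -> Set[str]:
        words = set(text.lower().split())
        return {lab for lab, desc in descriptions.items()
                if words & set(desc.lower().split())}

    return predict


def harvest(seed: Labeled, pool: List[str], zero_shot: Model,
            max_rounds: int = 5) -> Labeled:
    """Retrain on seed + harvested data until a round harvests nothing new."""
    train, remaining = list(seed), list(pool)
    for _ in range(max_rounds):
        model = train_keyword_model(train)          # retrain on current data
        new, keep = [], []
        for text in remaining:
            agreed = model(text) & zero_shot(text)  # both models must agree
            if agreed:
                new.extend((text, lab) for lab in sorted(agreed))
            else:
                keep.append(text)
        if not new:                                 # assumed stopping criterion
            break
        train.extend(new)
        remaining = keep
    return train


# Hypothetical seed data, label descriptions, and unlabeled pool.
seed = [("i cannot sleep at night", "sleep"),
        ("i feel so tired and drained", "fatigue")]
descriptions = {"sleep": "sleep insomnia awake",
                "fatigue": "tired exhausted drained energy"}
pool = ["awake again with insomnia",
        "awake all night cannot sleep again",
        "exhausted and drained every day",
        "lovely weather today"]

harvested = harvest(seed, pool, make_zero_shot(descriptions))
for text, label in harvested:
    print(f"{label}: {text}")
```

In this toy run, "awake again with insomnia" is not picked up in the first round, because the seed-trained model has not yet seen its vocabulary; it is harvested only after retraining on the first round's samples. That is the intuition behind retraining iteratively rather than harvesting once.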