Multi-task opinion-enhanced hybrid BERT model for mental health analysis

The dataset employed in this study is an extensive, carefully assembled collection of statements from several platforms, each annotated with a mental health status. It combines several existing Kaggle datasets: the 3k Conversations Dataset for Chatbot, Depression Reddit Cleaned, Human Stress Prediction, Bipolar Mental Health Dataset, Reddit Mental Health Data, Students Anxiety and Depression Dataset, Suicidal Mental Health Dataset, and Suicidal Tweet Detection Dataset, all of which are resources related to mental health and anxiety prediction. Each statement in the dataset is associated with one of seven mental health statuses: Normal, Depression, Suicidal, Anxiety, Stress, Bipolar, or Personality Disorder. Every entry was labeled with a particular mental health condition and gathered from a variety of sources, such as Reddit and other social media platforms. The dataset, obtained from Kaggle, is already labeled and does not include any user-level information, such as the number of posts per user.
The dataset, which is organized with variables such as unique_id, statements, and mental health status, is a valuable resource for researching mental health trends, developing sophisticated mental health chatbots, and conducting in-depth sentiment analyses.

Mental health status and sentiment distributions in the dataset.
Figure 1 presents important insights into the sentiments and mental health conditions represented in the sample. The distribution of mental health statuses is shown in Fig. 1a, where “Normal” accounts for the largest share (31%), followed by “Depression” (20.2%) and “Suicidal” (20.2%). The remaining categories account for smaller shares: “Anxiety” (7.3%), “Bipolar” (5.3%), “Stress” (4.9%), and “Personality Disorder” (2.0%). This distribution demonstrates the prevalence of conditions such as depression and suicidal ideation in the data.
The sentiment distribution is shown in Fig. 1b: “Positive” sentiment is most prevalent at 40.3%, followed by “Negative” at 37.6% and “Neutral” at 22.1%. This breakdown reflects the emotional tones associated with the mental health statuses and provides crucial context for sentiment analysis in the study of mental health trends.

Distribution of post lengths in the dataset. The histogram displays the frequency of posts according to their word count, highlighting the most common post lengths and providing insight into the dataset’s typical content size.
Figure 2 shows that the majority of posts in the dataset are relatively short, with most containing fewer than 500 words and a significant concentration below 100 words. The sharp decline in the number of posts as the word count increases highlights the dominance of shorter posts in user-generated content. This pattern suggests that users tend to communicate their thoughts in brief formats, which is typical for social media or digital platforms where brevity is often encouraged. Posts exceeding 1,000 words are rare, further reinforcing the preference for concise communication. Understanding this distribution is critical for natural language processing tasks, as it helps us set appropriate tokenization limits, ensuring that most posts are fully captured while minimizing the need for truncation or excessive padding. Additionally, this insight allows us to design models optimized for the typical length of user content, improving efficiency and accuracy in downstream tasks like classification, sentiment analysis, or topic modeling.
Data pre-processing
Data cleaning
Data cleaning involved several important procedures to ensure data quality and integrity33. Duplicate entries were identified and eliminated to avoid redundancy. Special characters, URLs, and other non-alphanumeric components were removed to streamline the text. The text was also converted to lowercase to preserve uniformity and eliminate problems due to case sensitivity. Missing values were handled by imputing suitable values depending on the context or by removing the affected rows. Finally, stopwords (common words that contribute little to a statement's meaning) were removed to sharpen the analysis's focus on the main ideas. Together, these procedures produced a clean and trustworthy dataset for further processing and analysis.
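A minimal sketch of this cleaning pipeline is shown below; the function and file names (clean_text, the CSV path) are illustrative assumptions rather than the authors' exact implementation.

```python
import re
import pandas as pd
from nltk.corpus import stopwords  # assumes the NLTK 'stopwords' corpus is downloaded

STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase the text, strip URLs and special characters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # remove non-alphanumeric characters
    return " ".join(t for t in text.split() if t not in STOPWORDS)

df = pd.read_csv("mental_health.csv")        # hypothetical path to the Kaggle dataset
df = df.drop_duplicates(subset="statement")  # eliminate duplicate entries
df = df.dropna(subset=["statement"])         # remove rows with missing statements
df["statement"] = df["statement"].apply(clean_text)
```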
Lemmatization
Lemmatization, which reduces words to their base or root forms, is an essential stage in text preparation. In contrast to stemming, which frequently truncates words, lemmatization yields legitimate words by considering meaning and context. First, all punctuation was removed and the text was converted to lowercase to maintain uniformity. The text was then tokenized into individual words, and frequent stop words were eliminated. Each surviving word was lemmatized with the WordNetLemmatizer from NLTK34,35, which guarantees that terms like “running” and “runs” are reduced to their root, “run.” This normalization increases the accuracy of the subsequent analytical tasks. The lemmatized tokens were then recombined into a processed statement, ready for the subsequent stages of feature extraction and modeling.
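A short sketch of this step, assuming the NLTK resources ('punkt', 'stopwords', 'wordnet') are available; the pos="v" argument is one simple way to reduce verb forms such as “running” to “run”, though the authors' exact part-of-speech handling is not specified.

```python
import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def lemmatize_statement(text: str) -> str:
    """Lowercase, strip punctuation, drop stop words, and lemmatize the remainder."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in word_tokenize(text) if t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t, pos="v") for t in tokens)

print(lemmatize_statement("He was running and runs daily"))  # -> "run run daily"
```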
Data augmentation with synonym replacement
As part of our data augmentation process, we used a synonym replacement technique to improve the variety and robustness of the dataset36,37. This method identifies words and replaces them with synonyms drawn from the WordNet lexical database38. Using WordNet synsets, the procedure begins by extracting candidate synonyms for each word in a given text. A synonym is then randomly selected from this list, and the replacement is applied up to a predetermined number of times (n = 3 in this case), introducing variation in wording while preserving the textual meaning. The modified texts were then added to the dataset, increasing both its quantity and variety; several augmented copies of each original text were created and appended. Implemented in the synonym_replacement function, this method significantly increases the diversity of the dataset and can help machine learning models trained on it perform better and generalize more broadly. The original labels were retained for the augmented texts, so the enlarged dataset of original and newly generated statements remains consistent for further analysis and training.
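The following is a minimal sketch of such a synonym_replacement function built on WordNet synsets; details such as how replacement candidates are selected are assumptions, since the paper does not give the exact implementation.

```python
import random
from nltk.corpus import wordnet  # assumes the NLTK 'wordnet' corpus is downloaded

def synonym_replacement(text: str, n: int = 3) -> str:
    """Replace up to n words in the text with randomly chosen WordNet synonyms."""
    words = text.split()
    indices = list(range(len(words)))
    random.shuffle(indices)
    replaced = 0
    for i in indices:
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(words[i])
            for lemma in syn.lemmas()
            if lemma.name().lower() != words[i].lower()
        }
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
        if replaced >= n:
            break
    return " ".join(words)

# Augmented copies keep the label of the original statement.
augmented = [(synonym_replacement(s), y) for s, y in zip(df["statement"], df["status"])]
```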
Sentiment analysis with TextBlob
During the sentiment analysis stage, TextBlob was used to assess each statement in the augmented dataset and determine its sentiment polarity39. Based on the polarity score \(p\), where \(p \in [-1, 1]\), TextBlob divides sentiments into three categories: “Positive,” “Neutral,” or “Negative.” The sentiment categorization was performed as follows:
- Positive sentiment: \(p > 0\)
- Neutral sentiment: \(p = 0\)
- Negative sentiment: \(p < 0\)
The function \(\text {analysis}\_\text {sentiment}\_\text {by}\_\text {status}\) performs this categorization. It processes each statement \(s\) and its accompanying status \(\text {status}\), evaluates the sentiment, and yields a tuple \((s, \text {status}, \text {sentiment})\). After the sentiments are aggregated, the number of each sentiment type per status can be determined from a DataFrame called \(\text {sentiment}\_\text {df}\). Specifically, the counts were obtained as
$$\begin{aligned} \text {sentiment}\_\text {counts}_{ij} = \text {Count} \left( \text {sentiment}\_\text {df}[\text {status} = i \text { and } \text {sentiment} = j] \right) \end{aligned}$$
where \(\text {sentiment}\_\text {counts}_{ij}\) denotes the number of statements with status \(i\) and sentiment \(j\). The resulting table, \(\text {sentiment}\_\text {counts}\), describes how sentiments are distributed across the various mental health statuses. Understanding this distribution is necessary for analyzing the emotional tone of statements in connection with reported mental health conditions.
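The procedure can be reconstructed roughly as follows; the body of analysis_sentiment_by_status is an assumption based on the description above, mapping TextBlob's polarity score onto the three categories.

```python
import pandas as pd
from textblob import TextBlob

def analysis_sentiment_by_status(statement: str, status: str):
    """Score a statement with TextBlob and map its polarity p to a sentiment label."""
    p = TextBlob(statement).sentiment.polarity  # p in [-1, 1]
    sentiment = "Positive" if p > 0 else ("Negative" if p < 0 else "Neutral")
    return statement, status, sentiment

rows = [analysis_sentiment_by_status(s, st) for s, st in zip(df["statement"], df["status"])]
sentiment_df = pd.DataFrame(rows, columns=["statement", "status", "sentiment"])

# sentiment_counts[i][j]: number of statements with status i and sentiment j
sentiment_counts = sentiment_df.groupby(["status", "sentiment"]).size().unstack(fill_value=0)
```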
To ensure the robustness of our findings, we also experimented with other sentiment analysis tools, including VADER and Afinn. These tools offer varying approaches to sentiment scoring and were used to reassess the dataset’s sentiment labels. The comparative analysis revealed that while TextBlob provided a balanced distribution of sentiment categories, VADER detected more nuanced polarity shifts, especially in “Neutral” and “Negative” sentiments. A summary of these results and their impact on the final model is presented in the Results section.
Opinions
We identified important opinion-related terms using spaCy's English NLP model to extract and sanitize subjective expressions from the textual input40. Each text was run through the model, and the extraction was refined by removing auxiliary verbs and concentrating on adjectives, adverbs, and verbs likely to convey subjective feeling. For convenience, the detected opinions were joined into a single string. Finally, a cleaning step was applied after extraction: it removed non-alphabetic characters and words that were overly repetitive or insignificant, leaving only the meaningful and distinctive expressions. The end product is a dataset enriched with clean, pertinent opinion phrases that offers a strong basis for further analysis and model training.
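A sketch of this extraction step using spaCy's small English model; the exact cleaning rules (for example, the order-preserving de-duplication shown here) are assumptions based on the description above.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # spaCy's small English model

def extract_opinions(text: str) -> str:
    """Keep adjectives, adverbs, and verbs; auxiliaries are tagged AUX by spaCy and thus excluded."""
    doc = nlp(text)
    words = [tok.lemma_.lower() for tok in doc if tok.pos_ in {"ADJ", "ADV", "VERB"}]
    seen, cleaned = set(), []
    for w in words:
        if w.isalpha() and w not in seen:  # drop non-alphabetic and repeated terms
            seen.add(w)
            cleaned.append(w)
    return " ".join(cleaned)

df["opinions_str"] = df["statement"].apply(extract_opinions)
```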
The average number of opinions for each sentiment type is presented in Table 1. With 36.31 opinions on average, the “Positive” category has the most, closely followed by the “Negative” category with 33.80. The “Neutral” category, by contrast, has a far lower average of just 3.64 opinions per statement. This suggests that users express strong positive or negative sentiments more often than neutral ones, and the imbalance in neutral opinions may indicate a general trend toward polarized sentiment in the dataset.
The opinions most frequently used in the dataset are listed in Table 2. With 81,114 occurrences, the verb “feel” is the most common, suggesting that emotional expressiveness is central to the dataset. The verbs “know” and “want” follow closely, appearing 52,940 and 50,827 times, respectively, indicating that expressions of knowing and wanting are strongly associated with user attitudes. Other frequent words such as “get,” “even,” and “really” denote typical expressions of emphasis or intensity. These frequently expressed opinions highlight key terms that users rely on to communicate their feelings, thoughts, and behaviors, providing insight into the dataset's recurrent themes.
Data tokenization and label encoding
We used a two-step procedure of tokenization and label encoding to prepare the textual input for model training. First, the raw text data were transformed into token IDs and attention masks appropriate for BERT-based models using BertTokenizer9. The tokenize_data function processes the statements in the DataFrame's statement column and generates input_ids and attention_masks, padding and truncating each sequence to a maximum length of 100 tokens. This tokenization ensured that the text data were correctly structured for the model input.
We set the input sequence length to 100 tokens after conducting an exploratory analysis of our dataset. Initially, we examined the distribution of tokenized sentences and observed that the majority of instances contained fewer than 100 tokens. Specifically, our preliminary data analysis showed that approximately 90% of sentences in the dataset had a token count below this threshold, minimizing the amount of unnecessary padding for most samples. Selecting a slightly longer length would have increased computational overhead, while a shorter length risked truncating relevant context. Thus, a padding length of 100 tokens represented a balanced compromise between computational efficiency and preserving important textual information.
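A minimal version of tokenize_data using the Hugging Face tokenizer is sketched below; the bert-base-uncased checkpoint is an assumption, as the paper specifies only the BertTokenizer and the 100-token limit.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

def tokenize_data(statements, max_length=100):
    """Convert raw statements into padded token IDs and attention masks."""
    enc = tokenizer(
        list(statements),
        padding="max_length",   # pad every sequence to exactly max_length
        truncation=True,        # truncate anything longer
        max_length=max_length,
        return_tensors="tf",
    )
    return enc["input_ids"], enc["attention_mask"]

input_ids, attention_masks = tokenize_data(df["statement"])
```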
The next step was label encoding, which converts categorical labels into the numerical values required for model training41. We initialized label encoders for the status, emotion, and opinions_str columns. These encoders map distinct class labels to integers, enabling the model to handle categorical data efficiently. The opinions labels were kept as integer encodings, whereas the status and emotion labels were transformed into one-hot encoded vectors using the to_categorical function. This thorough preparation ensures that both categorical and textual data are correctly prepared and incorporated into the training pipeline. The encoded labels were added to a new DataFrame, which simplifies data management and model training.
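A sketch of this encoding step with scikit-learn and Keras, assuming the column names given above:

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

status_encoder, emotion_encoder, opinion_encoder = LabelEncoder(), LabelEncoder(), LabelEncoder()

# One-hot targets for the status and emotion (sentiment) heads
status_labels = to_categorical(status_encoder.fit_transform(df["status"]))
sentiment_labels = to_categorical(emotion_encoder.fit_transform(df["emotion"]))

# Opinion labels remain integer-encoded
opinions_labels = opinion_encoder.fit_transform(df["opinions_str"])
```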
Data splitting
The dataset was systematically divided into training, validation, and test sets to facilitate model training and evaluation. For compatibility with scikit-learn's train_test_split function, the input tensors (input_ids and attention_masks) and encoded labels (status_labels, sentiment_labels, and opinions_labels) were first converted from TensorFlow tensors to NumPy arrays.
Initially, the dataset was split into a training set and a temporary set, with 70% going toward training and 30% set aside for validation and testing. The temporary set was then divided evenly into validation and test sets, each comprising 15% of the original data. This method supports solid model training and objective assessment by distributing the data evenly. The resulting dataset sizes were 110,365 samples for training, 23,650 for validation, and 23,650 for testing. These splits are essential for evaluating model performance, fine-tuning hyperparameters, and ensuring that the model generalizes to new data.
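The two-stage split can be sketched as follows (only the status labels are shown here; the sentiment and opinion labels are split in the same call, and the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# Stage 1: 70% training, 30% temporary (validation + test)
(ids_train, ids_temp,
 masks_train, masks_temp,
 status_train, status_temp) = train_test_split(
    input_ids.numpy(), attention_masks.numpy(), status_labels,
    test_size=0.30, random_state=42)

# Stage 2: split the temporary set evenly into validation and test (15% each)
(ids_val, ids_test,
 masks_val, masks_test,
 status_val, status_test) = train_test_split(
    ids_temp, masks_temp, status_temp,
    test_size=0.50, random_state=42)
```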