Sentiment Analysis Model

View the Project on GitHub

Sentiment_Analysis_Mental_Health

Project Overview

Built an ML model to classify mental health status from text data using NLTK and XGBoost.
Preprocessed the text, engineered features, and applied TF-IDF as a bag-of-words representation.
After tuning, saved the model and tested it, reaching 82% accuracy.

Exploratory Data Analysis

Analyzed the distribution of mental health statuses in the dataset
Visualized word clouds for different mental health statuses
Analyzed other important features such as text length and average number of words
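
As a rough illustration of the checks listed above, the sketch below counts texts per status and computes the average character and word length per status. It uses only the standard library; the toy rows, the status names ("Normal", "Anxiety", "Depression") and the function names are hypothetical stand-ins, since the real dataset and notebook code are not shown here.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy rows of (text, status); not taken from the real dataset.
rows = [
    ("I feel fine today", "Normal"),
    ("I can not sleep and I worry all night", "Anxiety"),
    ("Nothing matters anymore", "Depression"),
    ("Had a good walk", "Normal"),
]


def class_distribution(rows):
    """Count how many texts fall under each status label."""
    return Counter(status for _, status in rows)


def length_stats(rows):
    """Return the average character length and word count per status."""
    by_status = defaultdict(list)
    for text, status in rows:
        by_status[status].append(text)
    return {
        status: {
            "avg_chars": mean(len(t) for t in texts),
            "avg_words": mean(len(t.split()) for t in texts),
        }
        for status, texts in by_status.items()
    }


distribution = class_distribution(rows)
stats = length_stats(rows)
```

A skewed `distribution` is what would motivate resampling the training data later on.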

See the full code HERE

Data Preprocessing

Encoded the labels and split the data into training and test sets
Resampled the training data to address class imbalance
Preprocessed the text data by converting to lowercase, removing non-alphanumeric characters, and applying stemming
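
The cleaning and resampling steps above can be sketched roughly as follows. This is a standard-library approximation, not the project's actual code: `naive_stem` is a crude placeholder for a real stemmer, and random oversampling is only one possible resampling strategy (the project does not say which one it used). All function names are hypothetical.

```python
import random
import re
from collections import defaultdict


def naive_stem(word):
    """Crude suffix stripper; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text, stem=naive_stem):
    """Lowercase, replace non-alphanumeric characters with spaces, stem tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(stem(token) for token in text.split())


def oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(group) for group in by_label.values())
    out_texts, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra samples (with replacement) only for under-represented classes.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for text in group + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels
```

With NLTK installed, a real stemmer can be plugged in, for example `clean_text(text, stem=PorterStemmer().stem)` using `PorterStemmer` from `nltk.stem`. Resampling should be applied to the training split only, so that the test set keeps the original class balance.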

See the full code HERE

Model Training

Used TF-IDF vectorization to convert the text data into numerical features
Performed feature selection using SelectKBest and chi-square
Trained an XGBoost classifier on the selected features
Evaluated model performance with a classification report, obtaining 82% accuracy
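
To make the TF-IDF step more concrete, here is a minimal pure-Python sketch of what such a vectorizer computes. The weighting used (raw term count times `log(N / df)`) is an assumption chosen for readability; library vectorizers typically apply smoothed and normalized variants, so the numbers will differ from the project's. Feature selection and the XGBoost classifier are not shown.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: term count * log(N / df), where N is the number of
    documents and df the number of documents containing the term.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}
        )
    return vectors


# Hypothetical, already-cleaned example texts.
docs = ["i feel sad", "i feel fine", "i am anxious"]
vectors = tfidf(docs)
```

Terms that occur in every document (here `"i"`) get a weight of zero, while terms specific to one document (here `"sad"`) get the largest weight, which is why TF-IDF features tend to be more discriminative than raw counts.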

See the full code HERE

Model Testing

Loaded the trained XGBoost model, TF-IDF vectorizer, and feature selector
Implemented a function to predict the mental health status of new text inputs
Tested the model on sample inputs and printed the predicted labels
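
A prediction helper along these lines might look like the sketch below. It assumes the loaded vectorizer and feature selector expose `transform()` and the model exposes `predict()`, as scikit-learn style objects do; the function name, the `label_names` list, and the default `preprocess` step are hypothetical.

```python
def predict_status(text, vectorizer, selector, model, label_names, preprocess=str.lower):
    """Run one raw text through the saved pipeline and return a readable label.

    vectorizer and selector must provide transform(); model must provide
    predict(). label_names maps an encoded class index back to its label.
    """
    cleaned = preprocess(text)                   # same cleaning as at training time
    features = vectorizer.transform([cleaned])   # text -> TF-IDF features
    selected = selector.transform(features)      # keep only the selected features
    class_index = int(model.predict(selected)[0])
    return label_names[class_index]
```

The `preprocess` argument should be the same cleaning function that was used before training; otherwise the vectorizer will see tokens it was never fitted on.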

See the full code HERE

Conclusions

We reached high accuracy (82%) despite using classical ML and TF-IDF
We could try several other approaches to improve performance, such as LSTMs, Transformers, and embeddings