Introduction
So, what exactly is sentiment? Sentiment relates to the meaning of a word or sequence of words and is usually associated with an opinion or emotion. And analysis? Well, this is the process of looking at data and making inferences; in this case, using machine learning to learn and predict whether a movie review is positive or negative.
Objective: Determine Review Polarity
Given a review, our main objective is to determine if the review is positive or negative. We can do this by using two approaches.
Data Information
Exploratory Data Analysis
Overview of the data:
The training data consists of 156,060 rows and 4 features.
Data Fields:
- PhraseId
- SentenceId
- Phrase
- Sentiment
First, let us import the necessary libraries.
- Next, we load the data from CSV files into a pandas dataframe.
- Let’s print the total number of null values in each column.
We can see there are no null values in the data.
- Let’s check for duplicate rows.
There are no duplicate rows in the data.
- Let’s check the number of reviews corresponding to each rating; the EDA steps above are sketched in the snippet below.
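Here is a minimal sketch of these EDA steps. It assumes the training file is the tab-separated train.tsv from the Kaggle competition; adjust the file name and separator if your copy is a plain CSV.

```python
import pandas as pd

# Load the data into a pandas dataframe
train = pd.read_csv('train.tsv', sep='\t')
print(train.shape)                 # expected: (156060, 4)

# Total number of null values in each column
print(train.isnull().sum())

# Number of duplicate rows
print(train.duplicated().sum())

# Number of reviews corresponding to each rating (0-4)
print(train['Sentiment'].value_counts())
```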
1. Naive Way
The naive way of doing this is to categorise all reviews with ratings 3 and 4 as positive, and all reviews with ratings 0 and 1 as negative. We will ignore all reviews where the rating is 2 because, intuitively, 2 is neither positive nor negative. It’s a neutral review.
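Below is a small, illustrative sketch of this naive labelling, assuming the train dataframe from the EDA step; the naive_polarity helper and the Polarity column are just names chosen for illustration.

```python
def naive_polarity(rating):
    """Map the 0-4 sentiment rating to a binary polarity label."""
    if rating in (3, 4):
        return 'positive'
    if rating in (0, 1):
        return 'negative'
    return None  # rating 2 is neutral and will be dropped

labelled = train.copy()
labelled['Polarity'] = labelled['Sentiment'].apply(naive_polarity)
labelled = labelled.dropna(subset=['Polarity'])   # discard neutral (rating 2) reviews
print(labelled['Polarity'].value_counts())
```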
2. Using the review text data and performing Natural Language Processing (NLP) tasks
First, we need to perform some data cleaning, then text preprocessing, and finally convert the texts into vectors so that we can train models on those vectors and predict the polarity of a review.
1. Data Cleaning
(i) Data Deduplication
2. Text Preprocessing
[1] HTML Tag Removal
[2] Punctuations Removal
[3] Removal of words with numbers
[4] Expand the most common English contractions
[5] Stopwords
Stop words usually refers to the most common words in a language, which are generally filtered out before or after processing natural language data. Sometimes stop word removal is skipped in order to support phrase search.
[6] Stemming
Porter Stemmer: The most commonly used stemmer without a doubt, and also one of the most gentle stemmers, though it is the most computationally intensive of these algorithms. It is also the oldest stemming algorithm by a large margin.
Snowball Stemmer (Porter2): Nearly universally regarded as an improvement over Porter, and for good reason; Porter himself admits that it is better than his original algorithm. It has a slightly faster computation time than Porter, with a fairly large community around it. A sketch of the full preprocessing pipeline is shown below.
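Here is a sketch of the preprocessing steps described above, assuming NLTK's stopword list has been downloaded (nltk.download('stopwords')); the contraction map is only a tiny illustrative subset.

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
contractions = {"won't": "will not", "can't": "can not", "n't": " not"}  # illustrative subset

def clean_text(review):
    review = BeautifulSoup(review, 'html.parser').get_text()       # [1] remove HTML tags
    for pattern, repl in contractions.items():                      # [4] expand contractions
        review = review.replace(pattern, repl)
    review = re.sub(r'\S*\d\S*', ' ', review)                       # [3] drop words with numbers
    review = re.sub(r'[^A-Za-z]+', ' ', review)                     # [2] remove punctuation
    words = [stemmer.stem(w) for w in review.lower().split()
             if w not in stop_words]                                 # [5] stopwords + [6] stemming
    return ' '.join(words)

print(clean_text("I wouldn't say it's the <b>worst</b> film of 2006!"))
```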
Preprocessing output for one review
Positive and Negative words in reviews
Word Cloud of Whole Dataset
Featurization
BAG OF WORDS
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents. Suppose we have N reviews in our dataset and we want to convert the words in our reviews to vectors. We can use BoW as a method to do this. For each unique word in the data corpus, it creates a dimension; it then counts how many times a word occurs in a review and places that count under the corresponding word dimension for that review. We get a sparse matrix representation for all the words in the reviews.
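A minimal BoW sketch using scikit-learn's CountVectorizer, assuming the cleaned reviews live in a 'CleanedText' column of the labelled dataframe from earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
bow_matrix = count_vect.fit_transform(labelled['CleanedText'].values)
print(bow_matrix.shape)    # (number of reviews, size of vocabulary)
print(type(bow_matrix))    # scipy sparse matrix, as described above
# CountVectorizer(ngram_range=(1, 2)) would add bi-grams to the vocabulary as well
```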
Bi-Grams and n-Grams
TF-IDF
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Let’s assume we have data corpus D, which contains N reviews {r1,r2,r3,r4…rN}. Let’s say our review r1 contains the following words {w1,w2,w3,w1,w9,w6,w7,w9,w9}.
TF or Term Frequency for a word is the number of times the word occurs in a review divided by the total number of words present in that same review. For example, in the review considered above, the TF for word w1 is 2/9 and for word w9 is 3/9 = 1/3. Intuitively, the higher the occurrence of a word in a text, the greater its TF value. TF values lie between 0 and 1.
IDF or Inverse Document Frequency for a word is given by the formula log(N/n), where ’N’ is the total number of reviews in the corpus ‘D’ and ’n’ is the number of reviews in ‘D’ which contain that specific word. Intuitively, IDF will be higher for words which occur rarely and lower for words which occur more frequently. IDF values are always greater than or equal to 0.
So for each word in each review we take the product TF × IDF, and represent each review as a d-dimensional vector of these values, where d is the size of the vocabulary.
TF-IDF doesn’t consider the semantic meaning of words. What it does is give more importance to words which occur rarely in the whole data corpus, while also giving weight to the words that occur most frequently within each review.
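A minimal TF-IDF sketch along the same lines, again assuming the 'CleanedText' column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf_matrix = tfidf_vect.fit_transform(labelled['CleanedText'].values)
print(tfidf_matrix.shape)
# Note: scikit-learn uses a smoothed IDF, ln((1 + N) / (1 + n)) + 1,
# rather than the plain log(N/n) above, but the intuition is the same.
```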
Avg W2V
In this model we convert each word present in a review to a vector. For each sentence we compute the average Word2Vec representation. Let’s look at the demo example below.
Suppose we have N words in a sentence {w1,w2,w3,w4,w5,w6 … , wN}. We will convert each word to a vector, sum them up and divide by the total number of words (N) present in that particular sentence. So our final vector will look like (1/N) * [word2vec(w1) + word2vec(w2) + word2vec(w3) …. + word2vec(wN)]
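Here is a sketch of average Word2Vec using gensim (parameter names follow gensim 4.x), assuming list_of_sentences holds each cleaned review as a list of words:

```python
import numpy as np
from gensim.models import Word2Vec

# Train a Word2Vec model on the tokenised cleaned reviews
w2v_model = Word2Vec(sentences=list_of_sentences, vector_size=100, min_count=5, workers=4)

def avg_word2vec(sentence, model, dim=100):
    """Average the Word2Vec vectors of all in-vocabulary words in a sentence."""
    vec, count = np.zeros(dim), 0
    for word in sentence:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    return vec / count if count else vec

avg_w2v_vectors = np.array([avg_word2vec(s, w2v_model) for s in list_of_sentences])
```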
TFIDF weighted W2V
In this model we again convert each word present in a review to a vector. For each sentence we compute the TF-IDF weighted average Word2Vec representation. Let’s look at the demo example below.
Suppose we have N words in a sentence {w1,w2,w3,w4,w5,w6 … , wN}. We will compute the tf-idf for each word in a review, for all reviews. Let’s say the corresponding tf-idf values are {t1,t2,t3,t4,t5,t6……tN}. We will convert each word to a vector, multiply each vector by that word’s tf-idf value, sum them up, and divide by the sum of the tf-idf values of all words present in that particular sentence. So our final vector will look like [1/(t1+t2+t3+ ….. +tN)] * [t1*word2vec(w1) + t2*word2vec(w2) + …. + tN*word2vec(wN)]
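A sketch of the TF-IDF weighted variant, reusing the fitted tfidf_vect, the w2v_model and list_of_sentences from the previous snippets (requires scikit-learn 1.0+ for get_feature_names_out):

```python
import numpy as np

# Map each vocabulary word to its IDF value from the fitted vectorizer
idf_by_word = dict(zip(tfidf_vect.get_feature_names_out(), tfidf_vect.idf_))

def tfidf_weighted_w2v(sentence, model, dim=100):
    """Weight each word vector by its tf-idf value and normalise by the weight sum."""
    vec, weight_sum = np.zeros(dim), 0.0
    counts = {w: sentence.count(w) for w in set(sentence)}
    for word in sentence:
        if word in model.wv and word in idf_by_word:
            tfidf = (counts[word] / len(sentence)) * idf_by_word[word]  # tf * idf
            vec += tfidf * model.wv[word]
            weight_sum += tfidf
    return vec / weight_sum if weight_sum else vec

tfidf_w2v_vectors = np.array([tfidf_weighted_w2v(s, w2v_model) for s in list_of_sentences])
```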
Training different Models
1. k-Nearest Neighbors
In this code block:
- We define a function which separates the positive and the negative data points for any input dataset using their corresponding class labels, using the KNN algorithm.
- We split the input dataset into a train set and a test set. For the training set I have taken the oldest 80% of the data; for the test set I have taken the most recent 20%. The idea here is to see how the model behaves when it’s tested on ‘new unseen’ data after being trained on older data.
- We use cross validation to determine the optimal value of K, and use this value of K as the number of nearest neighbours to train the final model.
- Finally, we will use accuracy as a metric to evaluate this model’s performance on unseen data. A sketch of these steps is shown below.
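A hedged sketch of this workflow on the BoW features, with the time-ordered 80/20 split; variable names carry over from the featurization snippets above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

X = bow_matrix
y = (labelled['Polarity'] == 'positive').astype(int).values  # 1 = positive, 0 = negative

split = int(0.8 * X.shape[0])                  # first (older) 80% for training
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Cross-validate over odd values of k to find the best number of neighbours
k_values = list(range(1, 30, 2))
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5, scoring='accuracy').mean()
             for k in k_values]

best_k = k_values[int(np.argmax(cv_scores))]
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print('best k =', best_k,
      'test accuracy =', accuracy_score(y_test, final_knn.predict(X_test)))
```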
KNN on the Bag of Words model created using ‘CleanedText’
KNN on the TF-IDF model created using ‘CleanedText’ texts
KNN on the Average Word2Vec using a 100 dimensional vector representation of each word
KNN on the TF-IDF weighted Average Word2Vec representation on the reviews
Conclusion:
2. Naive Bayes
In this code block:
- We define a function which is used to perform column standardization on any given input matrix.
- We define a function which is used to get the top 50 features from both the negative and the positive review classes.
- We define a function which is used to measure the various performance metrics for a given model. We will use accuracy as a metric to evaluate this model’s performance on unseen data.
- We define a function which is used to obtain the optimal value of alpha along with the best model estimator, using time series cross validation together with grid search CV.
- We define a function which is used to plot and visually represent the errors vs hyperparameter plot.
- We fit the naive Bayes classifier to our training data and build the final model; a sketch of the tuning and fitting steps is shown below.
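A sketch of the alpha tuning and final fit, assuming the BoW train/test split from the KNN sketch:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import accuracy_score

param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(MultinomialNB(),
                    param_grid,
                    cv=TimeSeriesSplit(n_splits=5),
                    scoring='accuracy')
grid.fit(X_train, y_train)

best_nb = grid.best_estimator_
print('best alpha =', grid.best_params_['alpha'])
print('test accuracy =', accuracy_score(y_test, best_nb.predict(X_test)))

# The top 50 features per class can be read off best_nb.feature_log_prob_
# together with the vectorizer's vocabulary (count_vect.get_feature_names_out()).
```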
Naive Bayes on the Bag of Words model created using ‘CleanedText’
Naive Bayes on the TF-IDF model created using ‘CleanedText’ texts
Conclusion:
3. Logistic Regression
Logistic Regression on BOW
Applying Logistic Regression with L1 regularization on BOW
Applying Logistic Regression with L2 regularization on BOW
Logistic Regression on TFIDF
Applying Logistic Regression with L1 regularization on TFIDF
Applying Logistic Regression with L2 regularization on TFIDF
Logistic Regression on AVG W2V
Applying Logistic Regression with L1 regularization on AVG W2V
Applying Logistic Regression with L2 regularization on AVG W2V
Logistic Regression on TFIDF W2V
Applying Logistic Regression with L1 regularization on TFIDF W2V
Applying Logistic Regression with L2 regularization on TFIDF W2V
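The following is a representative sketch for one of the combinations above (Logistic Regression with L1 and L2 regularization on BoW), tuning the inverse regularization strength C with grid search; it is not the exact code used for the reported results, and it reuses X_train/y_train from the earlier snippets.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

for penalty in ('l1', 'l2'):
    # liblinear supports both L1 and L2 penalties
    lr = LogisticRegression(penalty=penalty, solver='liblinear', max_iter=1000)
    grid = GridSearchCV(lr, {'C': [0.001, 0.01, 0.1, 1, 10, 100]},
                        cv=5, scoring='roc_auc')
    grid.fit(X_train, y_train)
    print(penalty, 'best C =', grid.best_params_['C'],
          'CV AUC =', round(grid.best_score_, 3))
```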
Conclusions
4. SVM
Applying Linear SVM on BOW + L1 Regularization
Applying Linear SVM on BOW + L2 Regularization
Applying Linear SVM on TFIDF + L1 Regularization
Applying Linear SVM on TFIDF + L2 Regularization
Applying Linear SVM on AVG W2V + L1 Regularization
Applying Linear SVM on AVG W2V + L2 Regularization
Applying Linear SVM on TFIDF W2V + L1 Regularization
Applying Linear SVM on TFIDF W2V + L2 Regularization
Conclusions
5. Decision Trees
Applying Decision Trees on BOW
Applying Decision Trees on TFIDF
Applying Decision Trees on AVG W2V
Applying Decision Trees on TFIDF W2V
Conclusions:
6. Random Forest
Applying Random Forests on BOW
Applying Random Forests on TFIDF
Applying Random Forests on AVG W2V
Applying Random Forests on TFIDF W2V
Conclusions:
Result:
Logistic Regression performed well compared to the other models.
Model Deployment
Often people ignore this step, but it is one of the most important steps in the data science life cycle. I am going to deploy the model using Flask and an AWS EC2 instance.
Prerequisites :
- Knowledge about AWS EC2 instance
- Python Flask
- HTML, CSS, JavaScript
- Linux commands like ssh, scp
On the AWS side, one first needs to create a free-tier EC2 instance running Ubuntu Server and launch it. Make sure the security group has an inbound rule allowing “All Traffic” on “All Ports”. Your app needs to have low latency, so avoid reloading saved weights inside the “final_prediction” function: read all the files outside the function, otherwise it will cause undesirable behaviour.
After the EC2 instance has been created successfully, connect to it using the following command:
ssh -i "deployment.pem" ubuntu@ec2-3-21-156-107.us-east-2.compute.amazonaws.com
Copy the deployment_new folder to the instance using the following command:
scp -i "deployment.pem" -r deployment_new ubuntu@ec2-3-21-156-107.us-east-2.compute.amazonaws.com:~/deployment_new
This EC2 instance might not have all the libraries required to run the app.py file. Install the required libraries using pip3 install <library name>. Finally, to run the app, use the command below:
nohup python3 app.py &
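For reference, here is a minimal, illustrative app.py sketch. The pickle file names, the form field and the routes are assumptions based on the description above; the key point is that the model and vectorizer are loaded once at module level rather than inside the prediction function.

```python
from flask import Flask, request, render_template
import joblib

app = Flask(__name__)

# Load saved weights OUTSIDE the prediction function to keep per-request latency low
vectorizer = joblib.load('tfidf_vectorizer.pkl')   # assumed file name
model = joblib.load('logistic_regression.pkl')     # assumed file name

@app.route('/index')
def index():
    return render_template('index.html')

@app.route('/final_prediction', methods=['POST'])
def final_prediction():
    review = request.form['review_text']            # assumed form field name
    vector = vectorizer.transform([review])
    label = 'Positive' if model.predict(vector)[0] == 1 else 'Negative'
    return render_template('index.html', prediction=label)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```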
Here is the link to my app, you can try it out: http://ec2-18-222-1-146.us-east-2.compute.amazonaws.com:8080/index
Conclusion
After tuning many models, we were able to achieve a highest AUC of 0.82. I hope this case study helps you understand sentiment analysis better.
Future Work
In this blog I focused on classical machine learning algorithms. Deep learning approaches like Transformers and BERT could also be applied.
Source Code
https://github.com/INZA111/Sentiment-Analysis-on-Movie-Reviews