Introduction
So, what exactly is sentiment? Sentiment relates to the meaning of a word or sequence of words and is usually associated with an opinion or emotion. And analysis? Well, this is the process of looking at data and making inferences; in this case, using machine learning to learn and predict whether a movie review is positive or negative.
Objective: Determine Review Polarity
Given a review, our main objective is to determine if the review is positive or negative. We can do this by using two approaches.
Data Information
Exploratory Data Analysis
Overview of the data:
The training data consists of 156,060 rows and 4 features.
Data Fields:
- PhraseId
- SentenceId
- Phrase
- Sentiment
First, let us import the necessary libraries.
- Next, we load the data from CSV files into a pandas dataframe.
- Let’s print the total number of null values in each column.
We can see there are no null values in the data.
- Let’s check for duplicate rows.
There are no duplicate rows in the data.
- Let’s check the number of reviews corresponding to each rating; the EDA steps above are sketched in the snippet below.
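Here is a minimal sketch of these EDA steps. It assumes the training file is the tab-separated train.tsv from the Kaggle competition; adjust the file name and separator if your copy is a plain CSV.

```python
import pandas as pd

# Load the data into a pandas dataframe
train = pd.read_csv('train.tsv', sep='\t')
print(train.shape)                 # expected: (156060, 4)

# Total number of null values in each column
print(train.isnull().sum())

# Number of duplicate rows
print(train.duplicated().sum())

# Number of reviews corresponding to each rating (0-4)
print(train['Sentiment'].value_counts())
```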
1. Naive Way
The naive way of doing this is to categorise all reviews with ratings 3 and 4 as positive, and all reviews with ratings 0 and 1 as negative. We will ignore all reviews where the rating is 2 because, intuitively, 2 is neither positive nor negative. It’s a neutral review.
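Below is a small, illustrative sketch of this naive labelling, assuming the train dataframe from the EDA step; the naive_polarity helper and the Polarity column are just names chosen for illustration.

```python
def naive_polarity(rating):
    """Map the 0-4 sentiment rating to a binary polarity label."""
    if rating in (3, 4):
        return 'positive'
    if rating in (0, 1):
        return 'negative'
    return None  # rating 2 is neutral and will be dropped

labelled = train.copy()
labelled['Polarity'] = labelled['Sentiment'].apply(naive_polarity)
labelled = labelled.dropna(subset=['Polarity'])   # discard neutral (rating 2) reviews
print(labelled['Polarity'].value_counts())
```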
2. Using the review text data and performing Natural Language Processing (NLP) tasks
First, we need to perform some data cleaning, then text preprocessing, and finally convert the texts into vectors so that we can train models on those vectors and predict the polarity of a review.
1. Data Cleaning
(i) Data Deduplication
2. Text Preprocessing
[1] HTML Tag Removal
[2] Punctuations Removal
[3] Removal of words with numbers
[4] Expand the most common English contractions
[5] Stopwords
Stop words usually refers to the most common words in a language, which are generally filtered out before or after processing natural language data. Sometimes stop word removal is skipped in order to support phrase search.
[6] Stemming
Porter Stemmer: The most commonly used stemmer without a doubt, and also one of the most gentle stemmers, though it is the most computationally intensive of these algorithms. It is also the oldest stemming algorithm by a large margin.
Snowball Stemmer (Porter2): Nearly universally regarded as an improvement over Porter, and for good reason; Porter himself admits that it is better than his original algorithm. It has a slightly faster computation time than Porter, with a fairly large community around it. A sketch of the full preprocessing pipeline is shown below.
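Here is a sketch of the preprocessing steps described above, assuming NLTK's stopword list has been downloaded (nltk.download('stopwords')); the contraction map is only a tiny illustrative subset.

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
contractions = {"won't": "will not", "can't": "can not", "n't": " not"}  # illustrative subset

def clean_text(review):
    review = BeautifulSoup(review, 'html.parser').get_text()       # [1] remove HTML tags
    for pattern, repl in contractions.items():                      # [4] expand contractions
        review = review.replace(pattern, repl)
    review = re.sub(r'\S*\d\S*', ' ', review)                       # [3] drop words with numbers
    review = re.sub(r'[^A-Za-z]+', ' ', review)                     # [2] remove punctuation
    words = [stemmer.stem(w) for w in review.lower().split()
             if w not in stop_words]                                 # [5] stopwords + [6] stemming
    return ' '.join(words)

print(clean_text("I wouldn't say it's the <b>worst</b> film of 2006!"))
```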
Preprocessing output for one review
Positive and Negative words in reviews
Word Cloud of Whole Dataset
Featurization
BAG OF WORDS
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents. Suppose we have N reviews in our dataset and we want to convert the words in our reviews to vectors. We can use BoW as a method to do this. For each unique word in the data corpus, it creates a dimension; it then counts how many times a word occurs in a review and places that count under the corresponding word dimension for that review. We get a sparse matrix representation for all the words in the reviews.
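A minimal BoW sketch using scikit-learn's CountVectorizer, assuming the cleaned reviews live in a 'CleanedText' column of the labelled dataframe from earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
bow_matrix = count_vect.fit_transform(labelled['CleanedText'].values)
print(bow_matrix.shape)    # (number of reviews, size of vocabulary)
print(type(bow_matrix))    # scipy sparse matrix, as described above
# CountVectorizer(ngram_range=(1, 2)) would add bi-grams to the vocabulary as well
```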
Bi-Grams and n-Grams
TF-IDF
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Let’s assume we have data corpus D, which contains N reviews {r1,r2,r3,r4…rN}. Let’s say our review r1 contains the following words {w1,w2,w3,w1,w9,w6,w7,w9,w9}.
TF or Term Frequency for a word is the number of times the word occurs in a review divided by the total number of words present in that same review. For example, in the review considered above, the TF for word w1 is 2/9 and for word w9 is 3/9 = 1/3. Intuitively, the higher the occurrence of a word in a text, the greater its TF value. TF values lie between 0 and 1.
IDF or Inverse Document Frequency for a word is given by the formula log(N/n), where ’N’ is the total number of reviews in the corpus ‘D’ and ’n’ is the number of reviews in ‘D’ which contain that specific word. Intuitively, IDF will be higher for words which occur rarely and lower for words which occur more frequently. IDF values are always greater than or equal to 0.
So for each word in each review we take the product TF × IDF, and represent each review as a d-dimensional vector of these values, where d is the size of the vocabulary.
TF-IDF doesn’t consider the semantic meaning of words. What it does is give more importance to words which occur rarely in the whole data corpus, while also giving weight to the words that occur most frequently within each review.
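A minimal TF-IDF sketch along the same lines, again assuming the 'CleanedText' column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
tfidf_matrix = tfidf_vect.fit_transform(labelled['CleanedText'].values)
print(tfidf_matrix.shape)
# Note: scikit-learn uses a smoothed IDF, ln((1 + N) / (1 + n)) + 1,
# rather than the plain log(N/n) above, but the intuition is the same.
```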
Avg W2V
In this model we convert each word present in a review to a vector. For each sentence we compute the average Word2Vec representation. Let’s look at the demo example below.
Suppose we have N words in a sentence {w1,w2,w3,w4,w5,w6 … , wN}. We will convert each word to a vector, sum them up and divide by the total number of words (N) present in that particular sentence. So our final vector will look like (1/N) * [word2vec(w1) + word2vec(w2) + word2vec(w3) …. + word2vec(wN)]
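Here is a sketch of average Word2Vec using gensim (parameter names follow gensim 4.x), assuming list_of_sentences holds each cleaned review as a list of words:

```python
import numpy as np
from gensim.models import Word2Vec

# Train a Word2Vec model on the tokenised cleaned reviews
w2v_model = Word2Vec(sentences=list_of_sentences, vector_size=100, min_count=5, workers=4)

def avg_word2vec(sentence, model, dim=100):
    """Average the Word2Vec vectors of all in-vocabulary words in a sentence."""
    vec, count = np.zeros(dim), 0
    for word in sentence:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    return vec / count if count else vec

avg_w2v_vectors = np.array([avg_word2vec(s, w2v_model) for s in list_of_sentences])
```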
TFIDF weighted W2V
In this model we again convert each word present in a review to a vector. For each sentence we compute the TF-IDF weighted average Word2Vec representation. Let’s look at the demo example below.
Suppose we have N words in a sentence {w1,w2,w3,w4,w5,w6 … , wN}. We will compute the tf-idf for each word in a review, for all reviews. Let’s say the corresponding tf-idf values are {t1,t2,t3,t4,t5,t6……tN}. We will convert each word to a vector, multiply each vector by that word’s tf-idf value, sum them up, and divide by the sum of the tf-idf values of all words present in that particular sentence. So our final vector will look like [1/(t1+t2+t3+ ….. +tN)] * [t1*word2vec(w1) + t2*word2vec(w2) + …. + tN*word2vec(wN)]
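A sketch of the TF-IDF weighted variant, reusing the fitted tfidf_vect, the w2v_model and list_of_sentences from the previous snippets (requires scikit-learn 1.0+ for get_feature_names_out):

```python
import numpy as np

# Map each vocabulary word to its IDF value from the fitted vectorizer
idf_by_word = dict(zip(tfidf_vect.get_feature_names_out(), tfidf_vect.idf_))

def tfidf_weighted_w2v(sentence, model, dim=100):
    """Weight each word vector by its tf-idf value and normalise by the weight sum."""
    vec, weight_sum = np.zeros(dim), 0.0
    counts = {w: sentence.count(w) for w in set(sentence)}
    for word in sentence:
        if word in model.wv and word in idf_by_word:
            tfidf = (counts[word] / len(sentence)) * idf_by_word[word]  # tf * idf
            vec += tfidf * model.wv[word]
            weight_sum += tfidf
    return vec / weight_sum if weight_sum else vec

tfidf_w2v_vectors = np.array([tfidf_weighted_w2v(s, w2v_model) for s in list_of_sentences])
```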
Training different Models
1. k-Nearest Neighbors
In this code block:
- We define a function which separates the positive and the negative data points for any input dataset using their corresponding class labels, using the KNN algorithm.
- We split the input dataset into a train set and a test set. For the training set I have taken the oldest 80% of the data; for the test set I have taken the most recent 20%. The idea here is to see how the model behaves when it’s tested on ‘new unseen’ data after being trained on older data.
- We use cross validation to determine the optimal value of K, and use this value of K as the number of nearest neighbours to train the final model.
- Finally, we will use accuracy as a metric to evaluate this model’s performance on unseen data. A sketch of these steps is shown below.
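A hedged sketch of this workflow on the BoW features, with the time-ordered 80/20 split; variable names carry over from the featurization snippets above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

X = bow_matrix
y = (labelled['Polarity'] == 'positive').astype(int).values  # 1 = positive, 0 = negative

split = int(0.8 * X.shape[0])                  # first (older) 80% for training
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Cross-validate over odd values of k to find the best number of neighbours
k_values = list(range(1, 30, 2))
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5, scoring='accuracy').mean()
             for k in k_values]

best_k = k_values[int(np.argmax(cv_scores))]
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print('best k =', best_k,
      'test accuracy =', accuracy_score(y_test, final_knn.predict(X_test)))
```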
KNN on the Bag of Words model created using ‘CleanedText’
KNN on the TF-IDF model created using ‘CleanedText’ texts
KNN on the Average Word2Vec using a 100 dimensional vector representation of each word
KNN on the TF-IDF weighted Average Word2Vec representation on the reviews
Conclusion:
2. Naive Bayes
In this code block:
- We define a function which is used to perform column standardization on any given input matrix.
- We define a function which is used to get the top 50 features from both the negative and the positive review classes.
- We define a function which is used to measure the various performance metrics for a given model. We will use accuracy as a metric to evaluate this model’s performance on unseen data.
- We define a function which is used to obtain the optimal value of alpha along with the best model estimator, using time series cross validation together with grid search CV.
- We define a function which is used to plot and visually represent the errors vs hyperparameter plot.
- We fit the naive Bayes classifier to our training data and build the final model; a sketch of the tuning and fitting steps is shown below.
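A sketch of the alpha tuning and final fit, assuming the BoW train/test split from the KNN sketch:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import accuracy_score

param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(MultinomialNB(),
                    param_grid,
                    cv=TimeSeriesSplit(n_splits=5),
                    scoring='accuracy')
grid.fit(X_train, y_train)

best_nb = grid.best_estimator_
print('best alpha =', grid.best_params_['alpha'])
print('test accuracy =', accuracy_score(y_test, best_nb.predict(X_test)))

# The top 50 features per class can be read off best_nb.feature_log_prob_
# together with the vectorizer's vocabulary (count_vect.get_feature_names_out()).
```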
Naive Bayes on the Bag of Words model created using ‘CleanedText’
Naive Bayes on the TF-IDF model created using ‘CleanedText’ texts
Conclusion:
3. Logistic Regression
Logistic Regression on BOW
Applying Logistic Regression with L1 regularization on BOW
Applying Logistic Regression with L2 regularization on BOW
Logistic Regression on TFIDF
Applying Logistic Regression with L1 regularization on TFIDF
Applying Logistic Regression with L2 regularization on TFIDF
Logistic Regression on AVG W2V
Applying Logistic Regression with L1 regularization on AVG W2V
Applying Logistic Regression with L2 regularization on AVG W2V
Logistic Regression on TFIDF W2V
Applying Logistic Regression with L1 regularization on TFIDF W2V
Applying Logistic Regression with L2 regularization on TFIDF W2V
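The following is a representative sketch for one of the combinations above (Logistic Regression with L1 and L2 regularization on BoW), tuning the inverse regularization strength C with grid search; it is not the exact code used for the reported results, and it reuses X_train/y_train from the earlier snippets.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

for penalty in ('l1', 'l2'):
    # liblinear supports both L1 and L2 penalties
    lr = LogisticRegression(penalty=penalty, solver='liblinear', max_iter=1000)
    grid = GridSearchCV(lr, {'C': [0.001, 0.01, 0.1, 1, 10, 100]},
                        cv=5, scoring='roc_auc')
    grid.fit(X_train, y_train)
    print(penalty, 'best C =', grid.best_params_['C'],
          'CV AUC =', round(grid.best_score_, 3))
```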
Conclusions
4. SVM
Applying Linear SVM on BOW + L1 Regularization
Applying Linear SVM on BOW + L2 Regularization
Applying Linear SVM on TFIDF + L1 Regularization
Applying Linear SVM on TFIDF + L2 Regularization
Applying Linear SVM on AVG W2V + L1 Regularization
Applying Linear SVM on AVG W2V + L2 Regularization
Applying Linear SVM on TFIDF W2V + L1 Regularization
Applying Linear SVM on TFIDF W2V + L2 Regularization
Conclusions
5. Decision Trees
Applying Decision Trees on BOW
Applying Decision Trees on TFIDF
Applying Decision Trees on AVG W2V
Applying Decision Trees on TFIDF W2V
Conclusions:
6. Random Forest
Applying Random Forests on BOW
Applying Random Forests on TFIDF
Applying Random Forests on AVG W2V
Applying Random Forests on TFIDF W2V
Conclusions:
Result:
Logistic Regression performed well compared to the other models.
Model Deployment
Often people ignore this step, but it is one of the most important steps in the data science life cycle. I am going to deploy the model using Flask and an AWS EC2 instance.
Prerequisites :
- Knowledge about AWS EC2 instance
- Python Flask
- HTML, CSS, JavaScript
- Linux commands like ssh, scp
On the AWS side, one first needs to create a free-tier EC2 instance running Ubuntu Server and launch it. Make sure the security group has an inbound rule allowing “All Traffic” on “All Ports”. Your app needs to have low latency, so avoid reloading saved weights inside the “final_prediction” function: read all the files outside the function, otherwise it will cause undesirable behaviour.
After the EC2 instance has been created successfully, connect to it using the following command:
ssh -i "deployment.pem" ubuntu@ec2-3-21-156-107.us-east-2.compute.amazonaws.com
Copy the deployment_new folder to the instance using the following command:
scp -i "deployment.pem" -r deployment_new ubuntu@ec2-3-21-156-107.us-east-2.compute.amazonaws.com:~/deployment_new
This EC2 instance might not have all the libraries required to run the app.py file. Install the required libraries using pip3 install <library name>. Finally, to run the app, use the command below:
nohup python3 app.py &
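For reference, here is a minimal, illustrative app.py sketch. The pickle file names, the form field and the routes are assumptions based on the description above; the key point is that the model and vectorizer are loaded once at module level rather than inside the prediction function.

```python
from flask import Flask, request, render_template
import joblib

app = Flask(__name__)

# Load saved weights OUTSIDE the prediction function to keep per-request latency low
vectorizer = joblib.load('tfidf_vectorizer.pkl')   # assumed file name
model = joblib.load('logistic_regression.pkl')     # assumed file name

@app.route('/index')
def index():
    return render_template('index.html')

@app.route('/final_prediction', methods=['POST'])
def final_prediction():
    review = request.form['review_text']            # assumed form field name
    vector = vectorizer.transform([review])
    label = 'Positive' if model.predict(vector)[0] == 1 else 'Negative'
    return render_template('index.html', prediction=label)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```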
Here is the link to my app, you can try it out: http://ec2-18-222-1-146.us-east-2.compute.amazonaws.com:8080/index
Conclusion
After tuning many models, we were able to achieve a highest AUC of 0.82. I hope this case study helps you understand sentiment analysis better.
Future Work
In this blog I focused on classical machine learning algorithms. Deep learning approaches like Transformers and BERT could also be applied.
Source Code
https://github.com/INZA111/Sentiment-Analysis-on-Movie-Reviews