Python
CSS
HTML
JavaScript
Flask
This project sought to find an accurate sentiment analysis model for Ben & Jerry's ice cream reviews using machine learning classification. I created a dynamic web-based platform that accepts new reviews and classifies them in real time. The platform also collects user feedback, which lets the model improve continuously. The result is a site where the user can enter a review, see its predicted sentiment, and verify the prediction.
I started by determining the dataset that I could use as a base for my project. I found a Kaggle dataset that contained Ben & Jerry's reviews. The link to this dataset is:
https://www.kaggle.com/datasets/tysonpo/ice-cream-dataset
This dataset contains information about the product, the review author and date, the number of stars given out of five, the title, and the actual review. I only needed the review text and the number of stars for my project, so I removed the other columns. I also only wanted to classify the reviews by whether their sentiment was positive or negative, so I mapped ratings of 1 to 3 stars to 0 (negative) and ratings of 4 or 5 stars to 1 (positive). I saved this modified dataset to a new file.
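The label mapping described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the project: the column names `stars` and `text`, the helper names, and the three sample rows are all assumptions made for the example.

```python
import csv
import io


def stars_to_label(stars: int) -> int:
    """Map a 1-5 star rating to a binary label: 0 = negative, 1 = positive."""
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    return 1 if stars >= 4 else 0


def reduce_reviews(rows):
    """Keep only the review text and its binary sentiment label."""
    return [
        {"review": row["text"], "sentiment": stars_to_label(int(row["stars"]))}
        for row in rows
    ]


# Tiny in-memory stand-in for the Kaggle CSV (column names are assumed).
SAMPLE_CSV = """author,stars,title,text
a,5,Yum,Best flavor ever
b,3,Meh,Too sweet for me
c,1,No,Icy and bland
"""

reduced = reduce_reviews(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in reduced:
    print(row)
```

Note that 3-star reviews land in the negative class under this mapping, which matches the split described above.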
Next, the data needed to be cleaned. I defined a function that cleaned the text by removing all of the unwanted characters and converting the text to lowercase. I then used Scikit Learn's Count Vectorizer to create a Bag of Words model based on the review text. A term frequency-inverse document frequency (TF-IDF) transformation was applied to these counts to downweight the frequently occurring words in the reviews, which makes words like 'the' or 'this' less important. Additionally, I tokenized the text and stemmed the tokens so that words like 'talking' would be shortened to simply 'talk.' I removed the stop words and then created a training and test dataset.
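A cleaning function of the kind described above might look like the following sketch. Which characters count as "unwanted" is an assumption here (HTML tags, digits, and punctuation), and the function names are hypothetical. Stemming is left out because it needs a third-party stemmer.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase the text and replace everything except letters with spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # drop stray HTML tags
    text = re.sub(r"[^a-z]+", " ", text)  # drop digits, punctuation, symbols
    return text.strip()


def tokenize(text: str) -> list:
    """Split cleaned text into word tokens on whitespace."""
    return clean_text(text).split()


print(clean_text("So GOOD!!! <br />10/10, would eat again."))
print(tokenize("Ben & Jerry's ROCKS"))
```

One consequence of this simple rule is that apostrophes split words, so "Jerry's" becomes the two tokens "jerry" and "s".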
To optimize the machine learning model, I used a grid search to find the best values for each parameter in the pipeline. This pipeline included the cleaning function that I created, the Count Vectorizer, the TF-IDF transformer that downweights uninformative words, and a stochastic gradient descent (SGD) classification algorithm. The grid search tried different options for each element in the pipeline. For the SGD Classifier, the alpha value, type of loss, and penalty type were varied. For the Count Vectorizer, the use of stop words and the n-gram range were varied. Finally, for the TF-IDF transformer, the use of the 'idf' term was varied. The grid search determined that the best parameters for each element were:
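For readers who want to see the shape of such a search, here is a minimal sketch using scikit-learn's `Pipeline` and `GridSearchCV`. The candidate values, the step names (`vect`, `tfidf`, `clf`), the `clean_text` helper, and the eight toy reviews are all illustrative assumptions; they are not the grid or the data used in the project.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def clean_text(text):
    """Hypothetical cleaner: lowercase and keep letters only."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


pipeline = Pipeline([
    ("vect", CountVectorizer(preprocessor=clean_text)),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Illustrative candidates only; the project's actual grid is not shown here.
param_grid = {
    "vect__stop_words": [None, "english"],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-4, 1e-3],
    "clf__loss": ["hinge", "modified_huber"],
    "clf__penalty": ["l2", "l1"],
}

# Toy stand-in data: 1 = positive, 0 = negative.
texts = [
    "Creamy, delicious chocolate. Loved it!",
    "Best cookie dough flavor ever",
    "Amazing caramel swirl, so rich",
    "Perfect texture and great taste",
    "Icy, bland, and stale",
    "Terrible aftertaste, very disappointing",
    "Hardly any chunks, awful value",
    "Melted and refrozen, gross texture",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

The `step__parameter` naming lets one grid address every element of the pipeline, so the vectorizer, transformer, and classifier settings are tuned jointly rather than one at a time.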
This optimized pipeline performed well, with an accuracy of about 92%.
I used an out-of-core learning method to perform the complete process of cleaning, processing, streaming, vectorizing, and classifying the input data. Pickling the trained model then allowed me to create a Flask application based on a standalone Python file.
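The general pattern of out-of-core learning can be sketched as follows. This example assumes a `HashingVectorizer` (which is stateless and so works on streamed batches) together with `SGDClassifier.partial_fit`; the write-up does not say which vectorizer the out-of-core version used, and the helper names and toy records are hypothetical.

```python
import pickle
import re

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def clean_text(text):
    """Hypothetical cleaner: lowercase and keep letters only."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def stream_batches(records, batch_size):
    """Yield (texts, labels) batches from an iterable of (text, label) pairs."""
    texts, labels = [], []
    for text, label in records:
        texts.append(text)
        labels.append(label)
        if len(texts) == batch_size:
            yield texts, labels
            texts, labels = [], []
    if texts:  # emit the final, possibly smaller, batch
        yield texts, labels


# A stateless vectorizer needs no fitting, so batches can be transformed
# one at a time without ever loading the full dataset.
vectorizer = HashingVectorizer(
    preprocessor=clean_text, n_features=2**18, alternate_sign=False
)
classifier = SGDClassifier(random_state=0)

# Toy stand-in for a stream of (review, label) pairs read from disk.
records = [
    ("Creamy and delicious", 1),
    ("Icy and bland", 0),
    ("Best flavor ever", 1),
    ("Stale, awful aftertaste", 0),
    ("Loved every bite", 1),
]

for texts, labels in stream_batches(records, batch_size=2):
    classifier.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

# Serialize the trained classifier so a web app can load it later.
blob = pickle.dumps(classifier)
restored = pickle.loads(blob)
print(restored.predict(vectorizer.transform(["delicious and creamy"])))
```

In a real application the pickled bytes would be written to a file and loaded once when the web app starts.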
I wanted the web app to have a very summery feel, reminiscent of ice cream. I decided to use purple, yellow, and blue as my main color scheme, as shown below.
I thought having a more bubbly and playful font for the titles would further the feeling of summer and ice cream. I chose a font called Stylish and modified it to have an outer stroke and to be filled with the main yellow color. For the main header font, I chose Josefin Sans because it is playful yet readable. Finally, I chose a serif font called Corben for the body text.
The website informs users of how the model was built and encourages them to participate by writing a review. I created a popup form that allows users to write their review, see the estimated positivity, and correct the model if it produced an incorrect sentiment. This process allows the model to continuously improve and the dataset to grow over time. The following image shows the design of the website. The design was made in Figma and was then used to create the web app using HTML and CSS. Flask was used to connect the Python-based machine learning model to the website.
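The predict-then-correct loop described above could be wired up in Flask roughly as follows. This is a sketch under stated assumptions, not the site's actual code: the route names, JSON field names, and the in-memory `feedback_log` are hypothetical, and the classifier is seeded with two toy reviews where a real app would load its pickled model.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

app = Flask(__name__)

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
classifier = SGDClassifier(random_state=0)
# Seed with toy data so the sketch can predict; a real app would unpickle
# an already trained classifier instead.
classifier.partial_fit(
    vectorizer.transform(["delicious creamy chocolate", "icy bland and stale"]),
    [1, 0],
    classes=[0, 1],
)

feedback_log = []  # hypothetical stand-in for the growing dataset


@app.route("/predict", methods=["POST"])
def predict():
    """Return the predicted sentiment (0 or 1) for a submitted review."""
    review = request.get_json()["review"]
    label = int(classifier.predict(vectorizer.transform([review]))[0])
    return jsonify({"review": review, "sentiment": label})


@app.route("/feedback", methods=["POST"])
def feedback():
    """Update the model with the label the user confirmed or corrected."""
    data = request.get_json()
    predicted = int(data["sentiment"])
    # If the user says the prediction was wrong, flip the binary label.
    true_label = predicted if data["correct"] else 1 - predicted
    classifier.partial_fit(vectorizer.transform([data["review"]]), [true_label])
    feedback_log.append((data["review"], true_label))
    return jsonify({"stored_label": true_label})
```

Because the classifier supports incremental updates, each piece of feedback adjusts the model immediately without retraining from scratch.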