Sentiment analysis, or opinion mining, is a form of emotion AI and uses natural language processing and computational linguistics to analyse text and infer the sentiment. Sentiment analysis has loads of practical uses.
If you use Grammarly, you may have noticed it predicting the tone of your emails when you reply, which is a great way to ensure they aren’t misinterpreted. I recently used sentiment analysis to examine product and service reviews from a range of online retailers to identify the service and product issues that matter most to consumers. It’s very powerful.
In the tutorial below, we’ll be training a recurrent neural network model using Keras and TensorFlow. Keras is a Python package for deep learning that can be run on top of various other systems and it makes interfacing with Google’s TensorFlow deep learning platform a little easier. TensorFlow is really doing the hard work here and it’s an immensely powerful system, being widely used in businesses and research, as well as behind the scenes at Google.
TensorFlow can be used to run a range of models, but the one we’ll be using is Long Short-Term Memory (LSTM), which is a type of recurrent neural network (RNN).
Before you start, you’ll probably need to install Keras and TensorFlow, which you can do by entering the commands
pip3 install keras and
pip3 install tensorflow. This may take a short while as the TensorFlow libraries come in at around 300MB. If you have an Nvidia GPU, it’s worth making sure TensorFlow can use it, as training runs much, much faster. Older releases needed the separate
tensorflow-gpu package for this, but since TensorFlow 2.1 the bog standard tensorflow package has included GPU support.
Once they’re installed, load up the packages we’ll be using. We’ll be using the Long Short-Term Memory (LSTM) recurrent layer from
keras.layers, along with the Dense and Embedding layers, plus the Sequential model, the sequence preprocessing module and the built-in imdb dataset.
from keras.layers import LSTM, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.datasets import imdb
Next, we’ll load up the data. Rather helpfully, Keras comes with some built-in datasets. We’re using the IMDB movie review sentiment classification dataset. This contains 25,000 training reviews and 25,000 test reviews from the IMDB website, which have been labeled with their sentiment (positive or negative). The model will examine the text in the training data and learn which characteristics define positive or negative sentiments.
The really handy thing about the IMDB data set provided in Keras is that the data have already been preprocessed. Before you give text to an RNN, you need to preprocess it to turn it into numeric data.
So, rather than containing the actual text of the reviews, the data set contains lists of integers, one per word, that can be used by the neural network. This also means that, rather than the usual Pandas dataframe, the
load_data() function here returns two tuples of NumPy arrays, one for training and one for testing. If we set the
num_words argument, we’ll limit the vocabulary to the most frequent words to save time.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=20000)
If you print the
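X_train data, you’ll see that it just contains a load of lists of numbers. Each number represents a word ranked by its frequency, so lower numbers mean more common words. With the default load_data() settings the ranks are shifted up by 3 to make room for three reserved values: 0 is used for padding, 1 marks the start of a review and 2 stands in for any unknown word (including those cut off by num_words). That makes 4 the most common word in the dataset. The labels stored in
y_train are just 1s and 0s denoting whether the sentiment of the text was positive (1) or negative (0).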
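To make the encoding a bit more concrete, here is a minimal sketch of how a list of integers maps back to words. The tiny toy_word_index vocabulary and the decode_review() helper are hypothetical, made up purely for illustration, and the offset of 3 with the reserved values 0, 1 and 2 is assumed from the Keras defaults for load_data().

```python
# Hypothetical toy vocabulary: word -> frequency rank (1 = most common).
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}

# Assumed Keras defaults: 0 = padding, 1 = start of review, 2 = unknown word,
# and every word rank is shifted up by 3.
INDEX_FROM = 3
RESERVED = {0: "<pad>", 1: "<start>", 2: "<unk>"}


def decode_review(encoded, word_index):
    """Turn a list of integers back into a readable string."""
    rank_to_word = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    return " ".join(
        RESERVED.get(i) or rank_to_word.get(i, "<unk>") for i in encoded
    )


encoded_review = [1, 4, 5, 6, 7, 2]
decoded = decode_review(encoded_review, toy_word_index)
print(decoded)
```

With the real data, the word-to-rank mapping comes from imdb.get_word_index() rather than a hand-written dictionary.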
As recurrent neural networks can take a long time to train, and this dataset is fairly large, we can use the Keras preprocessing sequence package’s
pad_sequences() function to modify the data and speed things up. The
pad_sequences() function essentially makes all of the sequences the same length by padding zeros at the beginning (the default) or end. The
maxlen argument sets that length and truncates any sequences that are longer. We’ll limit our sequences to 100 words to see if this improves the speed.
X_train = sequence.pad_sequences(X_train, maxlen=100)
X_test = sequence.pad_sequences(X_test, maxlen=100)
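If it helps to see what the padding and truncating actually does, here is a rough pure-Python sketch. The pad_or_truncate() helper is hypothetical and only mimics what I understand to be the default behaviour of pad_sequences(), where both padding and truncation happen at the start of the sequence. It is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic pad_sequences() with 'pre' padding and 'pre' truncation."""
    if len(seq) >= maxlen:
        # Too long: keep only the last maxlen items.
        return seq[-maxlen:]
    # Too short: add zeros on the left until it is maxlen long.
    return [value] * (maxlen - len(seq)) + seq


short_review = pad_or_truncate([5, 8, 13], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)  # [0, 0, 5, 8, 13]
print(long_review)   # [3, 4, 5, 6, 7]
```

Note that with these defaults a long review loses its opening words rather than its closing ones.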
Now we have our data, and it’s rather helpfully been preprocessed, we can move on to the creation of the neural network. The specific neural network we’re using to analyse review sentiment is a recurrent neural network called LSTM or Long Short-Term Memory. This model is really useful and can be used for a variety of things, including time series analysis, anomaly detection and speech and handwriting recognition.
First, we’ll create the Sequential model and set up something called the embedding layer. This basically converts each integer word index into a “dense vector” of a fixed size, with the values learned during training, so it’s more convenient for the neural network to handle. The embedding layer has a vocabulary size of 20000 words (because that’s the
num_words argument we passed when we loaded up the data), while the 128 value denotes a 128 unit output dimension.
We then add the LSTM layer, set the dropout rates and finally use
Dense and the sigmoid function to output a probability between 0 and 1 that the sentiment is positive.
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
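To get a feel for the size of the network we’ve just defined, here is a quick back-of-the-envelope sketch of how many trainable weights each layer holds. It assumes the standard formulas for an embedding layer, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer, so treat the figures as an estimate.

```python
vocab_size = 20000   # num_words passed to load_data()
embed_dim = 128      # output dimension of the Embedding layer
lstm_units = 128     # units in the LSTM layer

# Embedding: one 128-value vector per word in the vocabulary.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

You can compare these numbers against the table printed by model.summary(). Most of the weights sit in the embedding layer, which is why trimming the vocabulary with num_words makes such a difference.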
The next step is to use
compile() to determine how we run the model. Just as in scikit-learn, Keras lets you
define the settings the model uses. There are three main things to configure with the
compile() function: the losses, the metrics and the optimizers.
Losses, or loss functions, tell the model what it should try to reduce during the training process. For regression problems that might be the mean squared error or mean absolute error, while for probabilistic problems, like this one, you’d use something like binary cross entropy, categorical cross entropy or the Poisson loss. As we only have two classes here, we’re using
binary_crossentropy. Keras supports several different optimizers, which are suited to different tasks. I’ve tried
nadam here and found it a little better than
adam, although the code below sticks with adam. Finally, as with scikit-learn, you can also define the metric by which you are judging your model’s performance -
accuracy is fine for this task.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
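For the curious, here is a minimal sketch of what the sigmoid activation and the binary cross entropy loss are calculating. These are plain-Python versions of the textbook formulas, written for illustration only. The function names are my own and Keras has its own optimised implementations.

```python
import math


def sigmoid(x):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy between 0/1 labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid taking log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident and right gives a small loss, confident and wrong a large one.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss punishes confident mistakes heavily, which explains how a model’s loss can climb even when its accuracy barely moves.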
With everything now set up and ready to go, we can fit our model to the training data. The
batch_size argument tells the model how many samples to “propagate” through the neural network before each weight update, while the
epochs argument tells Keras how many complete passes to make over the training data. Unlike a fold in K folds cross validation, each epoch carries on training the same model on the same data. The validation_data argument tells Keras to score the model on the test data at the end of each epoch. The
verbose argument prints out the results as they happen, which is handy, because this is going to take a while to run. On my overclocked 4GHz Ryzen 3700X data science workstation with 64GB of RAM this takes just over 10 minutes to complete.
model.fit(X_train, y_train, batch_size=32, epochs=10, verbose=2, validation_data=(X_test, y_test))
Epoch 1/10
782/782 - 64s - loss: 0.4305 - accuracy: 0.7977 - val_loss: 0.3509 - val_accuracy: 0.8442
Epoch 2/10
782/782 - 63s - loss: 0.2433 - accuracy: 0.9036 - val_loss: 0.4023 - val_accuracy: 0.8279
Epoch 3/10
782/782 - 63s - loss: 0.1547 - accuracy: 0.9427 - val_loss: 0.4512 - val_accuracy: 0.8388
Epoch 4/10
782/782 - 63s - loss: 0.1077 - accuracy: 0.9623 - val_loss: 0.6039 - val_accuracy: 0.8344
Epoch 5/10
782/782 - 63s - loss: 0.0766 - accuracy: 0.9731 - val_loss: 0.6015 - val_accuracy: 0.8319
Epoch 6/10
782/782 - 63s - loss: 0.0499 - accuracy: 0.9835 - val_loss: 0.6356 - val_accuracy: 0.8315
Epoch 7/10
782/782 - 63s - loss: 0.0445 - accuracy: 0.9844 - val_loss: 0.7427 - val_accuracy: 0.8304
Epoch 8/10
782/782 - 63s - loss: 0.0319 - accuracy: 0.9895 - val_loss: 0.7467 - val_accuracy: 0.8119
Epoch 9/10
782/782 - 63s - loss: 0.0243 - accuracy: 0.9924 - val_loss: 0.8662 - val_accuracy: 0.8254
Epoch 10/10
782/782 - 63s - loss: 0.0176 - accuracy: 0.9943 - val_loss: 0.9792 - val_accuracy: 0.8232
<tensorflow.python.keras.callbacks.History at 0x7fe551a5f1f0>
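If you’re wondering where the 782 in the output comes from, it’s the number of batches needed to get through the training data once. Here’s a quick sketch of the arithmetic, assuming the 25,000 training reviews in the Keras IMDB data and the batch size of 32 we passed to fit().

```python
import math

train_samples = 25000  # reviews in the IMDB training split
batch_size = 32        # samples per weight update
epochs = 10            # full passes over the training data

# The final batch is smaller than the rest, hence rounding up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```

A bigger batch size means fewer steps per epoch, which usually speeds things up at the cost of more memory.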
Assuming your computer didn’t catch fire under the strain, you should now have seen the results of each epoch appear on your screen showing you the
val_loss and the
val_accuracy for each round.
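A quick way to make sense of these numbers is to find the epoch where the validation scores were at their best. The sketch below simply copies the val_loss and val_accuracy values from the output above into plain Python lists, so no model or retraining is needed.

```python
# Validation scores copied from the training log, one value per epoch.
val_loss = [0.3509, 0.4023, 0.4512, 0.6039, 0.6015,
            0.6356, 0.7427, 0.7467, 0.8662, 0.9792]
val_accuracy = [0.8442, 0.8279, 0.8388, 0.8344, 0.8319,
                0.8315, 0.8304, 0.8119, 0.8254, 0.8232]

# Epochs are numbered from 1, list positions from 0.
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_accuracy_epoch = val_accuracy.index(max(val_accuracy)) + 1
print(best_loss_epoch, best_accuracy_epoch)
```

Both point to the very first epoch, while the training accuracy keeps climbing afterwards, which is a classic sign of overfitting. In a real run, the same per-epoch values are available from the history attribute of the History object that fit() returns.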
Unlike K folds cross validation, each epoch continues training the same model, so these scores track its progress rather than giving independent estimates. Here the training loss keeps falling while val_loss rises after the first epoch, which suggests the model is overfitting. To assess the final model, Keras includes an
evaluate() function which makes this very straightforward. Simply pass it the X_test and
y_test data and it will return the loss and accuracy.
We get an overall accuracy of 0.8232 and a loss of 0.9792, matching the validation scores from the final epoch, which looks pretty good for a first attempt. You can try tweaking the
compile() settings to see if you can generate any further improvements.
score, accuracy = model.evaluate(X_test, y_test, batch_size=32, verbose=2)
782/782 - 7s - loss: 0.9792 - accuracy: 0.8232
To examine the predictions we can use
model.predict(). Here, we’ll get predictions for the first five rows in the
X_test data. The predictions are returned as probabilities (so this is effectively a bit like
predict_proba() in scikit-learn), so anything under 0.5 is negative in sentiment and anything above 0.5 is positive. Although this is a preprocessed dataset, we can still check these predictions against the true labels in y_test[:5], and the word index returned by imdb.get_word_index() can be used to turn the numbers back into text.
predictions = model.predict(X_test[:5])
predictions
array([[0.8680382 ],
       [0.9999167 ],
       [0.6412297 ],
       [0.07330499],
       [0.9999981 ]], dtype=float32)
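To turn those probabilities into actual labels, you just apply the 0.5 threshold. Here’s a small sketch using the five values from the output above, copied into a plain Python list so it runs without the model. The variable names are my own.

```python
# Probabilities copied from the model.predict() output above.
probabilities = [0.8680382, 0.9999167, 0.6412297, 0.07330499, 0.9999981]

# Anything above 0.5 counts as positive (1), everything else as negative (0).
labels = [1 if p > 0.5 else 0 for p in probabilities]
names = ["positive" if label == 1 else "negative" for label in labels]
print(labels)
print(names)
```

So the model thinks four of the first five reviews are positive and one is negative, and it’s far less sure about the third review than the others.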
In the next article, I’ll explain the steps you can follow to preprocess data to get it ready for your machine learning model.
Matt Clarke, Tuesday, March 02, 2021