How to detect sarcasm using machine learning

Can you tell when someone is taking the piss, even when they haven't used a winking smiley? In this project we'll use machine learning to find out.

Nice suit, mate. Picture by Andrea Piacquadio, Pexels.

I love sarcasm, but unfortunately I have a shaky ability to detect it in other people's voices, an aptitude for mistaking serious comments for sarcasm and then laughing at them inappropriately, and a gift for sounding sarcastic when, in fact, I am being totally serious.

While I find this infelicitous trait a rather amusing affliction, it can cause raised eyebrows in meetings, and it means I would be ill-advised to compliment a woman on her appearance. This plight is perhaps nowhere better depicted than in an interview with Dr Oliver Sacks.

Detecting sarcasm depends on tone of voice, body language, and facial expressions, as well as the context provided by the preceding words. For example, “What a fantastic musician!” would be a compliment if I said it after a comment about Noel Gallagher, but would clearly be sarcastic if said after a comment on, say, Kanye West.

Detecting sarcasm is much harder in text, as none of these additional cues are available. This is why emails are often misinterpreted, and why we’ve adopted winking smileys to denote sarcasm. In this project, we’ll see if we can build a sarcasm detection model using machine learning. Let’s get started. This could be the most fun you have all day!*

* Not sarcasm.

Load the packages

Open a Jupyter notebook and load up the packages below. We’ll keep things simple and use a single model, Multinomial Naive Bayes, together with scikit-learn’s CountVectorizer and a range of tools for evaluating model performance.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
pd.set_option('max_colwidth', 100)

Load the data

For this project I’ve used the News Headlines Dataset for Sarcasm Detection. As a pedant, I would point out that these headlines are technically satirical rather than sarcastic, but I’ll let this slide for the purposes of this demonstration. The file stores one JSON object per line, so load it into Pandas using read_json() with lines=True. Then use the Pandas rename() function to change the headline column to text.

# The file is JSON Lines: one headline record per line
df = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)
df.rename(columns={'headline': 'text'}, inplace=True)
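If you haven't come across JSON Lines before, the sketch below shows what lines=True does. Rather than reading the real file, it parses two records held in an io.StringIO buffer. The field names match the dataset's columns and the headlines are taken from the listings further down, but the example.com URLs are hypothetical placeholders.

```python
import io

import pandas as pd

# Two records in JSON Lines layout: one JSON object per line.
# The example.com URLs are hypothetical placeholders, not real article links.
raw = io.StringIO(
    '{"article_link": "https://example.com/1", '
    '"headline": "top snake handler leaves sinking huckabee campaign", "is_sarcastic": 1}\n'
    '{"article_link": "https://example.com/2", '
    '"headline": "airline passengers tackle man who rushes cockpit in bomb threat", "is_sarcastic": 0}\n'
)

# lines=True tells pandas to parse each line as a separate record
sample = pd.read_json(raw, lines=True)
sample = sample.rename(columns={'headline': 'text'})

print(sample.columns.tolist())
print(sample.shape)
```

Each line becomes one row, and the keys become the columns, so the real file produces the same three columns with one row per headline.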

The data consist of real news headlines from The Huffington Post and satirical headlines from The Onion. This is a very simple dataset, so we can’t examine preceding text to better determine whether a phrase is sarcastic, but we should still be able to pick up some sarcastic or satirical nuances.

text is_sarcastic
0 former versace store clerk sues over secret 'black code' for minority shoppers 0
1 the 'roseanne' revival catches up to our thorny political mood, for better and worse 0
2 mom starting to fear son's web series closest thing she will have to grandchild 1
3 boehner just wants wife to listen, not come up with alternative debt-reduction ideas 1
4 j.k. rowling wishes snape happy birthday in the most magical way 0

The dataset includes 26,709 headlines, of which 11,724 are sarcastic and 14,985 are genuine; we can check this easily using df.shape and the value_counts() function. The data are therefore slightly imbalanced, at roughly 56% genuine to 44% sarcastic. That’s a mild imbalance, but it can still nudge the model towards favouring the dominant class, so we’ll at least make sure the training and test sets preserve this ratio.

(26709, 3)
0    14985
1    11724
Name: is_sarcastic, dtype: int64
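To express the imbalance as proportions rather than raw counts, value_counts() accepts normalize=True. The sketch below doesn't load the file; it rebuilds a stand-in for df['is_sarcastic'] from the counts reported above.

```python
import pandas as pd

# Stand-in for df['is_sarcastic'], rebuilt from the class counts reported above
labels = pd.Series([0] * 14985 + [1] * 11724, name='is_sarcastic')

print(labels.shape)            # number of headlines
print(labels.value_counts())   # raw class counts

# normalize=True returns each class's share of the total instead of counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

This also gives a useful baseline. A model that always predicted "not sarcastic" would be right about 56% of the time, which puts the accuracy scores later on in context.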

Examine the data

To see what we’re dealing with, let’s filter the dataframe to show non-sarcastic and sarcastic headlines separately and see how they differ. Looking at the apparently non-sarcastic (non-satirical) headlines, it’s clear that this might be a bit challenging. Apparently, “j.k. rowling wishes snape happy birthday in the most magical way” was a real headline…

df_serious = df[df['is_sarcastic']==0].head(10)
0 former versace store clerk sues over secret 'black code' for minority shoppers
1 the 'roseanne' revival catches up to our thorny political mood, for better and worse
4 j.k. rowling wishes snape happy birthday in the most magical way
5 advancing the world's women
6 the fascinating case for eating lab-grown meat
7 this ceo will send your kids to school, if you work for his company
9 friday's morning email: inside trump's presser for the ages
10 airline passengers tackle man who rushes cockpit in bomb threat
11 facebook reportedly working on healthcare features and apps
12 north korea praises trump and urges us voters to reject 'dull hillary'

Similarly, some of the sarcastic or satirical headlines will surely be hard to spot. “ex-con back behind bar” could be a legit headline, for example. Quotes, question marks, and exclamation points look like they might be more common in the sarcastic/satirical headlines, so these might be a useful feature.

df_sarcastic = df[df['is_sarcastic']==1].head(10)
2 mom starting to fear son's web series closest thing she will have to grandchild
3 boehner just wants wife to listen, not come up with alternative debt-reduction ideas
8 top snake handler leaves sinking huckabee campaign
15 nuclear bomb detonates during rehearsal for 'spider-man' musical
16 cosby lawyer asks why accusers didn't come forward to be smeared by legal team years ago
17 stock analysts confused, frightened by boar market
20 courtroom sketch artist has clear manga influences
21 trump assures nation that decision for syrian airstrikes came after carefully considering all hi...
27 ex-con back behind bar
28 after careful consideration, bush recommends oil drilling
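Before engineering features from punctuation, you could test that hunch by comparing how often each mark appears in each class. The sketch below does this on a handful of headlines copied from the listings in this article; with so few rows the numbers mean nothing, so in practice you'd run the same code on the full df. Note that a plain ' check also matches apostrophes, as in "world's".

```python
import pandas as pd

# A few headlines (with their labels) copied from the listings in this article
sample = pd.DataFrame({
    'text': [
        "former versace store clerk sues over secret 'black code' for minority shoppers",
        "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
        "advancing the world's women",
        "nuclear bomb detonates during rehearsal for 'spider-man' musical",
        "stock analysts confused, frightened by boar market",
        "ex-con back behind bar",
        "overpopulation of the earth: will it create valuable new markets?",
    ],
    'is_sarcastic': [0, 0, 0, 1, 1, 1, 1],
})

# Flag (1/0) whether each headline contains each mark; regex=False matches literally
marks = {'exclamation': '!', 'question': '?', 'quote': "'"}
for name, mark in marks.items():
    sample[f'has_{name}'] = sample['text'].str.contains(mark, regex=False).astype(int)

# Share of headlines in each class that contain each mark
rates = sample.groupby('is_sarcastic')[['has_exclamation', 'has_question', 'has_quote']].mean()
print(rates)
```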

Feature engineering

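Next we’ll create some features. One trick I’ve seen used several times is to replace any punctuation with a word representing it, so that count vectorization picks it up (by default, CountVectorizer’s tokenizer discards punctuation). This seems a really hacky approach to me; however, it does work quite well, even when you apply stemming or lemmatization.

# Series.replace() only matches whole cell values, so use .str.replace() for substrings;
# regex=False makes '?' a literal character rather than a regex quantifier
df['text'] = df['text'].str.replace('!', ' exclamation ', regex=False)
df['text'] = df['text'].str.replace('?', ' question ', regex=False)
df['text'] = df['text'].str.replace("'", ' quotation ', regex=False)
df['text'] = df['text'].str.replace('"', ' quotation ', regex=False)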
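It’s easy to mix up Series.replace(), which compares whole cell values, with Series.str.replace(), which matches substrings. Here is a small check on two invented headlines. It confirms that Series.replace() leaves the text untouched, and that once the marks are turned into words, CountVectorizer’s default tokenizer keeps them as features.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two invented headlines for illustration
headlines = pd.Series(["what a fantastic musician!", "is this the end?"])

# Series.replace() compares whole values, so neither headline changes
untouched = headlines.replace('!', ' exclamation ')

# Series.str.replace() works on substrings; regex=False treats '?' literally
swapped = (headlines
           .str.replace('!', ' exclamation ', regex=False)
           .str.replace('?', ' question ', regex=False))

# The default token_pattern keeps runs of 2+ word characters, so raw
# punctuation (and the one-letter word "a") never becomes a feature
raw_vocab = sorted(CountVectorizer().fit(headlines).vocabulary_)
new_vocab = sorted(CountVectorizer().fit(swapped).vocabulary_)
print(raw_vocab)
print(new_vocab)
```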

Create a Bag of Words

Next we’ll create a Bag of Words (BOW). We simply instantiate CountVectorizer() and call fit_transform() on the text column. This returns a SciPy sparse matrix with one row per headline and one column per word in the vocabulary. MultinomialNB accepts sparse input directly, so there’s no need to convert it to a dense array, which would use far more memory.

Ordinarily, I would recommend using additional NLP preprocessing techniques, such as custom tokenization, Porter stemming, and lemmatization. However, after trying them, I actually obtained the best results with simple count vectorization, so the model used here is very simple.

count_vec = CountVectorizer()
# Sparse document-term matrix: rows are headlines, columns are vocabulary words
bow = count_vec.fit_transform(df['text'])

Split the data

Next we’ll pass the bow Bag of Words to X to use as our feature set, then define is_sarcastic as our target variable. We’ll use a stratified split (stratify=y) so the training and test sets contain the same proportion of each class, and will set the test size to 30% of the total data.

X = bow
y = df['is_sarcastic']
# 30% test set, stratified on the target so both splits keep the ~56/44 class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
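As a sanity check on the stratified split, this sketch rebuilds a label vector with the dataset’s class counts, adds a placeholder feature column, and splits it the same way. The test set should hold 8,013 headlines with close to the original 56/44 ratio, in line with the support figures in the classification report below. random_state=0 is an arbitrary choice to make the sketch repeatable; it isn’t taken from the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the reported class counts; X is a placeholder feature column
y = np.array([0] * 14985 + [1] * 11724)
X = np.zeros((len(y), 1))

# random_state=0 is an arbitrary choice, used only to make this sketch repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(len(y_test), int(y_test.sum()))                      # test size, sarcastic rows in test
print(round(y_train.mean(), 3), round(y_test.mean(), 3))   # share of sarcastic rows per split
```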

Fit the model

Again, there are various models you can use for this, but Multinomial Naive Bayes is a fast and strong baseline for word-count features like these. We’ll fit the MultinomialNB() model to the training data and then generate some predictions on the unseen test data.

model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
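If you’d rather not manage the vectorizer and the model separately, scikit-learn’s make_pipeline() can chain them, so raw text goes in and predictions come out. This is a swapped-in convenience rather than the approach used above. It’s shown here on four invented headlines with made-up labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training headlines: 0 = genuine, 1 = satirical
texts = [
    "senate passes budget bill",
    "court rules on budget case",
    "area man thinks he is senate",
    "local man wins nothing again",
]
labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the model in one call
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

# Unseen words such as "ruling" are simply ignored by the fitted vectorizer
pred = pipe.predict(["area man wins", "budget court ruling"])
print(pred)
```

A pipeline also makes it harder to leak information from the test set, because the vocabulary is learned only from whatever data you pass to fit().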

Evaluate model performance

Finally, we can assess the performance of the model by examining how well it did on the test data. For this I’ve used the accuracy score, the F1 score, and the ROC AUC score. Results are pretty good for a simple model: it turns out we can detect sarcasm (or satire) even without any preceding context.

print('Accuracy:', accuracy_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred, average="macro"))
print('ROC AUC:', roc_auc_score(y_test, y_pred))
Accuracy: 0.8478722076625483
F1 score: 0.8444177618637974
ROC AUC: 0.8422391318425907

We get 84.79% accuracy with a ROC AUC of 0.84, which is pretty decent for a first attempt. Given that we skipped some of the usual steps for brevity (i.e. model selection and tuning), there is likely room to improve on this score with a bit more work.
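One caveat about the ROC AUC figure above: roc_auc_score() is meant to receive a continuous score, such as the probability from model.predict_proba(X_test)[:, 1]. When it is given hard 0/1 predictions, it reduces to the balanced accuracy, the mean of the two per-class recalls. That is why it lands so close to the accuracy here: (0.89 + 0.80) / 2 is about 0.84. The toy example below uses invented labels and scores to show the difference.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Invented true labels and predicted probabilities of the sarcastic class
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.6, 0.35, 0.8, 0.9])
y_hard = (y_score >= 0.5).astype(int)   # what model.predict() would return

auc_from_scores = roc_auc_score(y_true, y_score)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_hard)    # only two distinct scores
balanced = balanced_accuracy_score(y_true, y_hard)

print(auc_from_scores, auc_from_labels, balanced)
# With the model above: roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```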

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.85      0.89      0.87      4496
           1       0.85      0.80      0.82      3517

    accuracy                           0.85      8013
   macro avg       0.85      0.84      0.84      8013
weighted avg       0.85      0.85      0.85      8013

Where did it go wrong?

To see where the model went wrong, we can create a new dataframe of results containing the y_pred predictions and the y_test actual values, then join it to the original dataframe on the index. This lets us see each headline alongside its predicted class and its actual class.

results = pd.DataFrame(data={'predicted': y_pred, 'actual': y_test})
predictions = results.join(df)

To filter the data according to whether the model predicted the class correctly, we can write a small function, apply() it to each row via a lambda, and keep only the columns we need in a summary dataframe.

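def is_correct(predicted, actual):
    """Return True when the predicted class matches the actual class."""
    return predicted == actual

# Row-wise apply; predictions['predicted'] == predictions['actual'] is a faster vectorised equivalent
predictions['correct'] = predictions.apply(lambda x: is_correct(x.predicted, x.actual), axis=1)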
predictions = predictions[['text','predicted','actual','correct']]

Examining a random sample of the errors hints at why the model failed. Some of these would take careful thought from a human to classify correctly. The same approach we’ve applied here can, of course, be used for almost any kind of text classification task, including fake news detection.

text predicted actual correct
22795 troy aikman: i 'knock on wood' hoping i stay healthy after concussions 1 0 False
17199 north korea successfully detonates nuclear scientist 0 1 False
13790 overpopulation of the earth: will it create valuable new markets? 0 1 False
11416 chuck todd imitates yoda -- and it's actually pretty good 1 0 False
1211 police seek suspect in series of random later hostings 0 1 False
4596 national board of steve jaskoviak requests $10 billion bailout 0 1 False
24638 constructionist supreme court to revisit women's suffrage 0 1 False
26414 andrew w.k. submits the necessary paperwork to form 'the party party' 1 0 False
22180 taliban leaders already know which westernized schools the first to go as soon as u.s. troops le... 0 1 False
26523 hillary's last name dropped from senate race 0 1 False
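The code that pulls out a sample like this isn’t shown above, so here is one way to do it. It filters the misclassified rows, draws a random sample, and uses pd.crosstab() to split the errors into false positives and false negatives. The first two rows and their labels are copied from the table above; the predictions for the last two rows are invented for illustration.

```python
import pandas as pd

# Mini results table: the first two rows come from the error sample above;
# predictions for the last two rows are invented for illustration
predictions = pd.DataFrame({
    'text': [
        "north korea successfully detonates nuclear scientist",
        "chuck todd imitates yoda -- and it's actually pretty good",
        "ex-con back behind bar",
        "advancing the world's women",
    ],
    'predicted': [0, 1, 1, 0],
    'actual': [1, 0, 1, 0],
})
predictions['correct'] = predictions['predicted'] == predictions['actual']

# Keep only the misclassified rows, then draw a random sample of them
errors = predictions[~predictions['correct']]
print(errors.sample(n=2, random_state=1))

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(predictions['actual'], predictions['predicted'])
print(confusion)
```

On the full predictions dataframe, raising n to 10 gives a sample like the one shown above.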

Matt Clarke, Friday, March 12, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.