How to detect fake news with machine learning

Picture by Wesley Tingey, Unsplash.

Long before Donald Trump erroneously applied it to mean “news that he didn’t agree with”, the term “fake news” referred to disinformation and misleading editorial content. In recent years, it’s become widespread, helping influence voters and spread conspiracy theories.

The growth of fake news has also harmed the reputations of social networks that allow it to be distributed without clearly identifying it as false, so most are now tackling it by trying to remove or flag the stories, posts, or tweets they identify as disinformation.

In this project, we’ll apply NLP and machine learning techniques to see how hard it is to distinguish fake news from real news. Is it really that hard for social networks to identify and flag disinformation with a high degree of accuracy? (Spoiler alert: on this dataset, at least, it’s not!) You can even use the same approach to create a sarcasm detection model. Here’s how it’s done.

Load the packages

First, open a Jupyter notebook and import the packages below. We need quite a few for this project. We’ll be using the Natural Language Toolkit (NLTK) to preprocess the text, we’ll vectorize it using the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer from scikit-learn, and we’ll assess a range of classifiers from scikit-learn, XGBoost, and LightGBM to find the one best suited to the problem.

import time
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

Load the data

I’m using the Fake and real news dataset created by Clement Bisaillon. This is split into two files, one containing fake news and one containing real or true news, so we need to load them separately, label them, and then merge them into a single dataframe.

df_fake = pd.read_csv('Fake.csv')
df_true = pd.read_csv('True.csv')

df_fake['label'] = 1
df_true['label'] = 0

df = pd.concat([df_fake, df_true], axis=0)
df.head()

	title	text	subject	date	label
0	Donald Trump Sends Out Embarrassing New Year’...	Donald Trump just couldn t wish all Americans ...	News	December 31, 2017	1
1	Drunk Bragging Trump Staffer Started Russian ...	House Intelligence Committee Chairman Devin Nu...	News	December 31, 2017	1
2	Sheriff David Clarke Becomes An Internet Joke...	On Friday, it was revealed that former Milwauk...	News	December 30, 2017	1
3	Trump Is So Obsessed He Even Has Obama’s Name...	On Christmas day, Donald Trump announced that ...	News	December 29, 2017	1
4	Pope Francis Just Called Out Donald Trump Dur...	Pope Francis used his annual Christmas Day mes...	News	December 25, 2017	1
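
One detail worth checking after the concat step: pd.concat() keeps each input's original row labels unless told otherwise, so the merged dataframe ends up with duplicate index values. The sketch below uses two made-up frames (hypothetical stand-ins, not the real CSV contents) to show the effect and the ignore_index=True option that avoids it.

```python
import pandas as pd

# Hypothetical stand-ins for Fake.csv and True.csv
df_fake = pd.DataFrame({'title': ['fake a', 'fake b'], 'label': 1})
df_true = pd.DataFrame({'title': ['true a', 'true b'], 'label': 0})

# Default behaviour: each frame keeps its own 0..n-1 row labels
merged = pd.concat([df_fake, df_true], axis=0)
print(merged.index.tolist())        # [0, 1, 0, 1]

# ignore_index=True renumbers the rows 0..n-1
merged_clean = pd.concat([df_fake, df_true], axis=0, ignore_index=True)
print(merged_clean.index.tolist())  # [0, 1, 2, 3]
```

The duplicate labels don’t affect the steps in this walkthrough, which work row by row, but they can be surprising if you later select rows with .loc.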

If you run df['label'].value_counts() you’ll see that we have 23,481 items in the positive class (i.e. fake news) and 21,417 in the negative class, so the classes are fairly balanced, but not perfectly so.

df['label'].value_counts()

1    23481
0    21417
Name: label, dtype: int64

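After getting exceptionally high levels of accuracy, I removed the subject column to double-check I wasn’t inadvertently introducing any information that could cause data leakage. However, the scores remained very high, suggesting its removal didn’t make much difference. The model therefore only sees the title and the body text, which we’ll join into a single all_text column, with a space in between so the last word of the title doesn’t fuse with the first word of the text.

df['all_text'] = df['title'] + ' ' + df['text']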
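
Dropping subject is one leakage check; another is to look for source markers inside the text itself. The sketch below is an assumption about how such a check might look, not part of the original analysis. It measures how often a candidate token appears in each class, using invented rows and a hypothetical marker, 'Reuters'. A token present in nearly every row of one class and almost none of the other would let a model separate the classes without learning much about truthfulness.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'text': ['WASHINGTON (Reuters) - Senate passes bill',
             'LONDON (Reuters) - Markets rally',
             'You will not believe what happened next',
             'Shocking video exposes everything'],
    'label': [0, 0, 1, 1],
})

def marker_rate_by_label(df, column, marker):
    """Return the fraction of rows per label whose text contains marker."""
    contains = df[column].str.contains(marker, regex=False)
    return contains.groupby(df['label']).mean()

rates = marker_rate_by_label(toy, 'text', 'Reuters')
print(rates)
```

If a check like this turns up a near-perfect marker on the real data, stripping it before vectorization gives a more honest picture of how well the model generalizes.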

Apply tokenization

To preprocess the text, the first step we need to take is called “tokenization”. As the name suggests, this converts each string of words into “tokens” representing individual elements, such as words. The NLTK package’s word_tokenize() function can do this for us.

nltk.download('punkt');

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

def tokenize(text):
    """Tokenizes a string and returns a list of alphabetic tokens.

    Args:
        text (str): Text from a single row (i.e. one value of df['all_text']).

    Returns:
        tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]

    """

    tokens = nltk.word_tokenize(text)
    # Keep purely alphabetic tokens; punctuation and numbers are dropped
    return [w for w in tokens if w.isalpha()]

After running the tokenize() function on each row via a lambda function, we can write the resulting lists of tokens back to our dataframe to use in the next steps, which operate on lists of tokens rather than regular strings.

df['tokenized'] = df.apply(lambda x: tokenize(x['all_text']), axis=1)
df[['title', 'tokenized']].head()

	title	tokenized
0	Donald Trump Sends Out Embarrassing New Year’...	[Donald, Trump, Sends, Out, Embarrassing, New,...
1	Drunk Bragging Trump Staffer Started Russian ...	[Drunk, Bragging, Trump, Staffer, Started, Rus...
2	Sheriff David Clarke Becomes An Internet Joke...	[Sheriff, David, Clarke, Becomes, An, Internet...
3	Trump Is So Obsessed He Even Has Obama’s Name...	[Trump, Is, So, Obsessed, He, Even, Has, Obama...
4	Pope Francis Just Called Out Donald Trump Dur...	[Pope, Francis, Just, Called, Out, Donald, Tru...
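
The isalpha() filter in tokenize() is worth a closer look, because it decides what survives tokenization. The small sketch below uses a hand-written token list as a stand-in for word_tokenize() output (the tokens are assumed for illustration) and shows that punctuation, numbers, and tokens containing apostrophes or hyphens are all dropped.

```python
# Hand-written stand-in for the output of nltk.word_tokenize()
tokens = ['Trump', 'ca', "n't", 'wish', 'Americans', 'a', 'happy',
          '2018', '!', 'well-known']

kept = [w for w in tokens if w.isalpha()]
dropped = [w for w in tokens if not w.isalpha()]

print(kept)     # ['Trump', 'ca', 'wish', 'Americans', 'a', 'happy']
print(dropped)  # ["n't", '2018', '!', 'well-known']
```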

Create punctuation features

One thing I noticed when checking out the headlines was that certain punctuation marks, such as exclamation marks and question marks, were common. As they have potential value as features, I held off from removing them and converted them to text-based features instead. Note that tokenize() drops any non-alphabetic tokens, so for these features to reach the model, this step needs to run on all_text before tokenization.

def punctuation_to_features(df, column):
    """Identify punctuation within a column and convert to a text representation.

    Args:
        df (object): Pandas dataframe.
        column (string): Name of column containing text.

    Returns:
        df[column]: Original column with punctuation converted to text,
                    i.e. "Wow!" > "Wow exclamation"

    """

    # Series.replace() only matches whole cell values, so use .str.replace()
    # for substrings; regex=False treats characters such as '?' literally.
    df[column] = df[column].str.replace('!', ' exclamation ', regex=False)
    df[column] = df[column].str.replace('?', ' question ', regex=False)
    df[column] = df[column].str.replace('\'', ' quotation ', regex=False)
    df[column] = df[column].str.replace('\"', ' quotation ', regex=False)

    return df[column]

df['all_text'] = punctuation_to_features(df, 'all_text')
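
When replacing characters inside strings with pandas, it helps to know that Series.replace() and Series.str.replace() behave differently: the former matches whole cell values by default, while the latter matches substrings. A quick check on a toy Series (the values are made up) makes the difference visible.

```python
import pandas as pd

s = pd.Series(['Wow!', '!'])

# Whole-value matching: only the cell that equals '!' changes
whole_value = s.replace('!', ' exclamation ')

# Substring matching: every '!' inside each string changes
substring = s.str.replace('!', ' exclamation ', regex=False)

print(whole_value.tolist())  # ['Wow!', ' exclamation ']
print(substring.tolist())    # ['Wow exclamation ', ' exclamation ']
```

Passing regex=False matters for characters like '?', which have a special meaning in regular expressions.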

Remove stopwords

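Lots of words that appear in typical sentences carry little information. These so-called “stopwords” can be removed by downloading a stopwords list from NLTK and then looping over the tokenized values in each row of the dataframe to leave only the non-stopwords. NLTK’s English list is lowercase and the comparison below is case-sensitive, so capitalized stopwords such as “Out” and “An” survive, as the output shows. Lowercasing each token before the check would catch those too.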

nltk.download('stopwords');

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

def remove_stopwords(tokenized_column):
    """Return a list of tokens with English stopwords removed.

    Args:
        tokenized_column (list): Tokens for a single row, as returned by tokenize().

    Returns:
        tokens (list): Tokenized list with stopwords removed.

    """
    stops = set(stopwords.words("english"))
    # Case-sensitive comparison against NLTK's lowercase stopword list
    return [word for word in tokenized_column if word not in stops]

df['stopwords_removed'] = df.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
df[['title', 'stopwords_removed']].head()

	title	stopwords_removed
0	Donald Trump Sends Out Embarrassing New Year’...	[Donald, Trump, Sends, Out, Embarrassing, New,...
1	Drunk Bragging Trump Staffer Started Russian ...	[Drunk, Bragging, Trump, Staffer, Started, Rus...
2	Sheriff David Clarke Becomes An Internet Joke...	[Sheriff, David, Clarke, Becomes, An, Internet...
3	Trump Is So Obsessed He Even Has Obama’s Name...	[Trump, Is, So, Obsessed, He, Even, Has, Obama...
4	Pope Francis Just Called Out Donald Trump Dur...	[Pope, Francis, Just, Called, Out, Donald, Tru...
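
To see how letter case affects stopword removal, here is a sketch with a tiny hand-written stopword set standing in for NLTK’s English list (the set below is an assumption for illustration; the real list is much longer).

```python
# Tiny stand-in for set(stopwords.words('english')); entries are lowercase
stops = {'out', 'an', 'is', 'so', 'he', 'the'}

tokens = ['Trump', 'Is', 'So', 'Obsessed', 'He', 'Even', 'the']

# Case-sensitive check: capitalized stopwords slip through
case_sensitive = [w for w in tokens if w not in stops]

# Lowercasing each token first catches them
case_insensitive = [w for w in tokens if w.lower() not in stops]

print(case_sensitive)    # ['Trump', 'Is', 'So', 'Obsessed', 'He', 'Even']
print(case_insensitive)  # ['Trump', 'Obsessed', 'Even']
```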

Apply stemming

Next, we’ll use another NLP technique called stemming. This reduces each word down to its “stem” or “root”, so “functioning”, “functioned”, and “functionally” all become “function”. This reduces the number of unique values for the model to deal with and generally improves results a little.

def apply_stemming(tokenized_column):
    """Return a list of tokens with Porter stemming applied.

    Args:
        tokenized_column (list): Tokens for a single row, with stopwords removed.

    Returns:
        tokens (list): Tokenized list with words Porter stemmed and lowercased.

    """

    stemmer = PorterStemmer()
    return [stemmer.stem(word).lower() for word in tokenized_column]

df['porter_stemmed'] = df.apply(lambda x: apply_stemming(x['stopwords_removed']), axis=1)
df[['title', 'porter_stemmed']].head()

	title	porter_stemmed
0	Donald Trump Sends Out Embarrassing New Year’...	[donald, trump, send, out, embarrass, new, yea...
1	Drunk Bragging Trump Staffer Started Russian ...	[drunk, brag, trump, staffer, start, russian, ...
2	Sheriff David Clarke Becomes An Internet Joke...	[sheriff, david, clark, becom, an, internet, j...
3	Trump Is So Obsessed He Even Has Obama’s Name...	[trump, is, so, obsess, he, even, ha, obama, n...
4	Pope Francis Just Called Out Donald Trump Dur...	[pope, franci, just, call, out, donald, trump,...
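
To get a feel for what the stemmer does to individual words, you can call it directly. This is a minimal sketch using the same PorterStemmer class as above; the word list is chosen for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# A few inflected forms plus two words from the first headline
words = ['functioning', 'functioned', 'functions', 'Sends', 'Embarrassing']
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, '->', stem)
```

The stems are not always dictionary words (“becom”, “franci”), which is fine for modeling, since the goal is only to map related forms to the same token.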

Rejoin words

Now that the preprocessing is complete, we need to rejoin the list of tokenized and stemmed words back into a single string, so the data are ready for the next step. I’ve used the join() function for this, which puts a single space between each pair of words.

def rejoin_words(tokenized_column):
    return " ".join(tokenized_column)

The preprocessed text is then written back to the all_text column so it’s ready for vectorization, a process which converts the text to a numeric form that can be used within our model.

df['all_text'] = df.apply(lambda x: rejoin_words(x['porter_stemmed']), axis=1)
df[['title', 'all_text']].head()

	title	all_text
0	Donald Trump Sends Out Embarrassing New Year’...	donald trump send out embarrass new year eve m...
1	Drunk Bragging Trump Staffer Started Russian ...	drunk brag trump staffer start russian collus ...
2	Sheriff David Clarke Becomes An Internet Joke...	sheriff david clark becom an internet joke for...
3	Trump Is So Obsessed He Even Has Obama’s Name...	trump is so obsess he even ha obama name code ...
4	Pope Francis Just Called Out Donald Trump Dur...	pope franci just call out donald trump dure hi...

Create training and test data

To start the modeling process, we’ll assign the all_text column to our X feature set and the label column to our target variable y. We’ll then use the scikit-learn train_test_split() function to divide this into a training and test set, with 30% of the data being held back for testing.

X = df['all_text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
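
The split above relies on shuffling alone to keep the class mix similar in both sets. If you want that guaranteed, train_test_split() also accepts a stratify argument. The sketch below is an optional variation on toy data, not what the original code does, and the 60/40 class mix is invented.

```python
from sklearn.model_selection import train_test_split

# Toy data: 60 negative and 40 positive labels
X_toy = [f'document {i}' for i in range(100)]
y_toy = [0] * 60 + [1] * 40

# stratify=y_toy keeps the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, shuffle=True, stratify=y_toy)

print(len(X_tr), len(X_te))
print(sum(y_tr) / len(y_tr))  # share of positives in the training set
print(sum(y_te) / len(y_te))  # share of positives in the test set
```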

Create a baseline model

To see how much work we have to do, we’ll first create a scikit-learn pipeline that takes our data and uses the TF-IDF vectorizer to turn it into a numeric form we can model. We’ll then pick a classification model more or less arbitrarily (I’ve used the Linear Support Vector Classifier, LinearSVC), fit it to the training data, and generate some predictions, which I’ve stored in y_pred.

bundled_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)
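
To make the vectorization step in the pipeline less abstract, here is a small sketch of what TfidfVectorizer produces on its own. The three documents are invented, stemmed-looking strings; the vectorizer is used with its default settings, as in the pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['trump send tweet', 'pope call out trump', 'senat pass budget']

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_           # maps each term to its column index
dense = matrix.toarray()

print(matrix.shape)                      # (documents, unique terms)
print(sorted(vocab))
print(np.linalg.norm(dense, axis=1))     # rows are L2-normalized by default

# 'trump' appears in two documents, 'send' in one, so within the first
# document 'send' receives the higher weight.
print(dense[0, vocab['trump']], dense[0, vocab['send']])
```

Each document becomes one row of weights, and terms that appear in many documents are down-weighted relative to rarer ones.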

Since this is now just a regular binary classification problem, we can assess its performance using common classification metrics, such as the accuracy score, F1 score, ROC/AUC score, and classification report. Initial results are excellent: we get 99.48% accuracy with a ROC/AUC of 0.9948. Bear in mind that the ROC/AUC here is computed from the hard 0/1 predictions in y_pred rather than from the classifier’s decision scores, so it is effectively the balanced accuracy.

print(classification_report(y_test, y_pred))
print('Accuracy:',accuracy_score(y_test, y_pred))
print('F1 score:',f1_score(y_test, y_pred))
print('ROC/AUC score:',roc_auc_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6417
           1       1.00      0.99      1.00      7053

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470

Accuracy: 0.9948032665181886
F1 score: 0.9950333475237689
ROC/AUC score: 0.9948338125408192
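
ROC/AUC is normally computed from continuous scores rather than hard class labels, and the two can give different numbers. The sketch below uses made-up labels and decision scores (not output from the model above) to show the gap.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]
scores = [-2.0, -1.0, 0.5, -0.5, 1.0, 2.0]  # made-up decision scores
hard = [1 if s > 0 else 0 for s in scores]   # thresholded 0/1 predictions

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)

print(auc_from_scores)  # 8 of 9 positive/negative pairs ranked correctly
print(auc_from_labels)  # equals the balanced accuracy of the hard predictions
```

For a pipeline ending in LinearSVC, continuous scores are available from the pipeline’s decision_function() method, which could be passed to roc_auc_score() in place of the predicted labels.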

Run model selection

With scores this high, we’re obviously not going to get any big further improvements during the model selection or tuning steps. To assess whether another un-tuned model was capable of beating the high score generated by the LinearSVC, I’ve created a dictionary of classifiers to test.

For each classifier, I’ve created a pipeline that runs the TF-IDF vectorizer and fits the model, then uses cross-validation to assess performance. The results, and timings, for each model are then written to a Pandas dataframe, so we can see which one works best.

classifiers = {}
classifiers.update({"DummyClassifier": DummyClassifier(strategy='most_frequent')})
classifiers.update({"LinearSVC": LinearSVC()})
classifiers.update({"MultinomialNB": MultinomialNB()})
classifiers.update({"XGBClassifier": XGBClassifier()})
classifiers.update({"LGBMClassifier": LGBMClassifier()})
classifiers.update({"RandomForestClassifier": RandomForestClassifier()})
classifiers.update({"DecisionTreeClassifier": DecisionTreeClassifier()})
classifiers.update({"ExtraTreeClassifier": ExtraTreeClassifier()})
classifiers.update({"AdaBoostClassifier": AdaBoostClassifier()})
classifiers.update({"KNeighborsClassifier": KNeighborsClassifier()})
classifiers.update({"RidgeClassifier": RidgeClassifier()})
classifiers.update({"SGDClassifier": SGDClassifier()})
classifiers.update({"BaggingClassifier": BaggingClassifier()})
classifiers.update({"BernoulliNB": BernoulliNB()})

rows = []

for key in classifiers:

    start_time = time.time()

    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifiers[key])])

    cv = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')

    # run_time is the elapsed time for all five folds, in minutes
    rows.append({'model': key,
                 'run_time': format(round((time.time() - start_time)/60, 2)),
                 'roc_auc': cv.mean(),
                 'roc_auc_std': cv.std()})

# DataFrame.append() was removed in pandas 2.0, so build the frame from a list of rows
df_models = pd.DataFrame(rows, columns=['model', 'run_time', 'roc_auc', 'roc_auc_std'])
df_models = df_models.sort_values(by='roc_auc', ascending=False)
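
If you want to try the model selection loop without waiting for the full dataset, the same pattern runs in a second or two on a toy corpus. The sketch below is a scaled-down illustration with invented documents and only two candidate models. Because the vectorizer sits inside the pipeline, TF-IDF is refitted on the training portion of each fold, so no vocabulary statistics leak from the held-out fold.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: 6 "fake" and 6 "real" documents
fake_docs = ['shocking hoax exposed', 'unbelievable hoax shocking',
             'exposed unbelievable hoax', 'shocking exposed unbelievable',
             'hoax shocking unbelievable exposed', 'exposed shocking hoax']
real_docs = ['senate budget report', 'minister budget senate',
             'report minister budget', 'senate report minister',
             'budget senate minister report', 'report senate budget']
X_toy = fake_docs + real_docs
y_toy = [1] * 6 + [0] * 6

candidates = {
    'DummyClassifier': DummyClassifier(strategy='most_frequent'),
    'LinearSVC': LinearSVC(),
}

rows = []
for name, clf in candidates.items():
    pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
    cv = cross_val_score(pipeline, X_toy, y_toy, cv=3, scoring='roc_auc')
    rows.append({'model': name, 'roc_auc': cv.mean(), 'roc_auc_std': cv.std()})

results = pd.DataFrame(rows).sort_values(by='roc_auc', ascending=False)
print(results)
```

The dummy model gives a floor to compare against: a classifier that always predicts the same class cannot rank the examples, so its ROC/AUC sits at 0.5.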

After roughly half an hour (the run times in the table below are in minutes), the model selection process had completed. This identified XGBoost as the top performing model, with a cross-validated ROC/AUC of 0.9997! It turns out that, on this dataset, distinguishing fake news from real news may not actually be so hard. We get 99.70% accuracy without tuning.

df_models.head(15)

	model	run_time	roc_auc	roc_auc_std
3	XGBClassifier	1.29	0.999738	0.000200
4	LGBMClassifier	1.81	0.999583	0.000326
1	LinearSVC	0.47	0.999399	0.000371
10	RidgeClassifier	0.64	0.999358	0.000360
5	RandomForestClassifier	3.28	0.999224	0.000405
8	AdaBoostClassifier	3.56	0.999195	0.000530
11	SGDClassifier	0.47	0.998795	0.000726
12	BaggingClassifier	10.73	0.997424	0.000873
6	DecisionTreeClassifier	2.5	0.991004	0.002819
13	BernoulliNB	0.46	0.980483	0.015148
2	MultinomialNB	0.45	0.975347	0.014388
9	KNeighborsClassifier	2.67	0.859701	0.079845
7	ExtraTreeClassifier	0.5	0.839716	0.055215
0	DummyClassifier	0.45	0.500000	0.000000

Matt Clarke, Friday, March 12, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.