How to preprocess text for NLP in four easy steps

Learn how to apply tokenization, stopword removal, Porter stemming, and re-joining to preprocess your NLP model inputs prior to vectorization.

Picture by Jason Leung, Unsplash.

Many data science projects involve a lot of repetition. In tasks that utilise Natural Language Processing (NLP), for example, you’ll almost always need to preprocess your text to remove misleading junk and noise before you can get the best results from your model.

In this project, we’ll go over four simple steps you can follow on an NLP project to cut the time it takes to preprocess your data and help your model perform well. Let’s get started.

Load the packages

First, open a Jupyter notebook and import Pandas, the Natural Language Toolkit (NLTK), the stopwords corpus and the PorterStemmer class. There are other ways to preprocess text, but these usually work well and are plenty to start with.

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 

Load the data

Next, load up your text data into a Pandas dataframe. You can use any data you like. I’m using a real vs. fake news dataset, as there’s loads of noise in it to clean up.

df_fake = pd.read_csv('Fake.csv')
df_true = pd.read_csv('True.csv')

df_fake['label'] = 1
df_true['label'] = 0

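# ignore_index=True gives the combined dataframe a fresh 0..n-1 index
df = pd.concat([df_fake, df_true], axis=0, ignore_index=True)
# Keep the first 100 rows so the demo runs quickly. The fake news rows come first,
# so this sample holds only label 1 rows if Fake.csv has 100 or more rows.
df = df.head(100)
df.head()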
   title                                             text                                               subject  date               label
0  Donald Trump Sends Out Embarrassing New Year’...  Donald Trump just couldn t wish all Americans ...  News     December 31, 2017  1
1  Drunk Bragging Trump Staffer Started Russian ...  House Intelligence Committee Chairman Devin Nu...  News     December 31, 2017  1
2  Sheriff David Clarke Becomes An Internet Joke...  On Friday, it was revealed that former Milwauk...  News     December 30, 2017  1
3  Trump Is So Obsessed He Even Has Obama’s Name...  On Christmas day, Donald Trump announced that ...  News     December 29, 2017  1
4  Pope Francis Just Called Out Donald Trump Dur...  Pope Francis used his annual Christmas Day mes...  News     December 25, 2017  1
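
Because head(100) takes rows from the top of the concatenated dataframe, a sample taken this way can end up containing only one class. As a hedged sketch of one alternative, the snippet below draws the same number of rows from each label and shuffles them. It uses tiny made-up dataframes in place of Fake.csv and True.csv, and the balanced_sample() helper and the random_state value are illustrative assumptions rather than part of the original project.

```python
import pandas as pd

def balanced_sample(df, label_col, n_per_class, random_state=0):
    """Return n_per_class randomly chosen rows for each label, shuffled together."""
    parts = [
        group.sample(n=n_per_class, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are interleaved, then rebuild a clean 0..n-1 index
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )

# Made-up stand-ins for the fake and true news dataframes
df_fake = pd.DataFrame({"title": [f"fake {i}" for i in range(10)], "label": 1})
df_true = pd.DataFrame({"title": [f"true {i}" for i in range(10)], "label": 0})
df_all = pd.concat([df_fake, df_true], ignore_index=True)

sample = balanced_sample(df_all, "label", n_per_class=3)
print(sample["label"].value_counts().to_dict())
```

Sampling per class keeps both labels in the working set, which matters if you later train a classifier on the preprocessed text.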

Once the data have been loaded, a common next step is to merge the individual text columns into a single column. This can throw away potentially useful features, but it’s fine on most projects.

# Join with spaces so the last word of one column doesn't fuse with the first word of the next
df['all_text'] = df['title'] + ' ' + df['subject'] + ' ' + df['text']
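
If any of the columns contain missing values, string concatenation in Pandas yields NaN for the whole row. The sketch below, which uses a small made-up dataframe rather than the news dataset, shows one way to guard against that by filling gaps with empty strings before joining. The combine_text_columns() helper name is hypothetical.

```python
import pandas as pd

def combine_text_columns(df, columns, sep=" "):
    """Join several text columns into one Series, treating missing values as empty strings."""
    # fillna("") stops a single missing value from turning the combined text into NaN
    filled = df[columns].fillna("").astype(str)
    # Join the values of each row across the chosen columns
    combined = filled.agg(sep.join, axis=1)
    # Collapse repeated whitespace left by empty cells and trim the ends
    return combined.str.replace(r"\s+", " ", regex=True).str.strip()

# Made-up example data (not from the news dataset)
demo = pd.DataFrame({
    "title": ["Breaking story", "Another headline"],
    "subject": ["News", None],
    "text": ["Something happened today.", "More details follow."],
})
demo["all_text"] = combine_text_columns(demo, ["title", "subject", "text"])
print(demo["all_text"].tolist())
```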

Step 1: Tokenization

Our first step is tokenization. This splits each long string of text into individual words, or “tokens”, and places them in a list, which is much easier for the later steps to manipulate. We’ll create a reusable function to handle this for us. It also uses isalpha() to drop any token that isn’t purely alphabetic, which removes punctuation and numbers.

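# Newer NLTK releases may also need nltk.download('punkt_tab')
nltk.download('punkt');
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

def tokenize(column):
    """Tokenizes a single string of text and returns a list of alphabetic tokens.

    Args:
        column (str): Text from one row of a Pandas dataframe column (e.g. one value of df['text']).

    Returns:
        tokens (list): Tokenized list, e.g. ['Donald', 'Trump', 'tweets']

    """

    # Split the text into word and punctuation tokens
    tokens = nltk.word_tokenize(column)
    # Keep only purely alphabetic tokens, dropping punctuation and numbers
    return [w for w in tokens if w.isalpha()]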

We can now use the Pandas apply() function with lambda to pass in the column containing our string of text, which is called all_text. The tokenize() function we created returns a list, which we’ll assign to a new column called tokenized.

df['tokenized'] = df.apply(lambda x: tokenize(x['all_text']), axis=1)
df[['title', 'tokenized']].head()
   title                                             tokenized
0  Donald Trump Sends Out Embarrassing New Year’...  [Donald, Trump, Sends, Out, Embarrassing, New,...
1  Drunk Bragging Trump Staffer Started Russian ...  [Drunk, Bragging, Trump, Staffer, Started, Rus...
2  Sheriff David Clarke Becomes An Internet Joke...  [Sheriff, David, Clarke, Becomes, An, Internet...
3  Trump Is So Obsessed He Even Has Obama’s Name...  [Trump, Is, So, Obsessed, He, Even, Has, Obama...
4  Pope Francis Just Called Out Donald Trump Dur...  [Pope, Francis, Just, Called, Out, Donald, Tru...
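
To see what the alphabetic filter does without downloading any NLTK data, here is a rough stand-in that uses a regular expression instead of nltk.word_tokenize(). It is not equivalent to NLTK’s tokenizer (contractions, for instance, are split differently), and the simple_tokenize() name is hypothetical, but it illustrates how punctuation and numbers drop out of the token list.

```python
import re

# Matches runs of letters only (no digits, underscores or punctuation)
WORD_RE = re.compile(r"[^\W\d_]+")

def simple_tokenize(text):
    """Return the alphabetic words in text, in order of appearance."""
    return WORD_RE.findall(text)

sample = "On Friday, 3 reporters tweeted: 'Breaking news!'"
print(simple_tokenize(sample))
```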

Step 2: Stopword removal

In the next step, we’re going to reduce the noise in our data by removing “stopwords”. These are very common, language-specific words, such as “the”, “is” and “and” in English, that add little to the meaning of a sentence. Removing them helps the model focus on the words that matter. First, download the stopwords using the command below.

nltk.download('stopwords');
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Next, we’ll make another function that we can re-use across our projects. This takes our tokenized column containing a list of words and then returns a list of words that are not in the set() of stopwords we loaded.

def remove_stopwords(tokenized_column):
    """Return a list of tokens with English stopwords removed.

    Args:
        tokenized_column (list): List of tokens for one row, as returned by tokenize().

    Returns:
        tokens (list): Tokenized list with stopwords removed.

    """
    # A set gives fast membership checks; NLTK's English stopword list is all lowercase
    stops = set(stopwords.words("english"))
    # Case-sensitive membership test
    return [word for word in tokenized_column if word not in stops]

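We can then use the same apply() and lambda approach to run the remove_stopwords() function on each row of the dataframe, writing the new list of words to a column called stopwords_removed. Note that NLTK’s English stopword list is all lowercase and our comparison is case-sensitive, so capitalised stopwords such as “Out”, “An” and “Is” survive in the output. Lowercasing each token before the check would remove them too.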

df['stopwords_removed'] = df.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
df[['title', 'stopwords_removed']].head()
   title                                             stopwords_removed
0  Donald Trump Sends Out Embarrassing New Year’...  [Donald, Trump, Sends, Out, Embarrassing, New,...
1  Drunk Bragging Trump Staffer Started Russian ...  [Drunk, Bragging, Trump, Staffer, Started, Rus...
2  Sheriff David Clarke Becomes An Internet Joke...  [Sheriff, David, Clarke, Becomes, An, Internet...
3  Trump Is So Obsessed He Even Has Obama’s Name...  [Trump, Is, So, Obsessed, He, Even, Has, Obama...
4  Pope Francis Just Called Out Donald Trump Dur...  [Pope, Francis, Just, Called, Out, Donald, Tru...
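
The output above still contains words such as “Out” and “Is”, because the stopword check is case-sensitive. One possible variation, sketched here on the assumption that you want to ignore case, lowercases each token before the check. The function name remove_stopwords_ignore_case() and the tiny hand-written stopword set are illustrative only; in the pipeline above you would pass in set(stopwords.words("english")) instead.

```python
def remove_stopwords_ignore_case(tokens, stops):
    """Return the tokens whose lowercase form is not in the stops set."""
    return [word for word in tokens if word.lower() not in stops]

# Tiny illustrative stopword set; NLTK's English list is much longer
demo_stops = {"out", "is", "so", "he", "an"}
demo_tokens = ["Trump", "Is", "So", "Obsessed", "He", "Even"]
print(remove_stopwords_ignore_case(demo_tokens, demo_stops))
```

The surviving tokens keep their original casing here. Whether to lowercase the whole text instead is a separate design choice that depends on whether case carries signal for your model.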

Step 3: Stemming

Another useful text preprocessing technique is “stemming”. This strips suffixes to reduce each word to a common stem, so “processing”, “processed”, and “processes” all become “process”. The stem isn’t always a real word (in the output below, “Becomes” turns into “becom”), but this is another way to remove noise and reduce the number of unique words in the text to help the model.

def apply_stemming(tokenized_column):
    """Return a list of tokens with Porter stemming applied.

    Args:
        tokenized_column (list): List of tokens for one row, with stopwords removed.

    Returns:
        tokens (list): Tokenized list with words Porter stemmed.

    """

    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokenized_column]

There are various stemming algorithms available, but Porter stemming is one of the most widely used and generally works well. We can run our apply_stemming() function in the same way as the other two functions. We’ll pass in the stopwords_removed column from the previous step and generate a new column called porter_stemmed.

df['porter_stemmed'] = df.apply(lambda x: apply_stemming(x['stopwords_removed']), axis=1)
df[['title', 'porter_stemmed']].head()
   title                                             porter_stemmed
0  Donald Trump Sends Out Embarrassing New Year’...  [donald, trump, send, out, embarrass, new, yea...
1  Drunk Bragging Trump Staffer Started Russian ...  [drunk, brag, trump, staffer, start, russian, ...
2  Sheriff David Clarke Becomes An Internet Joke...  [sheriff, david, clark, becom, An, internet, j...
3  Trump Is So Obsessed He Even Has Obama’s Name...  [trump, Is, So, obsess, He, even, ha, obama, n...
4  Pope Francis Just Called Out Donald Trump Dur...  [pope, franci, just, call, out, donald, trump,...
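
As a quick check of how the Porter stemmer behaves on individual words, the sketch below stems a few made-up examples. It also creates the PorterStemmer once and reuses it, rather than building a new one for every row as apply_stemming() does. That is an optional tweak, and the stem_tokens() name is hypothetical.

```python
from nltk.stem.porter import PorterStemmer

# Create the stemmer once and reuse it for every call
stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Return the Porter stem of each token in the list."""
    return [stemmer.stem(token) for token in tokens]

words = ["processing", "processed", "processes"]
for word, stem in zip(words, stem_tokens(words)):
    print(word, "->", stem)
```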

Step 4: Rejoin words

The final step is to take our last list of tokens and rejoin them into a string so we can pass it to a vectorizer. This is dead easy. We just pass in the tokenized_column (which will be porter_stemmed for us) and use join() to join the words into a string, placing a space between each word.

def rejoin_words(tokenized_column):
    """Rejoins a tokenized word list into a single string. 
    
    Args:
        tokenized_column (list): Tokenized column of words. 
        
    Returns:
        string: Single string of untokenized words. 
    """
    
    return " ".join(tokenized_column)

df['rejoined'] = df.apply(lambda x: rejoin_words(x['porter_stemmed']), axis=1)
df[['title', 'rejoined']].head()
   title                                             rejoined
0  Donald Trump Sends Out Embarrassing New Year’...  donald trump send out embarrass new year eve m...
1  Drunk Bragging Trump Staffer Started Russian ...  drunk brag trump staffer start russian collus ...
2  Sheriff David Clarke Becomes An Internet Joke...  sheriff david clark becom An internet joke for...
3  Trump Is So Obsessed He Even Has Obama’s Name...  trump Is So obsess He even ha obama name code ...
4  Pope Francis Just Called Out Donald Trump Dur...  pope franci just call out donald trump dure hi...
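
The article doesn’t name a specific vectorizer, so as a rough, dependency-free illustration of why a space-separated string is a convenient input, the toy bag-of-words below splits each rejoined string on whitespace and counts the words. The bag_of_words() function and the example strings are made up; a real project would more likely use a library vectorizer.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, rows), where each row holds word counts in vocabulary order."""
    # Count the words in each document by splitting on whitespace
    counts = [Counter(doc.split()) for doc in documents]
    # The vocabulary is every distinct word, sorted for a stable column order
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Made-up strings in the style of the rejoined column
docs = ["trump send new year messag", "pope call out trump trump"]
vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
```

Each row of the matrix lines up with the sorted vocabulary, so stemmed variants of a word share one column instead of being spread across several.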

And that’s it. Follow these four steps in order before you vectorize your text and pass it into an NLP model, and you should often see better results.

Matt Clarke, Friday, March 12, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.