How to use CountVectorizer for n-gram analysis

Learn how to use scikit-learn CountVectorizer for n-gram analysis and analyse unigrams, bigrams, and trigrams in text.


CountVectorizer is a scikit-learn class that uses count vectorization to convert a collection of text documents to a matrix of token counts. Given a corpus of text documents, such as web pages or product descriptions, CountVectorizer returns a matrix of the number of occurrences of each word or phrase, helping you identify common text patterns in the documents.

In this example, we’ll use CountVectorizer to perform some basic n-gram analysis on product descriptions stored in a Pandas dataframe. N-grams (also called Q-grams or shingles) are single or multi-word phrases found within documents. They can reveal the underlying topics in a corpus to help data scientists understand the text, or be used as features in NLP models, such as text classifiers.
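To see what n-grams actually look like before fitting anything, you can use CountVectorizer’s build_analyzer() method, which returns the tokenising function the vectorizer applies internally. The sentence below is just an illustrative fragment, not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() exposes the tokenisation and n-gram step that
# CountVectorizer runs internally on each document
analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

# With ngram_range=(1, 2) we get both unigrams and bigrams
ngrams = analyzer("whey protein isolate")
print(ngrams)
# ['whey', 'protein', 'isolate', 'whey protein', 'protein isolate']
```

This makes it easy to experiment with different ngram_range values before building a full matrix.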

Load the packages

Open a Jupyter notebook and load the packages below. We will use the scikit-learn CountVectorizer package to create the matrix of token counts and Pandas to load and view the data.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

Load the data

Next, we’ll load a simple dataset containing some text data. I’ve used a small ecommerce dataset consisting of some product descriptions of sports nutrition products. You can load the same data by importing the CSV data from my GitHub repository.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/gonutrition.csv')
df.head()
              product_name                                product_description
0  Whey Protein Isolate 90  What is Whey Protein Isolate? Whey Protein Iso...
1          Whey Protein 80  What is Whey Protein 80? Whey Protein 80 is an...
2         Volt Preworkout™  What is Volt™? Our Volt pre workout formula in...

Fit the CountVectorizer

To understand a little about how CountVectorizer works, we’ll fit the model to a column of our data. CountVectorizer will tokenize the data and split it into chunks called n-grams, whose length we define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams or 1-grams, such as “whey” and “protein”, while (2, 2) would give us bigrams or 2-grams, such as “whey protein”.

To use CountVectorizer we’ll first instantiate the class and pass in our arguments, then we’ll call fit_transform(), which runs the fit() function on the data and then the transform() function. This creates a vocabulary of n-grams from the documents and encodes each document as a vector of token counts.
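Before applying this to the real data, it can help to watch fit_transform() work on a toy corpus. The two short strings below are illustrative examples, not rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny illustrative "documents"
docs = ["whey protein isolate", "whey protein concentrate"]

vectorizer = CountVectorizer(ngram_range=(1, 1))
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# fit() built a vocabulary mapping each token to a column index;
# the columns are ordered alphabetically
print(sorted(vectorizer.vocabulary_))
# ['concentrate', 'isolate', 'protein', 'whey']

# transform() encoded each document as a row of token counts
print(counts.toarray())
# [[0 1 1 1]
#  [1 0 1 1]]
```

Each row is one document and each column one vocabulary item, which is exactly the structure we’ll get from the product descriptions below, just at a larger scale.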

We’ll then use toarray() to convert the sparse matrix to an array and feed the data into Pandas. In the output dataframe, each row represents one of the “documents” or product descriptions from the original dataset and each column represents one of the n-grams in the vocabulary; we transpose it before printing so the n-grams appear as rows. Printing the shape of the dataframe shows that CountVectorizer found 754 unigrams in the vocabulary. Note that get_feature_names_out() replaced the older get_feature_names() method, which was removed in scikit-learn 1.2.

text = df['product_description']
model = CountVectorizer(ngram_range = (1, 1))
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)
        0  1  2
would   0  0  1
you    10  7  7
your    8  6  7
zinc    0  0  1
zma     0  0  1

df_output.shape
(3, 754)

Remove stop words

One thing you’ll notice from the data above is that some of the words in the vocabulary have little analytical value, such as “would”, “you”, or “your”. These so-called “stop words” can safely be removed from the data.

Stop words generally contribute little meaning and can massively bloat the size of your dataset, which increases model training times and causes various other issues. As a result, it’s common practice to remove them.

If you want to, you can pass a specific list of words to remove directly to the stop_words argument, but the easiest approach is to pass stop_words='english', which uses scikit-learn’s built-in English stop word list.
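As a sketch of the custom-list approach, you can extend scikit-learn’s built-in list with your own domain-specific terms. The extra terms below ('gn' and 'whey') are purely illustrative additions for this dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# The built-in English list: just pass the string 'english'
model = CountVectorizer(ngram_range=(1, 1), stop_words='english')

# Or supply your own list, e.g. the built-in set plus some
# domain-specific terms (illustrative additions, not required)
custom_stop_words = list(ENGLISH_STOP_WORDS) + ['gn', 'whey']
model_custom = CountVectorizer(ngram_range=(1, 1), stop_words=custom_stop_words)
```

Extending the built-in list like this is useful when brand names or boilerplate terms dominate the counts without telling you anything new.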

If you re-run the code, you’ll see that the output now excludes these less insightful words from the vocabulary. Printing the shape of the output dataframe reveals that the number of unigrams in the dictionary has now dropped from 754 to 630.

text = df['product_description']
model = CountVectorizer(ngram_range = (1, 1), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)
         0  1   2
working  0  0   3
workout  2  4  18
world    0  2   0
zinc     0  0   1
zma      0  0   1

df_output.shape
(3, 630)

Increase the n-gram range

The other thing you’ll want to do is adjust the ngram_range argument. In the simple example above, we set it to (1, 1) to return unigrams or single words. Increasing the ngram_range expands the vocabulary from single words to short phrases of your desired lengths. For example, setting the ngram_range to (2, 2) will return bigrams (2-grams) or two-word phrases. Printing the shape reveals that the vocabulary size has now increased from 630 to 1200, even with stop words removed.

text = df['product_description']
model = CountVectorizer(ngram_range = (2, 2), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)
                0  1  2
workout volt    0  0  1
world milk      0  1  0
world renowned  0  1  0
zinc magnesium  0  0  1
zma zinc        0  0  1

df_output.shape
(3, 1200)

You can also select n-grams of multiple sizes at once by setting an ngram_range with different start and end values. For example, setting the range to (1, 5) will return n-grams containing one, two, three, four, and five words after stop words have been removed. As you’d imagine, this adds some potentially useful phrases to the vocabulary, as well as some nonsense, and increases the vocabulary size to 6091.

text = df['product_description']
model = CountVectorizer(ngram_range = (1, 5), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)
                                     0  1  2
zma                                  0  0  1
zma zinc                             0  0  1
zma zinc magnesium                   0  0  1
zma zinc magnesium recovery          0  0  1
zma zinc magnesium recovery formula  0  0  1

df_output.shape
(3, 6091)

Setting max_features

CountVectorizer includes a very useful optional argument called max_features that controls the size of the vocabulary created, so it includes only the most commonly encountered terms, based on their term frequency across documents within the corpus. To see how this works, let’s look at the default setting first. With no max_features value, we generate a vocabulary of 6091 items, as the whole corpus is used.

text = df['product_description']
model = CountVectorizer(ngram_range = (1, 5), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)
                                     0  1  2
zma                                  0  0  1
zma zinc                             0  0  1
zma zinc magnesium                   0  0  1
zma zinc magnesium recovery          0  0  1
zma zinc magnesium recovery formula  0  0  1

df_output.shape
(3, 6091)

Setting max_features to 100 limits the vocabulary size generated to the top 100 n-grams, as shown in the shape of the output dataframe. Whether you apply this optional argument really depends on your aims. It can cut out a lot of junk data, but it will also remove potentially insightful rarer words that are unique to certain documents in the corpus.

text = df['product_description']
model = CountVectorizer(ngram_range = (1, 5), max_features = 100, stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)
                           0  1   2
whey protein concentrate   4  3   0
whey protein isolate      11  0   0
work                       0  2   3
workout                    2  4  18
workout formula            0  0   6

df_output.shape
(3, 100)

Create a function to get n-grams

Finally, we’ll create a reusable function to perform n-gram analysis on a Pandas dataframe column. This will use CountVectorizer to create a matrix of the token counts found in our text. We’ll use the ngram_range parameter to specify the size of n-grams we want, so (1, 1) would give us unigrams (one-word n-grams) and (1, 3) would give us n-grams from one to three words.

We’ll use the stop_words parameter to specify the stop words we want to remove. We’ll also pass in the optional max_features value and will set this to a high number so it includes most of the words in the corpus by default.

def get_ngrams(text, ngram_from=2, ngram_to=2, n=None, max_features=20000):
    """Return the n most frequent n-grams in a series of documents."""
    # Fit the vectorizer to build the n-gram vocabulary
    vec = CountVectorizer(ngram_range=(ngram_from, ngram_to),
                          max_features=max_features,
                          stop_words='english').fit(text)
    # Create the document-term matrix and sum the counts across documents
    bag_of_words = vec.transform(text)
    sum_words = bag_of_words.sum(axis=0)
    # Pair each n-gram with its total frequency, sorted in descending order
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    return words_freq[:n]

Get unigrams or 1-grams

unigrams = get_ngrams(df['product_description'], ngram_from=1, ngram_to=1, n=15)
unigrams_df = pd.DataFrame(unigrams)
unigrams_df.columns=["Unigram", "Frequency"]
unigrams_df.head()
   Unigram  Frequency
0  protein         80
1     whey         53
2  workout         24
3     volt         24
4       gn         23

Get bigrams or 2-grams

bigrams = get_ngrams(df['product_description'], ngram_from=2, ngram_to=2, n=15)
bigrams_df = pd.DataFrame(bigrams)
bigrams_df.columns=["Bigram", "Frequency"]
bigrams_df.head()
            Bigram  Frequency
0     whey protein         45
1      pre workout         14
2  protein isolate         11
3          gn whey         11
4       protein 80         11

Get trigrams or 3-grams

trigrams = get_ngrams(df['product_description'], ngram_from=3, ngram_to=3, n=15)
trigrams_df = pd.DataFrame(trigrams)
trigrams_df.columns=["Trigram", "Frequency"]
trigrams_df.head()
                    Trigram  Frequency
0      whey protein isolate         11
1           whey protein 80         11
2  whey protein concentrate          7
3           gn whey protein          6
4       pre workout formula          6

Get quadgrams or 4-grams

quadgrams = get_ngrams(df['product_description'], ngram_from=4, ngram_to=4, n=15)
quadgrams_df = pd.DataFrame(quadgrams)
quadgrams_df.columns=["Quadgram", "Frequency"]
quadgrams_df.head()
                   Quadgram  Frequency
0        gn whey protein 80          6
1        gn whey isolate 90          4
2  volt pre workout formula          4
3      free range grass fed          3
4      range grass fed cows          3

Get 5-grams

fivegrams = get_ngrams(df['product_description'], ngram_from=5, ngram_to=5, n=15)
fivegrams_df = pd.DataFrame(fivegrams)
fivegrams_df.columns=["Five-gram", "Frequency"]
fivegrams_df.head()

                               Five-gram  Frequency
0              free range grass fed cows          3
1  whey protein concentrate whey protein          3
2      whey protein isolate whey protein          2
3   protein isolate whey protein isolate          2
4        isolate whey protein isolate 90          2

Matt Clarke, Friday, December 24, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
