CountVectorizer is a scikit-learn class that converts a collection of text documents to a matrix of token counts. Given a corpus of text documents, such as web pages or product descriptions, CountVectorizer returns a matrix of the number of occurrences of each word or phrase, helping you identify common text patterns in the documents.
In this example, we’ll use CountVectorizer to perform some basic n-gram analysis on product descriptions stored in a Pandas dataframe. N-grams (also called Q-grams or shingles) are single- or multi-word phrases found within documents. They can reveal a document’s underlying topic to help data scientists understand it, or be used as features in NLP models, such as text classifiers.
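Before reaching for scikit-learn, the idea itself is easy to sketch in plain Python. This is a minimal illustration only, using a hypothetical three-word sentence and simple whitespace tokenisation:

```python
# A minimal sketch of the n-gram concept, assuming whitespace tokenisation.
sentence = "whey protein isolate"
tokens = sentence.split()

def ngrams(tokens, n):
    """Return all contiguous n-word phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(tokens, 1))  # ['whey', 'protein', 'isolate']
print(ngrams(tokens, 2))  # ['whey protein', 'protein isolate']
```

CountVectorizer does the same thing at scale, with smarter tokenisation, lowercasing, and counting built in.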
Open a Jupyter notebook and load the packages below. We will use the scikit-learn CountVectorizer package to create the matrix of token counts and Pandas to load and view the data.
```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
```
Next, we’ll load a simple dataset containing some text data. I’ve used a small ecommerce dataset consisting of some product descriptions of sports nutrition products. You can load the same data by importing the CSV data from my GitHub repository.
```python
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/gonutrition.csv')
df.head()
```
| | product_name | product_description |
| --- | --- | --- |
| 0 | Whey Protein Isolate 90 | What is Whey Protein Isolate? Whey Protein Iso... |
| 1 | Whey Protein 80 | What is Whey Protein 80? Whey Protein 80 is an... |
| 2 | Volt Preworkout™ | What is Volt™? Our Volt pre workout formula in... |
To understand a little about how CountVectorizer works, we’ll fit the model to a column of our data. CountVectorizer will tokenize the data and split it into chunks called n-grams, whose length we can define by passing a tuple to the `ngram_range` argument. For example, `(1, 1)` would give us unigrams (1-grams) such as “whey” and “protein”, while `(2, 2)` would give us bigrams (2-grams), such as “whey protein”.
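You can inspect the effect of `ngram_range` without fitting a full model: CountVectorizer’s `build_analyzer()` method returns the internal tokeniser and n-gram generator, which we can call on a single hypothetical document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenisation + n-gram step
# CountVectorizer uses internally, so we can inspect the n-grams directly.
doc = "whey protein isolate"  # hypothetical one-line document

unigrams = CountVectorizer(ngram_range=(1, 1)).build_analyzer()(doc)
bigrams = CountVectorizer(ngram_range=(2, 2)).build_analyzer()(doc)

print(unigrams)  # ['whey', 'protein', 'isolate']
print(bigrams)   # ['whey protein', 'protein isolate']
```

Note that the default tokeniser lowercases text and only keeps tokens of two or more word characters.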
To use CountVectorizer we’ll first instantiate the class and pass in our arguments, then call `fit_transform()`, which runs `fit()` on the data followed by `transform()`. This builds a vocabulary of n-grams from the documents and encodes each document as a vector of counts.
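As a quick sanity check, the one-step and two-step forms produce identical matrices. A minimal sketch on a hypothetical two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein", "protein shake"]  # hypothetical two-document corpus

# fit_transform() in one step...
vec = CountVectorizer()
X1 = vec.fit_transform(docs)

# ...is equivalent to fit() followed by transform()
X2 = CountVectorizer().fit(docs).transform(docs)

print((X1.toarray() == X2.toarray()).all())  # True
```

The two-step form is useful when you want to fit the vocabulary on one set of documents and apply it to another.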
We’ll then use `toarray()` to convert the sparse matrix to an array and feed the data into Pandas. In the output dataframe, each row represents one of the “documents” (the product descriptions from the original dataset) and each column one of the n-grams in the vocabulary; below we transpose it with `.T` so the n-grams appear as rows. Printing the `shape` of the dataframe shows it found 754 items in the vocabulary of unigrams.
```python
text = df['product_description']

model = CountVectorizer(ngram_range=(1, 1))
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
```

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| would | 0 | 0 | 1 |
| you | 10 | 7 | 7 |
| your | 8 | 6 | 7 |
| zinc | 0 | 0 | 1 |
| zma | 0 | 0 | 1 |

```python
df_output.shape
```

```
(3, 754)
```
One thing you’ll notice from the data above is that some of the words in the vocabulary of unique n-grams have little value, such as “would”, “you”, or “your”. These are so-called “stop words” and can safely be removed from the data.
Stop words generally don’t contribute very much and can massively bloat the size of your dataset, which increases model training times and causes various other issues. As a result, it’s a common practice to remove stop words.
If you want to, you can pass a specific list of words to remove directly to the `stop_words` argument, but the easiest approach is to do this automatically by passing `stop_words='english'`, which uses scikit-learn’s built-in English stop word list.
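The built-in English list is importable as `ENGLISH_STOP_WORDS`, a frozenset, so you can check what it covers; the low-value words we saw in the earlier output are all in it:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# The built-in English stop word list is a frozenset of common words.
# The low-value terms from the earlier output are all members of it.
print('would' in ENGLISH_STOP_WORDS)    # True
print('you' in ENGLISH_STOP_WORDS)      # True
print('protein' in ENGLISH_STOP_WORDS)  # False
```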
If you re-run the code, you’ll see that the output now excludes these less insightful words from the vocabulary. Printing the `shape` of the output dataframe reveals that the number of unigrams in the vocabulary has dropped from 754 to 630.
```python
text = df['product_description']

model = CountVectorizer(ngram_range=(1, 1), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
```

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| working | 0 | 0 | 3 |
| workout | 2 | 4 | 18 |
| world | 0 | 2 | 0 |
| zinc | 0 | 0 | 1 |
| zma | 0 | 0 | 1 |

```python
df_output.shape
```

```
(3, 630)
```
The other thing you’ll want to do is adjust the `ngram_range` argument. In the simple example above, we set it to `(1, 1)` to return unigrams, or single words. Increasing the `ngram_range` expands the vocabulary from single words to phrases of your desired lengths. For example, setting `ngram_range` to `(2, 2)` returns bigrams (2-grams), or two-word phrases. Printing the `shape` reveals that the vocabulary size has increased from 630 to 1200, even with stop words removed.
```python
text = df['product_description']

model = CountVectorizer(ngram_range=(2, 2), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
```

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| workout volt | 0 | 0 | 1 |
| world milk | 0 | 1 | 0 |
| world renowned | 0 | 1 | 0 |
| zinc magnesium | 0 | 0 | 1 |
| zma zinc | 0 | 0 | 1 |

```python
df_output.shape
```

```
(3, 1200)
```
You can also select n-grams of multiple sizes at once by setting an unequal `ngram_range`. For example, setting the range to `(1, 5)` returns n-grams containing one, two, three, four, and five words after stop words have been removed. As you’d imagine, this adds some potentially useful phrases to the vocabulary, as well as some nonsense, and increases the vocabulary size to 6091.
```python
text = df['product_description']

model = CountVectorizer(ngram_range=(1, 5), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
```

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| zma | 0 | 0 | 1 |
| zma zinc | 0 | 0 | 1 |
| zma zinc magnesium | 0 | 0 | 1 |
| zma zinc magnesium recovery | 0 | 0 | 1 |
| zma zinc magnesium recovery formula | 0 | 0 | 1 |

```python
df_output.shape
```

```
(3, 6091)
```
CountVectorizer includes a very useful optional argument called `max_features` that controls the size of the vocabulary so it includes only the most commonly encountered terms, ranked by their frequency across the documents in the corpus. To see how this works, let’s look at the default setting. With no `max_features` value, we generate a vocabulary of 6091 items, as the whole corpus is used.
```python
text = df['product_description']

model = CountVectorizer(ngram_range=(1, 5), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
```

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| zma | 0 | 0 | 1 |
| zma zinc | 0 | 0 | 1 |
| zma zinc magnesium | 0 | 0 | 1 |
| zma zinc magnesium recovery | 0 | 0 | 1 |
| zma zinc magnesium recovery formula | 0 | 0 | 1 |

```python
df_output.shape
```

```
(3, 6091)
```
Setting `max_features` to 100 limits the vocabulary to the top 100 n-grams, as shown in the `shape` of the output dataframe. Whether you apply this optional argument really depends on your aims: it can cut out a lot of junk data, but it will also remove potentially insightful rarer words that are unique to certain documents in the corpus.
```python
text = df['product_description']

model = CountVectorizer(ngram_range=(1, 5), max_features=100, stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
```

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| whey protein concentrate | 4 | 3 | 0 |
| whey protein isolate | 11 | 0 | 0 |
| work | 0 | 2 | 3 |
| workout | 2 | 4 | 18 |
| workout formula | 0 | 0 | 6 |

```python
df_output.shape
```

```
(3, 100)
```
Finally, we’ll create a reusable function to perform n-gram analysis on a Pandas dataframe column. This will use CountVectorizer to create a matrix of token counts found in our text. We’ll use the `ngram_range` parameter to specify the size of n-grams we want, so `(1, 1)` would give us unigrams (one-word n-grams) and `(1, 3)` would give us n-grams of one to three words.
We’ll use the `stop_words` parameter to specify the stop words to remove. We’ll also pass in the optional `max_features` value, set to a high number so the function includes most of the words in the corpus by default.
```python
def get_ngrams(text, ngram_from=2, ngram_to=2, n=None, max_features=20000):
    """Return the n most frequent n-grams found in a text column."""
    # Build the n-gram vocabulary from the text
    vec = CountVectorizer(ngram_range=(ngram_from, ngram_to),
                          max_features=max_features,
                          stop_words='english').fit(text)
    bag_of_words = vec.transform(text)
    # Sum the counts across all documents to get corpus-wide frequencies
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    # Sort by frequency, descending, and return the top n
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
```
```python
unigrams = get_ngrams(df['product_description'], ngram_from=1, ngram_to=1, n=15)
unigrams_df = pd.DataFrame(unigrams)
unigrams_df.columns = ["Unigram", "Frequency"]
unigrams_df.head()
```

| | Unigram | Frequency |
| --- | --- | --- |
| 0 | protein | 80 |
| 1 | whey | 53 |
| 2 | workout | 24 |
| 3 | volt | 24 |
| 4 | gn | 23 |
```python
bigrams = get_ngrams(df['product_description'], ngram_from=2, ngram_to=2, n=15)
bigrams_df = pd.DataFrame(bigrams)
bigrams_df.columns = ["Bigram", "Frequency"]
bigrams_df.head()
```

| | Bigram | Frequency |
| --- | --- | --- |
| 0 | whey protein | 45 |
| 1 | pre workout | 14 |
| 2 | protein isolate | 11 |
| 3 | gn whey | 11 |
| 4 | protein 80 | 11 |
```python
trigrams = get_ngrams(df['product_description'], ngram_from=3, ngram_to=3, n=15)
trigrams_df = pd.DataFrame(trigrams)
trigrams_df.columns = ["Trigram", "Frequency"]
trigrams_df.head()
```

| | Trigram | Frequency |
| --- | --- | --- |
| 0 | whey protein isolate | 11 |
| 1 | whey protein 80 | 11 |
| 2 | whey protein concentrate | 7 |
| 3 | gn whey protein | 6 |
| 4 | pre workout formula | 6 |
```python
quadgrams = get_ngrams(df['product_description'], ngram_from=4, ngram_to=4, n=15)
quadgrams_df = pd.DataFrame(quadgrams)
quadgrams_df.columns = ["Quadgram", "Frequency"]
quadgrams_df.head()
```

| | Quadgram | Frequency |
| --- | --- | --- |
| 0 | gn whey protein 80 | 6 |
| 1 | gn whey isolate 90 | 4 |
| 2 | volt pre workout formula | 4 |
| 3 | free range grass fed | 3 |
| 4 | range grass fed cows | 3 |
```python
pentagrams = get_ngrams(df['product_description'], ngram_from=5, ngram_to=5, n=15)
pentagrams_df = pd.DataFrame(pentagrams)
pentagrams_df.columns = ["Pentagram", "Frequency"]
pentagrams_df.head()
```

| | Pentagram | Frequency |
| --- | --- | --- |
| 0 | free range grass fed cows | 3 |
| 1 | whey protein concentrate whey protein | 3 |
| 2 | whey protein isolate whey protein | 2 |
| 3 | protein isolate whey protein isolate | 2 |
| 4 | isolate whey protein isolate 90 | 2 |
Matt Clarke, Friday, December 24, 2021