How to create a Naive Bayes product classification model

Learn how to use NLP techniques to create a Multinomial Naive Bayes sklearn product classification model to automatically assign products to the right category.


Assigning products to the right categories is crucial to allowing customers to find what they’re looking for, so product classification models are commonly used by online marketplaces to ensure that products are assigned to the right product categories when listed by third parties.

Product classifiers are also really useful for competitor analysis projects, since they allow you to compare the products sold per category across retailers, through mapping them all to a single information architecture or category tree.

Product classification models typically take the name of the product, which differs across retailers, and predict which product category it should be assigned to, based on a set of labelled training data. In this project, we’ll use a Multinomial Naive Bayes model and apply Natural Language Processing (NLP) techniques to predict product categories from product names.

Load the packages

For this project we’ll need to load Pandas, Numpy and a range of Scikit-Learn packages, including CountVectorizer for turning our text data into a numeric form, plus the MultinomialNB model and some packages for assessing model performance.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

Load the data

The dataset I’m using is a PriceRunner dataset which is ideal for product classification problems. It includes 35,311 product titles as listed by various online retailers, which map to 12,849 standardised product names, each assigned to one of 10 product categories.

The names the vendors have used for the products are all slightly different. Our aim is to predict which category_label a product will have from its product name.

df = pd.read_csv('pricerunner_aggregate.csv', 
                names=['product_id','product_title','vendor_id','cluster_id',
                       'cluster_label','category_id','category_label'])
df.head()
   product_id  product_title                                      vendor_id  cluster_id  cluster_label             category_id  category_label
0  1           apple iphone 8 plus 64gb silver                    1          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
1  2           apple iphone 8 plus 64 gb spacegrau                2          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
2  3           apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...  3          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
3  4           apple iphone 8 plus 64gb space grey                4          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
4  5           apple iphone 8 plus gold 5.5 64gb 4g unlocked ...  5          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones

Printing the value_counts() of the category_label column we’re trying to predict shows that we have 10 different classes present, each of which has 2212 to 5501 product titles in the dataset. This is good because we have plenty of data from which to make our predictions.

df.category_label.value_counts().to_frame()
                  category_label
Fridge Freezers             5501
Mobile Phones               4081
Washing Machines            4044
CPUs                        3862
Fridges                     3584
TVs                         3564
Dishwashers                 3424
Digital Cameras             2697
Microwaves                  2342
Freezers                    2212

Preprocess the data

Before the model can classify text we need to transform it into a numeric form. There are various steps that can be used here, such as the removal of stopwords, lemmatization, Porter stemming, and the use of different algorithms, such as Term-Frequency Inverse Document Frequency (TF-IDF).

However, the basic CountVectorizer approach gives good results out of the box. This builds a vocabulary of all the words in the product titles and then represents each title as a row of counts, with one column per word, creating the bag of words matrix required by the model. CountVectorizer returns a sparse matrix, which the code below converts to a dense array. MultinomialNB also accepts sparse input, so this conversion is optional and can be skipped to save memory on larger datasets.

count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['product_title'])
bow = np.array(bow.todense())
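
To see what the bag of words looks like, here's a minimal sketch on three made-up product titles. The titles and the toy_ variable names are purely illustrative and aren't part of the PriceRunner data. Note that CountVectorizer's default tokenizer only keeps tokens of two or more characters, so the lone "8" in the titles doesn't make it into the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up titles, for illustration only
toy_titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus gold",
    "bosch fridge freezer silver",
]

toy_vec = CountVectorizer()
toy_bow = toy_vec.fit_transform(toy_titles)  # sparse matrix: one row per title

# vocabulary_ maps each token to its column index in the matrix
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

print(vocab)              # the column order of the matrix
print(toy_bow.toarray())  # per-title counts for each token
```

Each row is one title and each column is one word from the vocabulary, so the matrix for the real dataset has 35,311 rows.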

Create test and train data

To create the test and train data we’ll assign the bag of words to X as our feature set, and then pass the category_label class column to y. Passing this into the train_test_split() function will create our test and train datasets. We’re assigning 30% of the data to the test group and using the stratify argument so that each category appears in the same proportion in both the train and test sets.

X = bow
y = df['category_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
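
To illustrate what stratify does, here's a small sketch using made-up, imbalanced labels rather than the real data. The toy_ names are hypothetical, and random_state is only added here to make the sketch repeatable. It isn't used in the split above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 70% "A" and 30% "B"
toy_y = ["A"] * 70 + ["B"] * 30
toy_X = [[i] for i in range(len(toy_y))]  # dummy single-feature rows

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=0
)

# Both splits should keep the original 70/30 class ratio
print(Counter(toy_y_train))
print(Counter(toy_y_test))
```

Without stratify, a random split could leave a smaller class, such as Freezers, under-represented in either set.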

Fit the model

Now the data have been prepared, we’ll fit the Multinomial Naive Bayes model.

There are a few hyperparameters that can be passed to this model, such as the alpha smoothing value, but we’ll just fit the default one for now.

model = MultinomialNB().fit(X_train, y_train)
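
To give a feel for what the fitted model stores, the sketch below fits MultinomialNB on a tiny made-up count matrix and recomputes the smoothed word probabilities for each class by hand. The toy data and variable names are assumptions for illustration only. The alpha value of 1.0 is the scikit-learn default.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: 3 documents x 3 vocabulary terms
toy_X = np.array([[2, 1, 0],
                  [1, 0, 1],
                  [0, 2, 3]])
toy_y = np.array(["a", "a", "b"])

alpha = 1.0  # additive (Laplace) smoothing, the default value
toy_model = MultinomialNB(alpha=alpha).fit(toy_X, toy_y)

# Recompute P(word | class) by hand:
# (word count in class + alpha) / (total words in class + alpha * vocabulary size)
manual = []
for label in toy_model.classes_:
    counts = toy_X[toy_y == label].sum(axis=0)
    manual.append((counts + alpha) / (counts.sum() + alpha * toy_X.shape[1]))
manual = np.array(manual)

print(manual)
print(np.exp(toy_model.feature_log_prob_))  # should match the manual values
```

The smoothing means a word that never appears in a class still gets a small non-zero probability, so a single unseen word can't rule a category out.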

Assess performance

Once the model has been fitted to the data, we can make our predictions on the X_test dataset and then calculate the accuracy score and F1 score. This gives us a decent score on the test data, with 94.96% accuracy and a macro-averaged F1 score of 0.945. Since the train and test split isn’t seeded, your exact figures will differ slightly from run to run.

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9495941098735133
f1_score(y_test, y_pred, average="macro")
0.9450395183071821
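
The average="macro" argument means the F1 score is calculated for each class separately and then averaged, with every class counting equally regardless of its size. Here's a minimal sketch on made-up labels (not the real predictions) to show the relationship.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels, for illustration only
toy_true = ["a", "a", "a", "b", "b", "c"]
toy_pred = ["a", "a", "b", "b", "b", "c"]

accuracy = accuracy_score(toy_true, toy_pred)
per_class = f1_score(toy_true, toy_pred, average=None)  # one F1 score per class
macro = f1_score(toy_true, toy_pred, average="macro")   # unweighted mean of those

print(accuracy)
print(per_class)
print(macro, per_class.mean())  # these two values should agree
```

Macro averaging is a reasonable choice here because a weak class, like Freezers, pulls the score down even if it has fewer products than the others.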

Examine the predictions

To check how well the model did in a bit more detail we can examine the precision, recall, and F1 score for each of the classes using a classification report.

print(classification_report(y_test, y_pred))
                  precision    recall  f1-score   support

            CPUs       1.00      1.00      1.00      1159
 Digital Cameras       0.99      0.99      0.99       809
     Dishwashers       0.95      0.98      0.96      1027
        Freezers       0.97      0.64      0.77       664
 Fridge Freezers       0.85      0.96      0.90      1651
         Fridges       0.91      0.89      0.90      1075
      Microwaves       0.98      0.98      0.98       703
   Mobile Phones       1.00      0.99      0.99      1224
             TVs       0.98      0.99      0.98      1069
Washing Machines       0.97      0.97      0.97      1213

        accuracy                           0.95     10594
       macro avg       0.96      0.94      0.95     10594
    weighted avg       0.95      0.95      0.95     10594
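
The lower recall for Freezers tells us many of them were missed, but not which categories they were assigned to instead. A confusion matrix shows this. The sketch below uses a handful of made-up labels rather than the real predictions, and the toy_ names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["Freezers", "Fridge Freezers", "Fridges"]

# Made-up actual and predicted labels, for illustration only
toy_true = ["Freezers", "Freezers", "Freezers",
            "Fridge Freezers", "Fridge Freezers",
            "Fridges", "Fridges"]
toy_pred = ["Freezers", "Fridge Freezers", "Fridges",
            "Fridge Freezers", "Fridge Freezers",
            "Fridge Freezers", "Fridges"]

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(toy_true, toy_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df.index.name = "actual"
cm_df.columns.name = "predicted"

print(cm_df)
```

With the real data you would pass y_test and y_pred instead of the toy lists. Off-diagonal cells show which pairs of categories the model mixes up.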

To examine the predictions themselves, we can join the y_pred predictions to the y_test data and put the results in a dataframe. By using a bit of Numpy, we can also add a binary flag of 1 or 0 to identify whether the prediction was correct or incorrect. I’ve sorted them with the errors at the top, so we can see where the model failed.

The results look pretty good for a first attempt. The model performs very well on most product types, particularly CPUs, but it gets confused over some Freezers, Fridge Freezers, and Fridges. Recall for Freezers is only 0.64, for example. That’s to be expected, as there’s far more overlap in the words appearing in these similar product categories than there is in others.

Of course, this is just a very basic example to show the overall approach. Additional work on the data, model selection, cross-validation, and tuning the model’s hyperparameters should improve this further.

results = pd.DataFrame(data={'predicted': y_pred, 'actual': y_test})
results['result'] = np.where(results['predicted']==results['actual'], 1, 0)
results.sort_values(by='result').head(20)
predicted actual result
33025 Fridge Freezers Fridges 0
25846 Fridges Freezers 0
25118 Fridge Freezers Freezers 0
25048 Fridge Freezers Freezers 0
28979 Fridges Fridge Freezers 0
32306 Fridge Freezers Fridges 0
25550 Fridges Freezers 0
31539 Fridges Fridge Freezers 0
30107 Fridges Fridge Freezers 0
32466 Fridge Freezers Fridges 0
24509 Fridge Freezers Freezers 0
26096 Fridge Freezers Freezers 0
31563 Fridges Fridge Freezers 0
28696 Dishwashers Fridge Freezers 0
24830 Fridge Freezers Freezers 0
23761 Fridge Freezers Washing Machines 0
30515 Fridges Fridge Freezers 0
24213 Fridge Freezers Freezers 0
34920 Washing Machines Fridges 0
26148 Fridges Freezers 0
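
As a possible next step, here's a rough sketch of how cross-validation and hyperparameter tuning could be wired together with a scikit-learn Pipeline. It isn't the approach used above. It swaps in TfidfVectorizer, runs on a dozen made-up titles, and the alpha grid and cv=2 setting are arbitrary choices that only suit such a small toy dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up titles and labels, for illustration only
toy_titles = [
    "apple iphone 8 plus 64gb silver", "samsung galaxy s9 64gb black",
    "apple iphone 7 32gb gold", "huawei p20 pro 128gb blue",
    "bosch serie 4 fridge freezer white", "samsung fridge freezer frost free",
    "beko fridge freezer silver 50/50", "hotpoint fridge freezer white",
    "intel core i7 8700k cpu", "amd ryzen 5 2600 cpu processor",
    "intel core i5 8400 cpu", "amd ryzen 7 2700x cpu",
]
toy_labels = ["Mobile Phones"] * 4 + ["Fridge Freezers"] * 4 + ["CPUs"] * 4

# Vectorizer and model in one object, so raw titles can be passed straight in
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Try a few smoothing values using 2-fold stratified cross-validation
search = GridSearchCV(
    pipeline,
    param_grid={"nb__alpha": [0.1, 0.5, 1.0]},
    cv=2,
    scoring="f1_macro",
)
search.fit(toy_titles, toy_labels)

print(search.best_params_)
print(search.predict(["apple iphone 8 plus 256gb space grey"]))
```

Putting the vectorizer inside the pipeline means the vocabulary is learned only from the training folds, and new product titles can be classified without a separate transform step.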

Matt Clarke, Saturday, March 13, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
