Naive Bayes classifiers are commonly used for machine learning text classification problems, such as predicting the sentiment of a tweet, identifying the language of a piece of text, or categorising a support ticket. They’re a mainstay of Natural Language Processing or NLP.
There are actually quite a few different Naive Bayes classification algorithms, which all utilise a technique known as Bayes’ Theorem. The basic concept of this, and the “naive” part of the name, is that every pair of features is considered independent of each other and is equal.
Obviously, this assumption is often wrong in real-world situations, which is why it’s called “naive” Bayes. However, that doesn’t stop it being extremely effective. In this simple Python example, I’ll show you the code you need to create a basic text classification model using the Multinomial Naive Bayes algorithm via the
MultinomialNB module in scikit-learn.
import pandas as pd import numpy as np from sklearn.feature_extraction.text import CountVectorizer from sklearn.model_selection import train_test_split from sklearn.naive_bayes import MultinomialNB from sklearn.metrics import accuracy_score from sklearn.metrics import classification_report from sklearn.metrics import f1_score
You can import any text dataset you like for this simple project. I’m using a Microsoft support ticket classification dataset. This contains the text from the support ticket a user sent to the help desk, plus some data on how it’s been categorised, it’s impact, urgency, the ticket type, and the category.
Since we have lots of training data, we could create a model that could examine the text of past tickets and predict the categorisation of future tickets. Support staff waste lots of time doing this manually, so this could save the support team a lot of time and save money for their business.
df = pd.read_csv('all_tickets.csv')
|text||licenses for wednesday march licenses dear found out interns require accounts please provide licenses kind regards engineer||retrieve library from recycle bin friday pm re retrieve library recycle bin hi asks deleted restore please ask restore site tuesday pm retrieve library recycle bin hello please advised queue kind regards analyst ext hub tuesday pm retrieve library recycle bin hi guys please assist retrieving library old were migrated thank leader||internal transfer request tuesday november pm re transfer form si mine tuesday november re transfer form document tuesday november transfer form hello va transfer form transfer date administration officer|
This dataset contains support tickets that have been categorised by support engineers using the
impact variables. We could create a model to predict any one of these values from the text data present. Let’s take a look at a few of the potential target variables we could predict.
1 34621 0 13928 Name: ticket_type, dtype: int64
4 34061 5 9634 6 2628 7 921 11 612 8 239 9 191 3 137 1 72 12 45 0 4 2 3 10 2 Name: category, dtype: int64
32 8174 36 3685 68 3589 67 2858 4 2527 ... 37 1 69 1 81 1 17 1 0 1 Name: business_service, Length: 103, dtype: int64
3 34621 1 6748 2 5528 0 1652 Name: urgency, dtype: int64
4 34621 3 13184 0 471 2 228 1 45 Name: impact, dtype: int64
Let’s go with the
ticket_type variable. If we create a model that can predict the
ticket_type class for each ticket, we can help staff to prioritise the tickets by their type. Using
value_counts() shows that we have two values numbered 0-1, making this a binary text classification model problem.
The other variables have more values and are multiclass text classification problems. They can be handled in the same way, using exactly the same model, however, it will be much harder to generate accurate predictions and the model is going to require a lot more tuning, so we’ll keep things simple instead.
Before we can do anything, we first need to preprocess our text to convert it to a numeric form. Machine learning models can’t use text, so the first step is to use a text preprocessing technique called Count Vectorization to turn the text into a vector of numbers via the
CountVectorizer module in scikit-learn.
CountVectorizer is widely used in Natural Language Processing (NLP), both in the model building process and for the analysis of text via n-grams (also called Q-grams or shingles).
CountVectorizer takes our text and returns a count for each time a word present in the entire “corpus” (that is, the whole dataset of tickets) appears in an individual ticket.
CountVectorizer, we’ll instantiate it, then use
fit_transform() to pass it the
text and return a Bag of Words model. We’ll then use Numpy to convert that Bag of Words to a dense array that can better be used by the model. Note that there are many other ways to preprocess text for machine learning models - this is one of many methods used.
count_vec = CountVectorizer() bow = count_vec.fit_transform(df['text']) bow = np.array(bow.todense())
Next, we’ll pass our Bag of Words data stored in
X and use it as the feature set for our model. We’ll assign the
impact column of our dataset as the target variable
y that we’re aiming to predict.
We’ll then use the
train_test_split() function to create our train and test data, allocating 30% for testing or validation and using stratification to ensure that the proportions are split equally across the datasets.
X = bow y = df['ticket_type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
Now everything is set up, we’ll fit a Multinomial Naive Bayes classification model using the
MultinomialNB module from scikit-learn. We’ll use the
fit() function to pass this our
y_train data to train the model to predict the
ticket_type from the vectors of the ticket text.
The Multinomial Naive Bayes model can be used for classification on any data with discrete features, such as our word counts from the count vectorization process. Usually integer counts, such as those from count vectorizer are needed to get good results, but the model can work OK with fractional counts from other text preprocessing techniques such as TF-IDF (Term Frequency Inverse Document Frequency).
model = MultinomialNB().fit(X_train, y_train)
Finally, in this very basic example, we can generate predictions from our trained model by passing the
X_test validation dataset to
predict() and returning the predictions in
There are various model evaluation metrics we can use. However, since this is a simple binary text classification problem, we’ll use
f1_score. These both give high scores of over 97%, so the base model performs quite well, without more advanced processes such as cross validation or hyperparameter tuning.
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred)) print('F1 score:', f1_score(y_test, y_pred, average="macro"))
Accuracy: 0.9796086508753862 F1 score: 0.9754309418775351
classification_report() function is also worth running on the data. For each class, this effectively shows how good the model was at predicting each class. The closer to 1 the better the performance.
precision recall f1-score support 0 0.94 0.99 0.97 4178 1 1.00 0.98 0.99 10387 accuracy 0.98 14565 macro avg 0.97 0.98 0.98 14565 weighted avg 0.98 0.98 0.98 14565
There are, of course, many ways we could further improve our model. We could try different preprocessing approaches, obtain or engineer additional features, utilise cross validation, model selection and hyperparameter tuning among others.
The results from our simple base model show that a basic, un-tuned model is easily capable of making good predictions, but some additional effort will likely see even better results.
Matt Clarke, Sunday, May 08, 2022
Learn the fundamentals of gradient boosting and build state-of-the-art machine learning models using XGBoost to solve classification and regression problems.Start course for FREE