How to create a product matching model using XGBoost

Product matching algorithms find identical products on ecommerce sites so users can compare products and retailers can compare prices. Here’s how to build one.


Product matching, or data matching, is a computational technique that uses Natural Language Processing and machine learning to identify identical products being sold on different websites, where the product names might not be a perfect match.

While product matching really has a single purpose - identifying products that are the same - it actually has a number of different applications in ecommerce:

  • Product comparison: Price comparison sites are one of the main places where product matching is used. Here the aim is to allow consumers to compare like-for-like matches of the same product across a range of websites, often from data that has been scraped.

  • Price comparison: Many retailers scrape prices from their competitors to check that they’re offering products at a competitive price. Product matching is an important step in this process and ensures that prices are being compared against identical products.

  • Multi-seller sites: On multi-seller sites and marketplaces such as Amazon, eBay, Walmart, and Wish, product matching algorithms are used to check that sellers don’t create duplicate products on the platform and cause items to be duplicated within search results.

  • Competitor analysis: Retailers also use product matching during competitor analysis to compare their product categories to their rivals’ and identify products they could add to their range, or when low competition allows them to increase their prices.

  • Product Knowledge Graphs: PKGs are a new concept in ecommerce and aim to identify relationships between products, such as complements, co-views, and substitutes, so the data can be used in product recommendations, marketing and advertising.

What makes product matching so difficult?

There are several reasons why product matching is difficult. Firstly, product content is remarkably inconsistent across retailers, and secondly, there’s no requirement for retailers to make it easy for their rivals to scrape their content, so unique product identifiers such as Global Trade Identification Numbers (GTINs) are often absent.

For example, take the “WH-1000XM3 Wireless Noise Cancelling Headphones” as Sony calls them. On every ecommerce site I checked, they have a different name, which the vendor has tweaked slightly to improve on-site “findability” and aid SEO.

  • Sony: WH-1000XM3 Wireless Noise Cancelling Headphones
  • Currys: SONY WH-1000XM3 Wireless Bluetooth Noise-Cancelling Headphones - Black
  • John Lewis: Sony WH-1000XM3 Noise Cancelling Wireless Bluetooth NFC High Resolution Audio Over-Ear Headphones with Mic/Remote, Black
  • Amazon: Sony WH-1000XM3 Noise Cancelling Wireless Headphones with Mic, 30 Hours Battery Life, Quick Charge, Gesture Control, Ambient Sound Mode, with Alexa Built-in – Black
  • Carters: Sony WH1000XM3SCE7 Audio
  • Buywise: Sony WH1000XM3BCE7 Headphone
  • ElectricShop: Sony WH1000XM3SCE7 Over Ear Wireless Noise Cancelling Headphones Silver
  • Very: WH-1000XM3 Wireless Noise-Cancelling Bluetooth Headphone with Built in Alexa
  • Richer Sounds: Sony WH-1000XM3 (Black)
  • eBay: WH-Gorsun 1000XM3 Bluetooth Headphone Active Noise Cancellation Earphone 55 Hour
  • eBay: Sony WH-1000XM3 Wireless Noise Cancelling Headphones - Black

The final issue is that product matching is computationally expensive, so creating a model that runs efficiently is more challenging than it is for more typical machine learning problems.

What are the common data issues?

There are many different issues that can be encountered when attempting data matching. Quite a few of them are evident in the tiny product sample above.

  • Naming inconsistency: None of the names above is a perfect match for the manufacturer's name, so whatever algorithm you use, it can never be certain of a perfect match.
  • Brand omission: Very's product name doesn't include the word Sony, making it harder to detect the brand from the title alone.
  • Product condition: One of the pairs of WH-1000XM3 headphones was found on eBay for £199.90, compared to around £300 for the other pairs. However, they're used rather than new, so they aren't a like-for-like match.
  • Formatting: Sony's product name for the headphones is WH-1000XM3, but this is shown as WH1000XM3SCE7 and WH1000XM3BCE7 on Buywise, ElectricShop, and Carters. Richer Sounds uses the official name but drops the hyphen in the product copy to aid SEO.
  • "Fake" products: The WH-Gorsun 1000XM3 looks visually similar to the Sony WH-1000XM3 and includes the "WH-" and "1000XM3" elements in its name, yet it costs a third of the price and isn't a Sony product.
  • Synonyms: These don't occur in this dataset, but matching algorithms need to understand that HP is the same as Hewlett Packard and that GB is the same as gigabytes. A minimal normalisation sketch follows this list.
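Handling synonyms typically means normalising product names before any similarity is computed. Here's a minimal sketch, assuming a small hand-built dictionary; the synonyms mapping and the normalise() helper are hypothetical, and a production system would need a far larger lookup.

def normalise(name):
    """Lowercase a product name and replace known synonyms with a canonical form."""
    # Hypothetical mapping; longer phrases come first so 'gigabytes' is
    # replaced before 'gigabyte'
    synonyms = {
        'hewlett packard': 'hp',
        'gigabytes': 'gb',
        'gigabyte': 'gb',
    }
    name = name.lower()
    for phrase, canonical in synonyms.items():
        name = name.replace(phrase, canonical)
    return name

print(normalise('Hewlett Packard laptop 16 Gigabytes'))  # hp laptop 16 gb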

How was product matching done before?

Before machine learning, product matching was done via a process called “manual matching”. Basically, someone had the unfortunate task of going through each competitor and manually mapping each product on their sites to the ones sold on the vendor’s site.

Manual matching is still used (even on some commercial price comparison platforms for competitor analysis), but it's crude, expensive, and laborious. However, it can be made far easier using even fairly simple computational approaches, such as distance metrics, which can be used to present best-guess matches for a human to moderate.

There are, of course, reasons why manual matching may be the only option. For example, if you are the manufacturer of a product, you may wish to compare it to the closest alternative using your business knowledge, so product matching just wouldn’t work there.

What do product matching algorithms examine?

Obviously, the point of a product matching algorithm is to state whether a product in one retailer’s catalogue is the same as one in another. This is therefore a binary supervised classification problem - a product is either a “match” or “not a match”.

The classification is achieved using a wide range of features, which vary according to the complexity of the model. Features such as Levenshtein distance and a range of other similarity metrics are calculated and fed into the classifier, allowing it to go from a similarity score to a binary classification of “match” or “not a match”.

  • Product name similarity: Metrics such as Levenshtein Distance, TF-IDF, Jaccard Similarity and Cosine Similarity are used to compare the similarity of product names or titles in order to identify the closest matches.
  • Image similarity: If a product on one retailer's website includes an image that is identical, or visually similar, to one on another retailer's site, this is a good indicator that the product is the same.
  • Colours: For clothing, and some other products, colours are often important in aiding product matching. These can either be extracted from the product text (and perhaps mapped to a colour dictionary that links royal blue to bright blue), or colour hex codes can be extracted from the image itself. Colour distances can then be calculated using the CIELAB Delta E (CIELAB ΔE*) metric, as sketched after this list.
  • Price outlier detection: Detecting whether a product price is outside the normal range can be a useful indicator of the strength of a product match. For example, if the product is being sold for between £25.99 and £29.99 on most sites, and one is selling it for just £4.99, there's a strong likelihood that it's not the same product. Distance between prices is easily measured.
  • Number of variants: Assuming retailers stock each variant within a range, or take the same approach to displaying them on a page (i.e. all variants on one page via a configurable product), then the number of variants can also be a potential match indicator. The numerical similarity of the number of variants can be used as a measure.
  • Product attributes: One advanced feature of some product matching algorithms is the incorporation of Product Attribute Extraction (PAE) models. These extract values such as "64GB" or "black" from the product data to help improve matching accuracy. Extracting these is often very complex.
  • Dictionary values: Dictionary approaches also work well on attributes such as material. For example, Polyester, GoreTex, Acrylic, Olefin, Nylon and Modacrylic are all synthetic fabric materials, but one retailer may call the product "acrylic" and another "synthetic". These often get placed in clusters or families based on material similarity, i.e. `is_synthetic`.
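The CIELAB ΔE*76 metric mentioned above is simply the Euclidean distance between two colours in Lab space, so it can be computed with no dependencies. A minimal sketch; the Lab values below are rough illustrative figures rather than measured ones, and a production system might prefer the more perceptually uniform ΔE*2000 formula from a library such as colormath.

import math

def delta_e_76(lab1, lab2):
    """CIELAB ΔE*76: Euclidean distance between two (L*, a*, b*) colours."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab1, lab2)))

royal_blue = (32.0, 23.0, -65.0)   # approximate Lab values, for illustration only
bright_blue = (40.0, 10.0, -70.0)  # approximate Lab values, for illustration only
print(delta_e_76(royal_blue, bright_blue))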

Several researchers in the product matching field have utilised schema.org markup from ecommerce sites. Recently, Ralph Peeters and his team from the University of Mannheim, Germany, used schema.org markup alone and were able to achieve the current state-of-the-art F1 score of 0.95 for product matching.

Load the packages

To create a product matching model we’ll need a range of packages, including the usual Pandas and Numpy for data manipulation. We’ll be using the Jellyfish and FuzzyWuzzy packages for calculating text similarities, such as Levenshtein distance, and a range of Scikit-Learn models and other machine learning packages.

import time
import re
import pandas as pd
import numpy as np
import jellyfish as jf
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

Load the data

The dataset I’m using is a semi-synthetic product matching dataset I created from the PriceRunner product matching dataset found on Kaggle. I’ve explained how you can create a similar dataset in this article. The dataset comprises several fields, but we only need three columns - the external_name, the internal_name, and match.

The external_name represents the product name that each retailer has given to the product, the internal_name represents the retailer's own product name to which it is mapped, and match identifies whether the product names match or not.

df = pd.read_csv('product_matching_synthetic.csv')
df.head()
external_name internal_name category_label match
0 apple iphone 8 plus 64gb silver Apple iPhone 8 Plus 64GB Mobile Phones 1
1 apple iphone 8 plus 64 gb spacegrau Apple iPhone 8 Plus 64GB Mobile Phones 1
2 apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... Apple iPhone 8 Plus 64GB Mobile Phones 1
3 apple iphone 8 plus 64gb space grey Apple iPhone 8 Plus 64GB Mobile Phones 1
4 apple iphone 8 plus gold 5.5 64gb 4g unlocked ... Apple iPhone 8 Plus 64GB Mobile Phones 1

Feature engineering

The feature engineering on this dataset is simpler than it would ordinarily be, because we're only working with the product title and don't have access to product attributes, prices, or images. I've used a range of different similarity metrics, such as Levenshtein distance, and various token ratios, which give a score corresponding to the level of similarity between the names.

The other feature, added via the matching_numbers() function, came from an idea I read about in an article by Ertug Odabasi. This extracts the numeric data from the product names, e.g. 64, and measures how similar the sets of numbers in the two names are. This proved to be a valuable feature.

def matching_numbers(external_name, internal_name):

    external_numbers = set(re.findall(r'[0-9]+', external_name))
    internal_numbers = set(re.findall(r'[0-9]+', internal_name))    
    union = external_numbers.union(internal_numbers)
    intersection = external_numbers.intersection(internal_numbers)

    if len(external_numbers)==0 and len(internal_numbers) == 0:
        return 1
    else:
        return (len(intersection) / len(union))

def engineer_features(df):

    df['internal_name'] = df['internal_name'].str.lower()
    df['external_name'] = df['external_name'].str.lower()

    df['levenshtein_distance'] = df.apply(
    lambda x: jf.levenshtein_distance(x['external_name'], 
                                      x['internal_name']), axis=1)

    df['damerau_levenshtein_distance'] = df.apply(
    lambda x: jf.damerau_levenshtein_distance(x['external_name'], 
                                              x['internal_name']), axis=1)

    df['hamming_distance'] = df.apply(
    lambda x: jf.hamming_distance(x['external_name'], 
                                  x['internal_name']), axis=1)

    df['jaro_similarity'] = df.apply(
    lambda x: jf.jaro_similarity(x['external_name'], 
                                  x['internal_name']), axis=1)

    df['jaro_winkler_similarity'] = df.apply(
    lambda x: jf.jaro_winkler_similarity(x['external_name'], 
                                         x['internal_name']), axis=1)

    df['match_rating_comparison'] = df.apply(
    lambda x: jf.match_rating_comparison(x['external_name'], 
                                         x['internal_name']), axis=1).fillna(0).astype(int)

    df['ratio'] = df.apply(
    lambda x: fuzz.ratio(x['external_name'], 
                         x['internal_name']), axis=1)

    df['partial_ratio'] = df.apply(
    lambda x: fuzz.partial_ratio(x['external_name'], 
                                 x['internal_name']), axis=1)

    df['token_sort_ratio'] = df.apply(
    lambda x: fuzz.token_sort_ratio(x['external_name'], 
                                    x['internal_name']), axis=1)

    df['token_set_ratio'] = df.apply(
    lambda x: fuzz.token_set_ratio(x['external_name'], 
                                   x['internal_name']), axis=1)

    df['w_ratio'] = df.apply(
    lambda x: fuzz.WRatio(x['external_name'], 
                          x['internal_name']), axis=1)

    df['uq_ratio'] = df.apply(
    lambda x: fuzz.UQRatio(x['external_name'], 
                          x['internal_name']), axis=1)

    df['q_ratio'] = df.apply(
    lambda x: fuzz.QRatio(x['external_name'], 
                          x['internal_name']), axis=1)    

    df['matching_numbers'] = df.apply(
    lambda x: matching_numbers(x['external_name'], 
                               x['internal_name']), axis=1)

    df['matching_numbers_log'] = (df['matching_numbers']+1).apply(np.log)

    df['log_fuzz_score'] = (df['ratio'] + df['partial_ratio'] + 
                            df['token_sort_ratio'] + df['token_set_ratio']).apply(np.log)

    df['log_fuzz_score_numbers'] = df['log_fuzz_score'] + (df['matching_numbers']).apply(np.log)

    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.fillna(value=0, inplace=True)

    return df

df = engineer_features(df)

Examine correlations

Examining the Pearson correlation coefficients between the features shows that quite a few are highly correlated with matches. The matching_numbers feature was the top performer, followed by various string similarity metrics. There's some collinearity among them, so some features could be removed through feature selection; one simple approach is sketched after the output below.

df[df.columns[1:]].corr()['match'][:].sort_values(ascending=False)
match                           1.000000
matching_numbers_log            0.780429
matching_numbers                0.780419
token_set_ratio                 0.731331
log_fuzz_score_numbers          0.701562
partial_ratio                   0.669121
jaro_winkler_similarity         0.648080
jaro_similarity                 0.597436
log_fuzz_score                  0.597043
ratio                           0.579865
q_ratio                         0.576836
uq_ratio                        0.576830
token_sort_ratio                0.572287
match_rating_comparison         0.477758
w_ratio                         0.472424
hamming_distance               -0.143762
damerau_levenshtein_distance   -0.146023
levenshtein_distance           -0.146245
Name: match, dtype: float64
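If you did want to prune the collinear features, one simple approach is to drop any feature whose absolute correlation with an already-retained feature exceeds a threshold. A hedged sketch; the drop_collinear() helper and the 0.95 cut-off are assumptions, not tuned values.

def drop_collinear(df, features, threshold=0.95):
    """Greedily keep features that aren't highly correlated with any kept feature."""
    corr = df[features].corr().abs()
    kept = []
    for col in features:
        # Keep the column only if it isn't a near-duplicate of a kept one
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

# e.g. drop_collinear(df, ['ratio', 'q_ratio', 'uq_ratio']) might keep only
# 'ratio' if the other two turn out to be near-duplicates of it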

Create test and train data

Next, we'll take the features above and assign them to the X feature set, and assign the match column to our y target. The test and training datasets can then be created in the usual manner using train_test_split().

X = df[['levenshtein_distance', 'damerau_levenshtein_distance', 'hamming_distance',
       'jaro_similarity','jaro_winkler_similarity','matching_numbers_log',
       'matching_numbers','token_set_ratio','token_sort_ratio','partial_ratio',
       'ratio','log_fuzz_score','log_fuzz_score_numbers','match_rating_comparison',
       'q_ratio','uq_ratio','w_ratio']].values
y = df['match'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

Select the model

To identify which models are likely to be best suited to the problem, we'll loop through a selection of them, fit each one on the training data, make predictions on the test data, and save the results to a dataframe.

def get_confusion_matrix_values(y_test, y_pred):
    # For binary labels, confusion_matrix returns [[tn, fp], [fn, tp]]
    cm = confusion_matrix(y_test, y_pred)
    return (cm[0][0], cm[0][1], cm[1][0], cm[1][1])

classifiers = {
    "DummyClassifier_stratified":DummyClassifier(strategy='stratified', random_state=0),    
    "KNeighborsClassifier":KNeighborsClassifier(3),
    "XGBClassifier":XGBClassifier(n_estimators=1000, learning_rate=0.1),
    "DecisionTreeClassifier":DecisionTreeClassifier(),
    "RandomForestClassifier":RandomForestClassifier(),
    "AdaBoostClassifier":AdaBoostClassifier(),
    "GradientBoostingClassifier":GradientBoostingClassifier(),
    "Perceptron": Perceptron(max_iter=40, eta0=0.1, random_state=1),
    "MLP": MLPClassifier(),
    "XGBClassifer tuned": XGBClassifier(colsample_bytree=0.8,
                      gamma=0.9,
                      max_depth=20,
                      min_child_weight=1,
                      scale_pos_weight=12,
                      subsample=0.9,
                      n_estimators=50, 
                      learning_rate=0.1)
}

df_results = pd.DataFrame(columns=['model', 'accuracy', 'mae', 'precision',
                                   'recall', 'f1', 'roc', 'run_time', 'tn',
                                   'fp', 'tp', 'fn'])

for key in classifiers:

    start_time = time.time()
    classifier = classifiers[key]
    model = classifier.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    roc = roc_auc_score(y_test, y_pred)
    classification = classification_report(y_test, y_pred, zero_division=0)
    run_time = format(round((time.time() - start_time)/60,2))
    tn, fp, fn, tp = get_confusion_matrix_values(y_test, y_pred)

    row = {'model': key,
           'accuracy': accuracy,
           'mae': mae,
           'precision': precision,
           'recall': recall,
           'f1': f1,
           'roc': roc,
           'run_time': run_time,
           'tn': tn,
           'fp': fp,
           'tp': tp,
           'fn': fn,
          }
    # DataFrame.append() was removed in pandas 2.0, so use concat instead
    df_results = pd.concat([df_results, pd.DataFrame([row])], ignore_index=True)

df_results.head(10)
model accuracy mae precision recall f1 roc run_time tn fp tp fn
0 DummyClassifier_stratified 0.832219 0.167781 0.094909 0.095907 0.095406 0.501478 0.0 95945 9832 1031 9719
1 KNeighborsClassifier 0.980073 0.019927 0.908967 0.871256 0.889712 0.931194 0.17 104839 938 9366 1384
2 XGBClassifier 0.986149 0.013851 0.948811 0.898326 0.922878 0.946700 0.72 105256 521 9657 1093
3 DecisionTreeClassifier 0.978211 0.021789 0.873329 0.893395 0.883248 0.940113 0.07 104384 1393 9604 1146
4 RandomForestClassifier 0.986716 0.013284 0.946699 0.907070 0.926461 0.950940 1.05 105228 549 9751 999
5 AdaBoostClassifier 0.980391 0.019609 0.933081 0.848279 0.888662 0.921048 0.24 105123 654 9119 1631
6 GradientBoostingClassifier 0.983257 0.016743 0.944708 0.869395 0.905489 0.932112 0.92 105230 547 9346 1404
7 Perceptron 0.973131 0.026869 0.971765 0.729953 0.833679 0.863899 0.02 105549 228 7847 2903
8 MLP 0.983429 0.016571 0.951654 0.864279 0.905865 0.929908 2.23 105305 472 9291 1459
9 XGBClassifier tuned 0.984879 0.015121 0.906991 0.931628 0.919145 0.960959 0.16 104750 1027 10015 735

Select and tune the best model

The XGBoost model generated the best overall result, so we'll tune this to see if we can increase the score. To do this we need to know an appropriate scale_pos_weight value, which can be calculated from the ratio of the negative class to the positive class using the function I've written below.

def get_scale_pos_weight(target, square_root=False, gridsearch=False):
    """Return the scale_pos_weight parameter for the XGBoost model when data are imbalanced.
    The scale_pos_weight parameter is calculated from the ratio of the negative class over
    the positive class. The exact scale_pos_weight sometimes does not give the best result,
    so by passing the gridsearch=True parameter you can return a list of values to test with
    GridSearchCV. In addition, passing square_root=True changes the scale_pos_weight to the
    square root value, which can sometimes be beneficial on extremely imbalanced data.

    :param target: Pandas dataframe column containing the binary target
    :param square_root: Optional boolean parameter to convert to square root on extremely unbalanced data
    :param gridsearch: Optional boolean parameter to return a bracketed list for use in GridSearchCV

    Usage:
        scale_pos_weight = get_scale_pos_weight(df['target'], square_root=False, gridsearch=True)

    """

    import math

    scale_pos_weight = round((len(target) - sum(target)) / sum(target))

    if square_root:
        scale_pos_weight = round(math.sqrt(scale_pos_weight))

    if gridsearch:
        scale_pos_weight = [scale_pos_weight-2, scale_pos_weight-1, scale_pos_weight, 
                            scale_pos_weight+1, scale_pos_weight+2]

    return scale_pos_weight

scale_pos_weight = get_scale_pos_weight(df['match'], square_root=False, gridsearch=True)
scale_pos_weight
[8, 9, 10, 11, 12]

The scale_pos_weight values can be added to the param_grid lists and then GridSearchCV can be used to identify the optimum combination of model parameters to generate the best score.

n_estimators = [50]
learning_rate = [0.1]
max_depth = [5, 10, 20]
min_child_weight = [1, 2]
scale_pos_weight = [8, 9, 10, 11, 12]
gamma = [0.9, 1.0]
subsample = [0.9]
colsample_bytree = [0.8, 1.0]

start = time.perf_counter()

param_grid = dict(
                n_estimators=n_estimators,
                learning_rate=learning_rate,
                max_depth=max_depth,
                min_child_weight=min_child_weight,
                scale_pos_weight=scale_pos_weight,
                gamma=gamma,
                subsample=subsample,
                colsample_bytree=colsample_bytree,
)

model = XGBClassifier(random_state=0)

grid_search = GridSearchCV(estimator=model,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           )

print('Running GridSearchCV...')
best_model = grid_search.fit(X_train, y_train)
best_score = round(best_model.score(X_test, y_test), 4)
best_params = best_model.best_params_

print('Score:', best_score)
print('Optimum parameters', best_params)

finish = time.perf_counter()
run_time = (finish - start) / 60  # elapsed time in minutes
print(f"Completed task in {run_time:0.4f} minutes")

Fit selected model

Running GridSearchCV returned the below settings, so we’ll fit the XGBClassifier model using these on the training data and then make some predictions on the test dataset.

model = XGBClassifier(colsample_bytree=0.8,
                      gamma=0.9,
                      max_depth=20,
                      min_child_weight=1,
                      scale_pos_weight=12,
                      subsample=0.9,
                      n_estimators=50, 
                      learning_rate=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Assess model performance

The classification report shows that the model scores pretty well (given that the current state of the art score is an F1 of 0.95), and we still have lots of other things to try.

print(classification_report(y_test, y_pred, labels=[1, 0], 
                            target_names=['match', 'not match']))
              precision    recall  f1-score   support

       match       0.91      0.93      0.92     10750
   not match       0.99      0.99      0.99    105777

    accuracy                           0.98    116527
   macro avg       0.95      0.96      0.96    116527
weighted avg       0.99      0.98      0.98    116527

results = pd.DataFrame(data={'predictions': y_pred, 'actual': y_test})
results['result'] = np.where(results['predictions']==results['actual'], 1, 0)
results.head(20)
predictions actual result
0 0 0 1
1 0 0 1
2 0 0 1
3 0 0 1
4 0 0 1
5 0 0 1
6 0 0 1
7 0 0 1
8 0 0 1
9 1 0 0
10 0 0 1
11 1 1 1
12 0 0 1
13 0 0 1
14 0 0 1
15 0 0 1
16 0 0 1
17 0 0 1
18 0 0 1
19 0 0 1

Generate new predictions

Generating new predictions is definitely the trickiest bit to get your head around. The model predicts whether a single product name is a match against another single product name. Therefore, testing every product in a retailer’s product catalogue for a potential match could entail comparing each product sold by one retailer against each product sold by another via the model.

Let’s say that both retailers have a modest 10,000 SKUs each. That would require 10,000 x 10,000, or 100 million predictions. If you were comparing prices against 10 retailers with similar inventories, this would quickly become an extremely time-consuming and expensive process and just wouldn’t scale. We therefore need a way of reducing the number of comparisons that need to be made by filtering out those with a lower probability of being a match before making a final check via the model.
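One generic way to cut the candidate space before any model or fuzzy scoring is "blocking": only comparing products that share a cheap-to-compute key, such as a brand token. A minimal sketch, assuming the first word of the name approximates the brand (an assumption that fails for listings like Very's, which omit the brand entirely):

from collections import defaultdict

def block_by_first_token(names):
    """Group product names under a crude blocking key: their first token."""
    blocks = defaultdict(list)
    for name in names:
        tokens = name.lower().split()
        key = tokens[0] if tokens else ''
        blocks[key].append(name)
    return blocks

# Two catalogues of 10,000 SKUs spread across a few hundred brands now only
# need within-block comparisons, rather than 100 million pairwise checks.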

Find the closest matches

To help reduce the number of products that need to be compared via the model, I used the FuzzyWuzzy token_set_ratio scorer to take a single external product name, compare it to all the unique internal product names in the dataset, and return a sorted list of token_set_ratio scores.

Each one of the closest matching internal_name values can be compared to the external_name using the model to see if there’s a match or not. This greatly reduces the volume of matches the model needs to test for each product in the dataset, with the function returning the top five.

def get_closest_matches(external_name):

    # Compare the external name against every unique internal name;
    # process.extract() returns the five highest-scoring candidates by default
    unique_internal_names = df['internal_name'].unique().tolist()
    closest_matches = process.extract(external_name,
                                      unique_internal_names,
                                      scorer=fuzz.token_set_ratio)

    return closest_matches

Preprocessing the data

Next, we need to preprocess the data in the same way as we did during the training step. To do this, I've assigned the external name we're trying to match to the external_name column and added each candidate from get_closest_matches() to the internal_name column.

def prepare_data(external_name):

    closest_matches = get_closest_matches(external_name)

    # Build one row per candidate pair; DataFrame.append() was removed in
    # pandas 2.0, so construct the dataframe from a list of dicts instead
    rows = [{'external_name': external_name, 'internal_name': match[0]}
            for match in closest_matches]

    return pd.DataFrame(rows, columns=['external_name', 'internal_name'])

closest_data = prepare_data("apple iphone x")
closest_data.head()
external_name internal_name
0 apple iphone x apple iphone x 64gb
1 apple iphone x apple iphone x 256gb
2 apple iphone x apple iphone 8 plus 64gb
3 apple iphone x apple iphone 7 plus 32gb
4 apple iphone x apple iphone 7 32gb

Now we can simply re-run the engineer_features() function we used during training to calculate the similarities between each of the product name pairs. That gives us data that’s formatted in the same way as our training data.

data = engineer_features(closest_data)
data = data[['levenshtein_distance', 'damerau_levenshtein_distance', 'hamming_distance',
       'jaro_similarity','jaro_winkler_similarity','matching_numbers_log',
       'matching_numbers','token_set_ratio','token_sort_ratio','partial_ratio',
       'ratio','log_fuzz_score','log_fuzz_score_numbers','match_rating_comparison',
       'q_ratio','uq_ratio','w_ratio']]
data.head()

Generate the predictions

Finally, we can generate new predictions by passing the dataframe through to the model’s predict_proba() function to identify whether any of the product name pairs were identified as matches by the model. Based on the external_name “apple iphone x” we get two matches, which are both correct.

y_pred = model.predict_proba(data)[:,1]
data = data.assign(prediction=y_pred)
data = data.merge(closest_data)
data[['external_name','internal_name','prediction']].head()
external_name internal_name prediction
0 apple iphone x apple iphone x 64gb 0.995701
1 apple iphone x apple iphone x 256gb 0.995701
2 apple iphone x apple iphone 8 plus 64gb 0.100689
3 apple iphone x apple iphone 7 plus 32gb 0.100689
4 apple iphone x apple iphone 8 plus 64gb 0.100689
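To turn these probabilities into final match decisions, you can apply a cut-off. A short sketch; the 0.5 threshold mirrors the default behind predict(), but it's an assumption you'd normally tune against precision and recall.

# Keep only the candidate pairs whose match probability clears the threshold
threshold = 0.5
matches = data[data['prediction'] > threshold]
print(matches[['external_name', 'internal_name', 'prediction']])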

Next steps

Clearly, this proof-of-concept model uses only a small number of the potential features you'd use in the real world, which limits its performance. Ordinarily, you'd spend weeks or months building such a model, so this one is inevitably imperfect, but it still performs quite well. There are quite a few other things we could try to improve the results:

  • Add more product attributes, such as GTIN and EAN codes, and measures of image similarity.
  • Add prices and measures of distance between the external and internal price.
  • Test different scaling or normalisation approaches.
  • Create an ensemble model that uses alternative approaches such as TF-IDF and CountVectorizer (see the sketch after this list).
  • Use an additional sub-model to predict the product category from the name and add it as a feature.
  • Cross-validate using an alternative scorer, such as the precision-recall curve.
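As an example of the TF-IDF idea, a character n-gram cosine similarity could be computed for each name pair and added as an extra feature. A minimal sketch; fitting a vectorizer per pair is inefficient and the (2, 4) n-gram range is an assumption, so in practice you'd fit once over the whole corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(external_name, internal_name):
    """Cosine similarity between character n-gram TF-IDF vectors of two names."""
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
    vectors = vectorizer.fit_transform([external_name, internal_name])
    return cosine_similarity(vectors[0], vectors[1])[0][0]

print(tfidf_similarity('apple iphone x', 'apple iphone x 64gb'))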

Further reading

  • Ristoski, P., Petrovski, P., Mika, P. and Paulheim, H., 2018. A machine learning approach for product matching and categorization. Semantic web, 9(5), pp.707-728.

  • Petrovski, P., Primpeli, A., Meusel, R. and Bizer, C., 2016, September. The WDC gold standards for product feature extraction and product matching. In International Conference on Electronic Commerce and Web Technologies (pp. 73-86). Springer, Cham. http://webdatacommons.org/productcorpus/#toc7

  • Peeters, R., Primpeli, A., Wichtlhuber, B. and Bizer, C., 2020, June. Using schema.org annotations for training and maintaining product matchers. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics (pp. 195-204).

  • Xu, D., Ruan, C., Korpeoglu, E., Kumar, S. and Achan, K., 2020, January. Product knowledge graph embedding for ecommerce. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 672-680).

Matt Clarke, Saturday, March 13, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.