Product matching, or data matching, is a computational technique that uses Natural Language Processing and machine learning to identify identical products being sold on different websites, where product names might not always be a perfect match.
While product matching has a single purpose, identifying products that are the same, it has a number of different applications in ecommerce:
Product comparison: Price comparison sites are one of the main places where product matching is used. Here the aim is to allow consumers to compare like-for-like matches of the same product across a range of websites, often from data that has been scraped.
Price comparison: Many retailers scrape prices from their competitors to check that they’re offering products at a competitive price. Product matching is an important step in this process and ensures that prices are being compared against identical products.
Multi-seller sites: On multi-seller sites and marketplaces such as Amazon, eBay, Walmart, and Wish, product matching algorithms are used to check that sellers don’t create duplicate products on the platform and cause items to be duplicated within search results.
Competitor analysis: Retailers also use product matching during competitor analysis to compare their product categories to their rivals’, identify products they could add to their range, and spot where low competition allows them to increase their prices.
Product Knowledge Graphs: PKGs are a new concept in ecommerce and aim to identify relationships between products, such as complements, co-views, and substitutes, so the data can be used in product recommendations, marketing and advertising.
There are several reasons why product matching is difficult. Firstly, product content is remarkably inconsistent across retailers. Secondly, there’s no requirement for retailers to make it easy for their rivals to scrape their content, so unique product identifiers such as Global Trade Item Numbers (GTINs) are often absent. Finally, product matching is computationally expensive, so creating a model that runs efficiently is more challenging than it is for more typical machine learning problems.
For example, take the “WH-1000XM3 Wireless Noise Cancelling Headphones” as Sony calls them. On every ecommerce site I checked, they have a different name, which the vendor has tweaked slightly to improve on-site “findability” and aid SEO.
Sony | WH-1000XM3 Wireless Noise Cancelling Headphones |
Currys | SONY WH-1000XM3 Wireless Bluetooth Noise-Cancelling Headphones - Black |
John Lewis | Sony WH-1000XM3 Noise Cancelling Wireless Bluetooth NFC High Resolution Audio Over-Ear Headphones with Mic/Remote, Black |
Amazon | Sony WH-1000XM3 Noise Cancelling Wireless Headphones with Mic, 30 Hours Battery Life, Quick Charge, Gesture Control, Ambient Sound Mode, with Alexa Built-in – Black |
Carters | Sony WH1000XM3SCE7 Audio |
Buywise | Sony WH1000XM3BCE7 Headphone |
ElectricShop | Sony WH1000XM3SCE7 Over Ear Wireless Noise Cancelling Headphones Silver |
Very | WH-1000XM3 Wireless Noise-Cancelling Bluetooth Headphone with Built in Alexa |
Richer Sounds | Sony WH-1000XM3 (Black) |
eBay | WH-Gorsun 1000XM3 Bluetooth Headphone Active Noise Cancellation Earphone 55 Hour |
eBay | Sony WH-1000XM3 Wireless Noise Cancelling Headphones - Black |
There are many different issues that can be encountered when attempting data matching. Quite a few of them are evident in the tiny product sample above.
Naming inconsistency | Absolutely none of the names above are a perfect match against the manufacturer's name, so whatever algorithm you use, it could never be sure of a perfect match. |
Brand omission | Very's product name doesn't include the word Sony, making it harder to detect the brand from the title alone. |
Product condition | One of the pairs of WH-1000XM3 headphones was found on eBay for £199.90, compared to around £300 for the other pairs. However, they're used and not new, so aren't a like-for-like match. |
Formatting | Sony's product name for the headphones is WH-1000XM3, but this is shown as WH1000XM3SCE7 and WH1000XM3BCE7 on Buywise, ElectricShop, and Carters. Richer Sounds uses the official name and also drops the hyphen in the product copy, to aid SEO. |
"Fake" products | The WH-Gorsun 1000XM3 looks visually similar to the Sony WH-1000XM3 and includes the "WH-" and "1000XM3" elements in its name, yet it costs a third of the price and isn't a Sony product. |
Synonyms | They don't really occur in this dataset, but matching algorithms need to understand that HP is the same as Hewlett Packard, and that GB is the same as gigabytes. |
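Some of these issues, particularly synonyms and unit formatting, can be reduced with light normalisation before any similarity scores are calculated. The sketch below is illustrative only: the `SYNONYMS` dictionary and the `normalise_name()` helper are hypothetical and are not part of the model built later in this article.

```python
import re

# Hypothetical synonym map; a real one would be much larger and curated by hand.
SYNONYMS = {
    "hewlett packard": "hp",
    "gigabytes": "gb",
    "gigabyte": "gb",
}


def normalise_name(name):
    """Lowercase a product name, apply synonyms and tidy number-unit spacing."""
    name = name.lower()
    for phrase, replacement in SYNONYMS.items():
        name = re.sub(rf"\b{re.escape(phrase)}\b", replacement, name)
    # Join a number and its unit, so "64 gb" becomes "64gb"
    name = re.sub(r"(\d+)\s+(gb|tb|mb)\b", r"\1\2", name)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", name).strip()


print(normalise_name("Hewlett Packard Pavilion 512 Gigabytes"))
print(normalise_name("apple iphone 8 plus 64 gb spacegrau"))
```

Normalising both names in a pair the same way means the downstream similarity metrics are not penalised for purely cosmetic differences.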
Before machine learning, product matching was done via a process called “manual matching”. Basically, someone had the unfortunate task of going through each competitor and manually mapping each product on their sites to the ones sold on the vendor’s site.
Manual matching is still used (even on some commercial price comparison platforms for competitor analysis), but it’s considered crude, expensive, and laborious. However, it can be made far easier using even fairly simple computational approaches such as distance metrics, which can be used to present a shortlist of best guesses for a person to moderate.
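As a rough illustration of that kind of assisted manual matching, the sketch below uses Python's built-in `difflib` module to shortlist candidates for a moderator. The catalogue, the `suggest_matches()` helper and the `cutoff` value are assumptions made for the example, not part of any particular platform.

```python
from difflib import get_close_matches

# Illustrative catalogue of our own product names (hypothetical data)
internal_names = [
    "sony wh-1000xm3 wireless noise cancelling headphones",
    "sony wh-ch700n wireless headphones",
    "bose quietcomfort 35 ii wireless headphones",
]


def suggest_matches(external_name, candidates, n=3, cutoff=0.5):
    """Return up to n candidate names for a human moderator to review."""
    return get_close_matches(external_name.lower(), candidates, n=n, cutoff=cutoff)


suggestions = suggest_matches(
    "SONY WH-1000XM3 Wireless Bluetooth Noise-Cancelling Headphones - Black",
    internal_names,
)
print(suggestions)
```

The moderator still makes the final decision, but only has to look at a handful of candidates per product instead of the whole catalogue.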
There are, of course, reasons why manual matching may be the only option. For example, if you are the manufacturer of a product, you may wish to compare it to the closest alternative using your business knowledge, so product matching just wouldn’t work there.
Obviously, the point of a product matching algorithm is to state whether a product in one retailer’s catalogue is the same as one in another. This is therefore a binary supervised classification problem - a product is either a “match” or “not a match”.
The classification is achieved using a wide range of features, which vary according to the complexity of the model. Features such as Levenshtein distance and a range of other similarity metrics are calculated and fed into the classifier, allowing it to go from a similarity score to a binary classification of “match” or “not a match”.
Product name similarity | Various metrics, such as Levenshtein Distance, TF-IDF, Jaccard Similarity and Cosine Similarity are used to compare string similarities on product names or product titles in order to try and identify the closest matches. |
Image similarity | If a product on one retailer's website includes an image which is identical to that on another, or is visually similar, this is a good indicator that the product is the same. |
Colours | For clothing, and some other products, colours are often important in aiding product matching. These can either be extracted from the product text (and perhaps mapped to a colour dictionary which links royal blue to bright blue), or colour hex codes can be extracted from the image itself. Colour distances can then be calculated using the CIELAB Delta E (CIELAB ΔE*) metric. |
Price outlier detection | Detecting whether a product price is outside the normal range can be a useful indicator to the strength of the product match. For example, if the product is being sold for between £25.99 and £29.99 on most sites, and one is selling it for just £4.99, there's a strong likelihood that it's not the same product. Distance between prices is easily measured. |
Number of variants | Assuming retailers stock each variant within a range, or have the same approach to displaying them on a page (i.e. all variants on one page via a configurable product) then the number of variants can also be a potential match indicator. The numerical similarity of the number of variants can be used as a measure. |
Product attributes | One advanced feature of some product matching algorithms is the incorporation of Product Attribution Extraction or PAE models. These extract values such as "64GB" or "black" from the product data to help improve matching accuracy. Extracting these is often very complex. |
Dictionary values | Dictionary approaches also work well on attributes such as material. For example, Polyester, GoreTex, Acrylic, Olefin, Nylon and Modacrylic are all synthetic fabric materials, but one retailer may call the product "acrylic" and another "synthetic". These often get placed in clusters or families based on material similarity, i.e. `is_synthetic`. |
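Two of the simpler features in the table above can be sketched with the standard library alone. This is a rough illustration and not the feature code used later in the article: the function names, the regular-expression tokeniser and the median-based price comparison are all assumptions.

```python
import re
from statistics import median


def jaccard_similarity(name_a, name_b):
    """Jaccard similarity between the sets of alphanumeric tokens in two names."""
    tokens_a = set(re.findall(r"[a-z0-9]+", name_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", name_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def price_deviation(price, other_prices):
    """Relative distance of a price from the median of the other prices."""
    typical = median(other_prices)
    return abs(price - typical) / typical


print(jaccard_similarity(
    "Sony WH-1000XM3 (Black)",
    "Sony WH-1000XM3 Wireless Noise Cancelling Headphones - Black",
))
print(price_deviation(4.99, [25.99, 27.50, 29.99]))
```

A low Jaccard score or a large price deviation does not rule out a match on its own, but both are cheap signals that a classifier can weigh alongside the other features.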
Several researchers in the product matching field have utilised schema.org markup from ecommerce sites. Recently, Ralph Peeters and his team from the University of Mannheim, Germany, used solely schema.org markup and were able to achieve the current state-of-the-art F1 score of 0.95 for product matching.
To create a product matching model we’ll need a range of packages, including the usual Pandas and Numpy for data manipulation. We’ll be using the Jellyfish and FuzzyWuzzy packages for calculating text similarities, such as Levenshtein distance, and a range of Scikit-Learn models and other machine learning packages.
import time
import re
import pandas as pd
import numpy as np
import jellyfish as jf
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
The dataset I’m using is a semi-synthetic product matching dataset I created from the PriceRunner product matching dataset found on Kaggle. I’ve explained how you can create a similar dataset in this article. The dataset comprises several fields, but we only need three columns: `external_name`, `internal_name`, and `match`.
The `external_name` column represents the product name that each retailer has given to the product, `internal_name` represents the retailer’s product name to which it is mapped, and `match` identifies whether the two product names match or not.
df = pd.read_csv('product_matching_synthetic.csv')
df.head()
external_name | internal_name | category_label | match | |
---|---|---|---|---|
0 | apple iphone 8 plus 64gb silver | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
1 | apple iphone 8 plus 64 gb spacegrau | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
2 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
3 | apple iphone 8 plus 64gb space grey | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
4 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
The feature engineering on this dataset is simpler than it would ordinarily be, because we’re only working with the product title, and do not have access to product attributes, prices, or images. I’ve used a range of different similarity metrics, such as Levenshtein distance and various token ratios, which give a score corresponding to the level of similarity between the names.
The other feature, added via the `matching_numbers()` function, came from an idea I read about in an article by Ertug Odabasi. It extracts the numeric data from the product names, e.g. 64, and measures how much the two sets of numbers overlap. This proved to be a valuable feature.
def matching_numbers(external_name, internal_name):
    external_numbers = set(re.findall(r'[0-9]+', external_name))
    internal_numbers = set(re.findall(r'[0-9]+', internal_name))
    union = external_numbers.union(internal_numbers)
    intersection = external_numbers.intersection(internal_numbers)
    if len(external_numbers)==0 and len(internal_numbers) == 0:
        return 1
    else:
        return (len(intersection)/ len(union))
def engineer_features(df):
    df['internal_name'] = df['internal_name'].str.lower()
    df['external_name'] = df['external_name'].str.lower()
    df['levenshtein_distance'] = df.apply(
        lambda x: jf.levenshtein_distance(x['external_name'],
                                          x['internal_name']), axis=1)
    df['damerau_levenshtein_distance'] = df.apply(
        lambda x: jf.damerau_levenshtein_distance(x['external_name'],
                                                  x['internal_name']), axis=1)
    df['hamming_distance'] = df.apply(
        lambda x: jf.hamming_distance(x['external_name'],
                                      x['internal_name']), axis=1)
    df['jaro_similarity'] = df.apply(
        lambda x: jf.jaro_similarity(x['external_name'],
                                     x['internal_name']), axis=1)
    df['jaro_winkler_similarity'] = df.apply(
        lambda x: jf.jaro_winkler_similarity(x['external_name'],
                                             x['internal_name']), axis=1)
    df['match_rating_comparison'] = df.apply(
        lambda x: jf.match_rating_comparison(x['external_name'],
                                             x['internal_name']), axis=1).fillna(0).astype(int)
    df['ratio'] = df.apply(
        lambda x: fuzz.ratio(x['external_name'],
                             x['internal_name']), axis=1)
    df['partial_ratio'] = df.apply(
        lambda x: fuzz.partial_ratio(x['external_name'],
                                     x['internal_name']), axis=1)
    df['token_sort_ratio'] = df.apply(
        lambda x: fuzz.token_sort_ratio(x['external_name'],
                                        x['internal_name']), axis=1)
    df['token_set_ratio'] = df.apply(
        lambda x: fuzz.token_set_ratio(x['external_name'],
                                       x['internal_name']), axis=1)
    df['w_ratio'] = df.apply(
        lambda x: fuzz.WRatio(x['external_name'],
                              x['internal_name']), axis=1)
    df['uq_ratio'] = df.apply(
        lambda x: fuzz.UQRatio(x['external_name'],
                               x['internal_name']), axis=1)
    df['q_ratio'] = df.apply(
        lambda x: fuzz.QRatio(x['external_name'],
                              x['internal_name']), axis=1)
    df['matching_numbers'] = df.apply(
        lambda x: matching_numbers(x['external_name'],
                                   x['internal_name']), axis=1)
    df['matching_numbers_log'] = (df['matching_numbers']+1).apply(np.log)
    df['log_fuzz_score'] = (df['ratio'] + df['partial_ratio'] +
                            df['token_sort_ratio'] + df['token_set_ratio']).apply(np.log)
    df['log_fuzz_score_numbers'] = df['log_fuzz_score'] + (df['matching_numbers']).apply(np.log)
    # Logs of zero produce infinities, which are replaced with zeros
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.fillna(value=0, inplace=True)
    return df
df = engineer_features(df)
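To make the `matching_numbers` feature more concrete, here is a small worked example. The function is restated in a compact but equivalent form so the snippet runs on its own, and the "256gb" and "iphone 7" names are hypothetical examples chosen to show partial and zero overlap.

```python
import re


def matching_numbers(external_name, internal_name):
    """Jaccard overlap of the digit sequences found in two product names."""
    external_numbers = set(re.findall(r'[0-9]+', external_name))
    internal_numbers = set(re.findall(r'[0-9]+', internal_name))
    union = external_numbers | internal_numbers
    if not union:
        return 1  # neither name contains a number
    return len(external_numbers & internal_numbers) / len(union)


internal = "apple iphone 8 plus 64gb"
examples = [
    "apple iphone 8 plus 64gb silver",   # numbers {8, 64} vs {8, 64}
    "apple iphone 8 plus 256gb silver",  # numbers {8, 256} vs {8, 64}
    "apple iphone 7 32gb",               # numbers {7, 32} vs {8, 64}
]
for external in examples:
    print(external, "->", round(matching_numbers(external, internal), 2))
```

Because model numbers and capacities tend to distinguish otherwise similar product names, a low overlap is a strong hint that two names describe different products even when the text is nearly identical.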
Examining the Pearson correlation coefficients between the features shows that quite a few are highly correlated with matches. The `matching_numbers` feature was the top performer, followed by various string similarity metrics. There’s some collinearity among these, so some of them should be removed through feature selection later.
df.corr(numeric_only=True)['match'].sort_values(ascending=False)
match 1.000000
matching_numbers_log 0.780429
matching_numbers 0.780419
token_set_ratio 0.731331
log_fuzz_score_numbers 0.701562
partial_ratio 0.669121
jaro_winkler_similarity 0.648080
jaro_similarity 0.597436
log_fuzz_score 0.597043
ratio 0.579865
q_ratio 0.576836
uq_ratio 0.576830
token_sort_ratio 0.572287
match_rating_comparison 0.477758
w_ratio 0.472424
hamming_distance -0.143762
damerau_levenshtein_distance -0.146023
levenshtein_distance -0.146245
Name: match, dtype: float64
Next, we’ll take the features above and add them to the `X` feature set and assign the `match` column to our target `y`. The test and training datasets can then be created in the usual manner using `train_test_split()`.
X = df[['levenshtein_distance', 'damerau_levenshtein_distance', 'hamming_distance',
'jaro_similarity','jaro_winkler_similarity','matching_numbers_log',
'matching_numbers','token_set_ratio','token_sort_ratio','partial_ratio',
'ratio','log_fuzz_score','log_fuzz_score_numbers','match_rating_comparison',
'q_ratio','uq_ratio','w_ratio']].values
y = df['match'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
To identify which models are likely to be best suited to the problem, we’ll loop through a selection of them, fit each one to the training data, make predictions on the test data, and save the results to a dataframe.
def get_confusion_matrix_values(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    return(cm[0][0], cm[0][1], cm[1][0], cm[1][1])
classifiers = {
"DummyClassifier_stratified":DummyClassifier(strategy='stratified', random_state=0),
"KNeighborsClassifier":KNeighborsClassifier(3),
"XGBClassifier":XGBClassifier(n_estimators=1000, learning_rate=0.1),
"DecisionTreeClassifier":DecisionTreeClassifier(),
"RandomForestClassifier":RandomForestClassifier(),
"AdaBoostClassifier":AdaBoostClassifier(),
"GradientBoostingClassifier":GradientBoostingClassifier(),
"Perceptron": Perceptron(max_iter=40, eta0=0.1, random_state=1),
"MLP": MLPClassifier(),
"XGBClassifer tuned": XGBClassifier(colsample_bytree=0.8,
gamma=0.9,
max_depth=20,
min_child_weight=1,
scale_pos_weight=12,
subsample=0.9,
n_estimators=50,
learning_rate=0.1)
}
df_results = pd.DataFrame(columns=['model', 'accuracy', 'mae', 'precision',
'recall','f1','roc','run_time','tp','fp',
'tn','fn'])
for key in classifiers:
    start_time = time.time()
    classifier = classifiers[key]
    model = classifier.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    roc = roc_auc_score(y_test, y_pred)
    classification = classification_report(y_test, y_pred, zero_division=0)
    run_time = format(round((time.time() - start_time)/60,2))
    # confusion_matrix() orders binary counts as [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = get_confusion_matrix_values(y_test, y_pred)
    row = {'model': key,
           'accuracy': accuracy,
           'mae': mae,
           'precision': precision,
           'recall': recall,
           'f1': f1,
           'roc': roc,
           'run_time': run_time,
           'tp': tp,
           'fp': fp,
           'tn': tn,
           'fn': fn,
           }
    # DataFrame.append() was removed in pandas 2.0, so concatenate instead
    df_results = pd.concat([df_results, pd.DataFrame([row])], ignore_index=True)
df_results.head(10)
model | accuracy | mae | precision | recall | f1 | roc | run_time | tp | fp | tn | fn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | DummyClassifier_stratified | 0.832219 | 0.167781 | 0.094909 | 0.095907 | 0.095406 | 0.501478 | 0.0 | 1031 | 9832 | 95945 | 9719 |
1 | KNeighborsClassifier | 0.980073 | 0.019927 | 0.908967 | 0.871256 | 0.889712 | 0.931194 | 0.17 | 9366 | 938 | 104839 | 1384 |
2 | XGBClassifier | 0.986149 | 0.013851 | 0.948811 | 0.898326 | 0.922878 | 0.946700 | 0.72 | 9657 | 521 | 105256 | 1093 |
3 | DecisionTreeClassifier | 0.978211 | 0.021789 | 0.873329 | 0.893395 | 0.883248 | 0.940113 | 0.07 | 9604 | 1393 | 104384 | 1146 |
4 | RandomForestClassifier | 0.986716 | 0.013284 | 0.946699 | 0.907070 | 0.926461 | 0.950940 | 1.05 | 9751 | 549 | 105228 | 999 |
5 | AdaBoostClassifier | 0.980391 | 0.019609 | 0.933081 | 0.848279 | 0.888662 | 0.921048 | 0.24 | 9119 | 654 | 105123 | 1631 |
6 | GradientBoostingClassifier | 0.983257 | 0.016743 | 0.944708 | 0.869395 | 0.905489 | 0.932112 | 0.92 | 9346 | 547 | 105230 | 1404 |
7 | Perceptron | 0.973131 | 0.026869 | 0.971765 | 0.729953 | 0.833679 | 0.863899 | 0.02 | 7847 | 228 | 105549 | 2903 |
8 | MLP | 0.983429 | 0.016571 | 0.951654 | 0.864279 | 0.905865 | 0.929908 | 2.23 | 9291 | 472 | 105305 | 1459 |
9 | XGBClassifer tuned | 0.984879 | 0.015121 | 0.906991 | 0.931628 | 0.919145 | 0.960959 | 0.16 | 10015 | 1027 | 104750 | 735 |
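The precision, recall, F1 and accuracy columns can all be derived from the four confusion matrix counts, which is a useful sanity check on the table. For a binary problem, scikit-learn's `confusion_matrix()` lays the counts out as `[[tn, fp], [fn, tp]]`. The sketch below uses a hypothetical `metrics_from_counts()` helper with the counts for the tuned XGBoost row: 10,015 true positives, 1,027 false positives, 735 false negatives and 104,750 true negatives.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive the headline classification metrics from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Counts for the "XGBClassifer tuned" row: matches found, false alarms,
# matches missed, and non-matches correctly rejected.
precision, recall, f1, accuracy = metrics_from_counts(tp=10015, fp=1027, fn=735, tn=104750)
print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} accuracy={accuracy:.6f}")
```

Because matches make up only around 9% of the test set, accuracy is high for every model, including ones that miss many matches, so precision, recall and F1 on the match class are the more informative columns.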
XGBoost was one of the strongest performers overall, so we’ll tune it to see if we can increase the score. To do this we need to know the correct `scale_pos_weight` value, which can be calculated from the ratio of the negative class to the positive class using the function I’ve written below.
def get_scale_pos_weight(target, square_root=False, gridsearch=False):
    """Return the scale_pos_weight parameter for the XGBoost model when data are imbalanced.
    The scale_pos_weight parameter is calculated from the ratio of the negative class over
    the positive class. The exact scale_pos_weight sometimes does not give the best result,
    so by passing the gridsearch=True parameter you can return a list of values to test with
    GridSearchCV. In addition, passing square_root=True changes the scale_pos_weight to the
    square root value, which can sometimes be beneficial on extremely imbalanced data.
    :param target: Pandas dataframe column containing the binary target
    :param square_root: Optional boolean parameter to convert to square root on extremely unbalanced data
    :param gridsearch: Optional boolean parameter to return a bracketed list for use in GridSearchCV
    Usage:
    scale_pos_weight = get_scale_pos_weight(df['target'], square_root=False, gridsearch=True)
    """
    import math
    scale_pos_weight = round((len(target) - sum(target)) / sum(target))
    if square_root:
        scale_pos_weight = round(math.sqrt(scale_pos_weight))
    if gridsearch:
        scale_pos_weight = [scale_pos_weight-2, scale_pos_weight-1, scale_pos_weight,
                            scale_pos_weight+1, scale_pos_weight+2]
    return scale_pos_weight
scale_pos_weight = get_scale_pos_weight(df['match'], square_root=False, gridsearch=True)
scale_pos_weight
[8, 9, 10, 11, 12]
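The arithmetic behind that list is easy to check by hand. The sketch below uses the class counts reported for the test set in the classification report further down (105,777 non-matches and 10,750 matches). The function above was run on the full dataset, but the stratified split should keep the class ratio much the same, so this is an approximation of the same calculation.

```python
import math

# Class counts taken from the test set's classification report; the full
# dataset is larger, but the stratified split keeps the class ratio similar.
negatives = 105777   # "not match" rows
positives = 10750    # "match" rows

ratio = negatives / positives
scale_pos_weight = round(ratio)
candidates = [scale_pos_weight + offset for offset in range(-2, 3)]
square_root_weight = round(math.sqrt(scale_pos_weight))

print(round(ratio, 2), scale_pos_weight, candidates, square_root_weight)
```

With roughly ten non-matches for every match, the weight tells XGBoost to treat an error on a match as about ten times more costly than an error on a non-match.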
The `scale_pos_weight` values can be added to the `param_grid` lists, and then GridSearchCV can be used to identify the combination of model parameters that generates the best score.
n_estimators = [50]
learning_rate = [0.1]
max_depth = [5, 10, 20]
min_child_weight = [1, 2]
scale_pos_weight = [8, 9, 10, 11, 12]
gamma = [0.9, 1.0]
subsample = [0.9]
colsample_bytree = [0.8, 1.0]
start = time.perf_counter()
param_grid = dict(
n_estimators=n_estimators,
learning_rate=learning_rate,
max_depth=max_depth,
min_child_weight=min_child_weight,
scale_pos_weight=scale_pos_weight,
gamma=gamma,
subsample=subsample,
colsample_bytree=colsample_bytree,
)
model = XGBClassifier(random_state=0)
grid_search = GridSearchCV(estimator=model,
param_grid=param_grid,
scoring='roc_auc',
)
print('Running GridSearchCV...')
best_model = grid_search.fit(X_train, y_train)
best_score = round(best_model.score(X_test, y_test), 4)
best_params = best_model.best_params_
print('Score:', best_score)
print('Optimum parameters', best_params)
finish = time.perf_counter()
run_time = (finish - start) / 60
print(f"Completed task in {run_time:0.4f} minutes")
Running GridSearchCV returned the below settings, so we’ll fit the XGBClassifier model using these on the training data and then make some predictions on the test dataset.
model = XGBClassifier(colsample_bytree=0.8,
gamma=0.9,
max_depth=20,
min_child_weight=1,
scale_pos_weight=12,
subsample=0.9,
n_estimators=50,
learning_rate=0.1)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
The classification report shows that the model scores pretty well (given that the current state of the art score is an F1 of 0.95), and we still have lots of other things to try.
print(classification_report(y_test, y_pred, labels=[1, 0],
target_names=['match', 'not match']))
precision recall f1-score support
match 0.91 0.93 0.92 10750
not match 0.99 0.99 0.99 105777
accuracy 0.98 116527
macro avg 0.95 0.96 0.96 116527
weighted avg 0.99 0.98 0.98 116527
results = pd.DataFrame(data={'predictions': y_pred, 'actual': y_test})
results['result'] = np.where(results['predictions']==results['actual'], 1, 0)
results.head(20)
predictions | actual | result | |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
5 | 0 | 0 | 1 |
6 | 0 | 0 | 1 |
7 | 0 | 0 | 1 |
8 | 0 | 0 | 1 |
9 | 1 | 0 | 0 |
10 | 0 | 0 | 1 |
11 | 1 | 1 | 1 |
12 | 0 | 0 | 1 |
13 | 0 | 0 | 1 |
14 | 0 | 0 | 1 |
15 | 0 | 0 | 1 |
16 | 0 | 0 | 1 |
17 | 0 | 0 | 1 |
18 | 0 | 0 | 1 |
19 | 0 | 0 | 1 |
Generating new predictions is definitely the trickiest bit to get your head around. The model predicts whether a single product name is a match against another single product name. Therefore, testing every product in a retailer’s product catalogue for a potential match could entail comparing each product sold by one retailer against each product sold by another via the model.
Let’s say that both retailers have a modest 10,000 SKUs each. That would require 10,000 x 10,000, or 100 million predictions. If you were comparing prices against 10 retailers with similar inventories, this would quickly become an extremely time-consuming and expensive process and just wouldn’t scale. We therefore need a way of reducing the number of comparisons that need to be made by filtering out those with a lower probability of being a match before making a final check via the model.
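The scale of the problem is easy to see with some quick arithmetic. The sketch below compares brute-force comparison with a shortlist approach; the `shortlist_size` of five is an assumption chosen to match the top-five shortlist used in the next step.

```python
skus_per_retailer = 10_000
competitors = 10
shortlist_size = 5  # assumed number of candidates kept per product

# Brute force: every internal product against every competitor product
brute_force = skus_per_retailer * skus_per_retailer * competitors

# Shortlisting: only the closest few candidates per competitor product reach the model
with_shortlist = skus_per_retailer * shortlist_size * competitors

print(f"{brute_force:,} vs {with_shortlist:,} model predictions")
print(f"reduction factor: {brute_force // with_shortlist:,}")
```

The shortlist itself still needs a cheap comparison against every candidate name, but that avoids calculating the full feature set and running the classifier for each pair.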
To help reduce the number of products that need to be compared via the model, I used the FuzzyWuzzy `token_set_ratio` scorer to take a single external product name, compare it to all the unique internal product names in the dataset, and return a sorted list of `token_set_ratio` scores.
Each of the closest matching `internal_name` values can then be compared to the `external_name` using the model to see if there’s a match or not. Because the function returns only the top five candidates, this greatly reduces the number of comparisons the model needs to make for each product in the dataset.
def get_closest_matches(external_name):
    unique_internal_names = df['internal_name'].unique().tolist()
    closest_matches = process.extract(external_name,
                                      unique_internal_names,
                                      scorer=fuzz.token_set_ratio)
    return closest_matches
Next we need to preprocess the data in the same way as we did during the training step. To do this I’ve assigned the external name we’re trying to match to the `external_name` column and added each `internal_name` returned by the `get_closest_matches()` function to the `internal_name` column.
def prepare_data(external_name):
    closest_matches = get_closest_matches(external_name)
    # Build all rows first; DataFrame.append() was removed in pandas 2.0
    rows = [{'external_name': external_name, 'internal_name': match[0]}
            for match in closest_matches]
    return pd.DataFrame(rows, columns=['external_name', 'internal_name'])
closest_data = prepare_data("apple iphone x")
closest_data.head()
external_name | internal_name | |
---|---|---|
0 | apple iphone x | apple iphone x 64gb |
1 | apple iphone x | apple iphone x 256gb |
2 | apple iphone x | apple iphone 8 plus 64gb |
3 | apple iphone x | apple iphone 7 plus 32gb |
4 | apple iphone x | apple iphone 7 32gb |
Now we can simply re-run the `engineer_features()` function we used during training to calculate the similarities between each of the product name pairs. That gives us data that’s formatted in the same way as our training data.
data = engineer_features(closest_data)
data = data[['levenshtein_distance', 'damerau_levenshtein_distance', 'hamming_distance',
'jaro_similarity','jaro_winkler_similarity','matching_numbers_log',
'matching_numbers','token_set_ratio','token_sort_ratio','partial_ratio',
'ratio','log_fuzz_score','log_fuzz_score_numbers','match_rating_comparison',
'q_ratio','uq_ratio','w_ratio']]
data.head()
Finally, we can generate new predictions by passing the dataframe to the model’s `predict_proba()` function to identify whether any of the product name pairs were identified as matches by the model. Based on the `external_name` “apple iphone x” we get two matches, which are both correct.
y_pred = model.predict_proba(data.values)[:,1]  # the model was trained on a NumPy array
data = data.assign(prediction=y_pred)
data = data.merge(closest_data)
data[['external_name','internal_name','prediction']].head()
external_name | internal_name | prediction | |
---|---|---|---|
0 | apple iphone x | apple iphone x 64gb | 0.995701 |
1 | apple iphone x | apple iphone x 256gb | 0.995701 |
2 | apple iphone x | apple iphone 8 plus 64gb | 0.100689 |
3 | apple iphone x | apple iphone 7 plus 32gb | 0.100689 |
4 | apple iphone x | apple iphone 8 plus 64gb | 0.100689 |
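To turn these probabilities into match decisions you need a cut-off. The sketch below applies an assumed threshold of 0.5 to the distinct candidates from the output above; the `accepted_matches()` helper and the threshold value are illustrative and would need tuning against your own tolerance for false matches.

```python
# Probabilities copied from the output above: (internal_name, prediction)
scored_candidates = [
    ("apple iphone x 64gb", 0.995701),
    ("apple iphone x 256gb", 0.995701),
    ("apple iphone 8 plus 64gb", 0.100689),
    ("apple iphone 7 plus 32gb", 0.100689),
]

THRESHOLD = 0.5  # assumed cut-off; raise it to favour precision over recall


def accepted_matches(candidates, threshold=THRESHOLD):
    """Keep candidates whose match probability meets the threshold, best first."""
    kept = [(name, p) for name, p in candidates if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for name, probability in accepted_matches(scored_candidates):
    print(f"{name}: {probability:.3f}")
```

Using the probability instead of the hard `predict()` output also lets you route borderline scores to manual review, leaving only the confident ones to be accepted or rejected automatically.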
Clearly, this prototype proof-of-concept model only uses a small number of the potential features you’d use in the real world, which is going to limit the performance of the model. Ordinarily, you’d spend weeks or months building such a model, so this is obviously imperfect, but it performs quite well. There are quite a few other things we could try to improve the results, such as adding some of the features described earlier that this dataset lacks, including product attributes, prices, and images.
Ristoski, P., Petrovski, P., Mika, P. and Paulheim, H., 2018. A machine learning approach for product matching and categorization. Semantic web, 9(5), pp.707-728.
Petrovski, P., Primpeli, A., Meusel, R. and Bizer, C., 2016, September. The WDC gold standards for product feature extraction and product matching. In International Conference on Electronic Commerce and Web Technologies (pp. 73-86). Springer, Cham. http://webdatacommons.org/productcorpus/#toc7
Peeters, R., Primpeli, A., Wichtlhuber, B. and Bizer, C., 2020, June. Using schema.org annotations for training and maintaining product matchers. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics (pp. 195-204).
Xu, D., Ruan, C., Korpeoglu, E., Kumar, S. and Achan, K., 2020, January. Product knowledge graph embedding for ecommerce. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 672-680).
Matt Clarke, Saturday, March 13, 2021