Product matching (or data matching) is a computational technique employing Natural Language Processing, machine learning, or deep learning, which aims to identify identical products being sold on different websites, where product names might not always be a perfect match.
It’s most commonly used in ecommerce by retailers who are scraping competitor prices and want to avoid the old-school and laborious technique of manually matching products. However, it’s also commonplace on online marketplaces and price comparison sites, such as Google Shopping and PriceRunner.
While the raw data for building product matching models exists, it isn’t formatted in a way that immediately lends itself to machine learning, where you can treat product matching as a binary classification problem: each pair of product names is either matching or non-matching.
To resolve this, we’ll create a synthetic dataset for product matching based on the PriceRunner Product Classification and Clustering dataset, which was created by researcher Leonidas Akritidis. Here’s how it’s done.
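Before building it, it may help to see the shape we are aiming for. The sketch below is only an illustration of the binary classification framing: each row pairs a vendor's name with an internal name and carries a 0/1 label. The two pairs are taken from the PriceRunner examples shown later in this post; the variable names `pairs`, `X`, and `y` are assumptions made for the example.

```python
import pandas as pd

# Two labelled pairs in the target format: one matching, one non-matching.
pairs = pd.DataFrame({
    'external_name': ['apple iphone 8 plus 64gb silver',
                      'neff k4316x7gb built under larder fridge'],
    'internal_name': ['Apple iPhone 8 Plus 64GB',
                      'Bosch KIL42VS30G Integrated'],
    'match': [1, 0],
})

# A classifier would take the two names as input and predict the label.
X = pairs[['external_name', 'internal_name']]
y = pairs['match']
print(pairs)
```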
The original dataset includes products from ShopMania, PriceRunner, and Skroutz. However, we’re just using the PriceRunner one here. This includes 35,311 product names from various third-party sellers listed on the PriceRunner platform, which are mapped to both a category (e.g. Mobile Phones) and a cluster representing the product variant (e.g. Apple iPhone 8 Plus 64GB). For a fuller model, you’d ordinarily bring in additional features, such as price and attributes, but they’re missing from this dataset.
import numpy as np
import pandas as pd
df_original = pd.read_csv('pricerunner_aggregate.csv',
                          names=['product_id', 'product_title', 'vendor_id', 'cluster_id',
                                 'cluster_label', 'category_id', 'category_label'])
df_original.head()
| | product_id | product_title | vendor_id | cluster_id | cluster_label | category_id | category_label |
---|---|---|---|---|---|---|---|
0 | 1 | apple iphone 8 plus 64gb silver | 1 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
3 | 4 | apple iphone 8 plus 64gb space grey | 4 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
df_original.shape
(35311, 7)
Next we’ll define the column names. We’ll refer to the product_title that vendors have used for the product as the external_name, and we’ll call cluster_label the internal_name.
df_correct = df_original[['product_title','cluster_label','category_label']].copy()
df_correct.rename(columns={'product_title': 'external_name',
                           'cluster_label': 'internal_name'}, inplace=True)
df_correct.head()
| | external_name | internal_name | category_label |
---|---|---|---|
0 | apple iphone 8 plus 64gb silver | Apple iPhone 8 Plus 64GB | Mobile Phones |
1 | apple iphone 8 plus 64 gb spacegrau | Apple iPhone 8 Plus 64GB | Mobile Phones |
2 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | Apple iPhone 8 Plus 64GB | Mobile Phones |
3 | apple iphone 8 plus 64gb space grey | Apple iPhone 8 Plus 64GB | Mobile Phones |
4 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | Apple iPhone 8 Plus 64GB | Mobile Phones |
Since these data have already been matched, we can now create a match column and set this to 1 for all of the records. That gives us 35,311 products in the positive class, which is a decent number. As shown by df_correct.internal_name.nunique(), these span 12,849 different products.
df_correct['match'] = 1
df_correct.head()
| | external_name | internal_name | category_label | match |
---|---|---|---|---|
0 | apple iphone 8 plus 64gb silver | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
1 | apple iphone 8 plus 64 gb spacegrau | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
2 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
3 | apple iphone 8 plus 64gb space grey | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
4 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | Apple iPhone 8 Plus 64GB | Mobile Phones | 1 |
df_correct.internal_name.nunique()
12849
To create synthetic data we can use the correctly matched data above and randomly reshuffle the internal name value so that the pairing is usually incorrect. If we perform that reshuffling within the product category the results will be a bit more realistic, so phone names won’t be paired with fridges. The function below does this for us. It creates n new sets of synthetic data (n being the iterations argument), assigns each row a new internal_name, and updates the match column accordingly.
def create_synthetic_data(df, iterations):
    """Creates synthetic training data from the correctly matched
    data by grouping on the category_label column and reshuffling
    the internal_name to create data that contain incorrect matches.
    """
    # Start with the original, correctly matched rows
    frames = [df]
    for _ in range(iterations):
        # Copy the matched pairs so the input DataFrame is never modified
        df_s = df[['external_name', 'internal_name', 'category_label']].copy()
        # Shuffle the internal names within each category using a groupby
        shuffled = df_s.groupby('category_label')['internal_name'].transform(np.random.permutation)
        # A row still matches only if the shuffle returned its original name
        df_s['match'] = np.where(df_s['internal_name'] == shuffled, 1, 0)
        # Replace the internal name with the shuffled one
        df_s['internal_name'] = shuffled
        frames.append(df_s)
    # DataFrame.append() was removed in pandas 2.0, so concatenate instead
    return pd.concat(frames)
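To see what the within-category shuffle does in isolation, here is a minimal sketch on a four-row toy frame. The phone labels marked "hypothetical" are made up for illustration, the fridge labels come from the output further down, and the seeded generator is an assumption added so the example is repeatable. The point is that names move between rows but never cross a category boundary.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'category_label': ['Mobile Phones', 'Mobile Phones', 'Fridges', 'Fridges'],
    'internal_name': ['Phone A (hypothetical)', 'Phone B (hypothetical)',
                      'Smeg FAB28RAZ1 Blue', 'Bosch KIL20V60 Integrated'],
})

# Seeded generator so repeated runs give the same shuffle (an assumption for the demo)
rng = np.random.default_rng(0)

# Permute the names separately inside each category
toy['shuffled'] = (
    toy.groupby('category_label')['internal_name']
       .transform(lambda s: rng.permutation(s.to_numpy()))
)

# A row is still a match only if it received its own name back
toy['match'] = (toy['internal_name'] == toy['shuffled']).astype(int)
print(toy)
```

With only two names per category, roughly half of the rows will land back on their own name; on the full dataset the categories are far larger, so chance matches are rare.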
Finally, we can run the function to create 10 new synthetic sets of shuffled data and append them to the original correctly mapped data. This gives us a new dataset containing 388,421 rows (the 35,311 original pairs plus 10 × 35,311 shuffled ones), of which 35,834 are correctly assigned, according to the Pandas value_counts() function. The 523 matches beyond the original 35,311 are shuffled rows that happened to land back on their own product. Changing the number of iterations will make the imbalance smaller or larger.
df_output = create_synthetic_data(df_correct, 10)
df_output.tail()
| | external_name | internal_name | category_label | match |
---|---|---|---|---|
35306 | smeg fab28 60cm retro style right hand hinge f... | Miele K 12020 S-1 White White | Fridges | 0 |
35307 | smeg fab28 60cm retro style left hand hinge fr... | Bosch KIL20V60 Integrated | Fridges | 0 |
35308 | smeg fab28 60cm retro style left hand hinge fr... | Smeg FAB28YUJ1 Retro | Fridges | 0 |
35309 | candy 60cm built under larder fridge cru160nek | Smeg FAB28RAZ1 Blue | Fridges | 0 |
35310 | neff k4316x7gb built under larder fridge | Bosch KIL42VS30G Integrated | Fridges | 0 |
df_output.shape
(388421, 4)
df_output.match.value_counts()
0 352587
1 35834
Name: match, dtype: int64
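The figures above can be checked with a little arithmetic, which also shows how the iterations argument drives the class balance. This is a rough sketch: `approx_positive_share` is a hypothetical helper, and it assumes chance matches are rare enough to ignore.

```python
def approx_positive_share(iterations):
    """Approximate share of matching rows, ignoring chance matches (hypothetical helper)."""
    return 1 / (iterations + 1)

n_matched = 35311   # rows in df_correct
iterations = 10
positives = 35834   # count of match == 1 reported by value_counts()

# One original copy plus one shuffled copy per iteration
total_rows = n_matched * (iterations + 1)

# Shuffled rows that happened to receive their own internal name back
chance_matches = positives - n_matched

positive_share = positives / total_rows

print(total_rows, chance_matches)
print(round(positive_share, 3), round(approx_positive_share(iterations), 3))
```

Under the same assumption, a single iteration would give a roughly balanced dataset, while 10 iterations leave about one match in every eleven rows.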
df_output.to_csv('product_matching_synthetic.csv', index=False)
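Because a shuffled row that lands back on its own product is identical to a row in the original matched data, the saved file will contain some repeated pairs. Whether that matters depends on the model, but one optional clean-up is to drop repeated name pairs before training. The sketch below shows the idea on a toy frame rather than the full dataset; the frame and the name `deduped` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'external_name': ['apple iphone 8 plus 64gb silver',
                      'apple iphone 8 plus 64gb silver',
                      'neff k4316x7gb built under larder fridge'],
    'internal_name': ['Apple iPhone 8 Plus 64GB',
                      'Apple iPhone 8 Plus 64GB',
                      'Bosch KIL42VS30G Integrated'],
    'category_label': ['Mobile Phones', 'Mobile Phones', 'Fridges'],
    'match': [1, 1, 0],
})

# Keep the first occurrence of each (external_name, internal_name) pair
deduped = (
    toy.drop_duplicates(subset=['external_name', 'internal_name'])
       .reset_index(drop=True)
)
print(deduped)
```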
L. Akritidis, A. Fevgas, P. Bozanis, “Effective Product Categorization with Importance Scores and Morphological Analysis of the Titles”, In Proceedings of the 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 213-220, 2018.
L. Akritidis, A. Fevgas, P. Bozanis, C. Makris, “A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles”, Artificial Intelligence Review (Springer), pp. 1-44, 2020.
Matt Clarke, Sunday, March 07, 2021