How to create a dataset for product matching models

Datasets for the product matching models required to verify price comparisons are hard to find. Here's how you can create one to train your model.


Product matching (or data matching) is a computational technique employing Natural Language Processing, machine learning, or deep learning, which aims to identify identical products being sold on different websites, where product names might not always be a perfect match.

It’s most commonly used in ecommerce by retailers who are scraping competitor prices and want to avoid the old-school and laborious technique of manually matching products. However, it’s also commonplace on online marketplaces and price comparison sites, such as Google Shopping and PriceRunner.

While the raw data for creating product matching models exists, it isn’t formatted in a way that lends itself to machine learning, where product matching can be treated as a binary classification problem, i.e. each pair of names is either matching or non-matching.

To resolve this, we’ll create a synthetic dataset for product matching based on the PriceRunner Product Classification and Clustering dataset, which was created by researcher Leonidas Akritidis. Here’s how it’s done.
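Before loading the data, it may help to picture the target format. The sketch below is only an illustration: the `pairs` frame and the Samsung row are made up, and the case-insensitive comparison is a naive baseline added here to show why exact string equality is not enough.

```python
import pandas as pd

# Hypothetical labelled pairs: 1 = same product, 0 = different product
pairs = pd.DataFrame({
    'external_name': ['apple iphone 8 plus 64 gb spacegrau',
                      'apple iphone 8 plus 64gb silver',
                      'apple iphone 8 plus 64gb silver'],
    'internal_name': ['Apple iPhone 8 Plus 64GB',
                      'Apple iPhone 8 Plus 64GB',
                      'Samsung Galaxy S8 64GB'],  # made-up non-matching name
    'match': [1, 1, 0],
})

# Naive baseline: case-insensitive exact equality of the two names
pairs['exact'] = (pairs['external_name'].str.lower()
                  == pairs['internal_name'].str.lower()).astype(int)

print(pairs[['match', 'exact']])
```

The baseline flags none of the pairs, even though two of them are true matches, which is the gap a trained classifier is meant to close.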

Load the original data

Leonidas’ original dataset includes products from ShopMania, PriceRunner, and Skroutz. However, we’re just using the PriceRunner data here. This includes 35,311 product names from various third-party sellers listed on the PriceRunner platform, which are mapped to both a category (e.g. Mobile Phones) and a cluster representing the product variant (e.g. Apple iPhone 8 Plus 64GB). For a fuller model, you’d ordinarily bring in additional features, such as price and attributes, but they’re missing from this dataset.

import numpy as np
import pandas as pd

df_original = pd.read_csv('pricerunner_aggregate.csv',
                         names=['product_id','product_title','vendor_id','cluster_id',
                         'cluster_label','category_id','category_label'])
df_original.head()
   product_id  product_title                                      vendor_id  cluster_id  cluster_label             category_id  category_label
0  1           apple iphone 8 plus 64gb silver                    1          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
1  2           apple iphone 8 plus 64 gb spacegrau                2          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
2  3           apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...  3          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
3  4           apple iphone 8 plus 64gb space grey                4          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
4  5           apple iphone 8 plus gold 5.5 64gb 4g unlocked ...  5          1           Apple iPhone 8 Plus 64GB  2612         Mobile Phones
df_original.shape
(35311, 7)
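As a quick sanity check on the category and cluster structure described above, something like the following could be run. This sketch uses a tiny hand-made frame (`df_toy`, with illustrative rows) in place of the real file, so the counts it prints say nothing about the actual dataset.

```python
import pandas as pd

# Toy rows that mimic the layout of the loaded data (illustrative values only)
df_toy = pd.DataFrame({
    'product_title': ['apple iphone 8 plus 64gb silver',
                      'apple iphone 8 plus 64 gb spacegrau',
                      'samsung galaxy s8 64gb black',
                      'smeg fab28 retro fridge blue'],
    'cluster_label': ['Apple iPhone 8 Plus 64GB',
                      'Apple iPhone 8 Plus 64GB',
                      'Samsung Galaxy S8 64GB',
                      'Smeg FAB28RAZ1 Blue'],
    'category_label': ['Mobile Phones', 'Mobile Phones',
                       'Mobile Phones', 'Fridges'],
})

# Count the vendor titles and the distinct clusters in each category
summary = df_toy.groupby('category_label').agg(
    titles=('product_title', 'count'),
    clusters=('cluster_label', 'nunique'),
)
print(summary)
```

Swapping `df_toy` for `df_original` should give the same summary for the full data.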

Separate the core fields

Next we’ll keep only the columns we need and rename them. We’ll refer to the product_title that vendors have used for the product as the external_name and we’ll call cluster_label the internal_name.

df_correct = df_original[['product_title','cluster_label','category_label']].copy()
df_correct.rename(columns={'product_title': 'external_name', 
                   'cluster_label': 'internal_name'}, inplace=True)
df_correct.head()
   external_name                                      internal_name             category_label
0  apple iphone 8 plus 64gb silver                    Apple iPhone 8 Plus 64GB  Mobile Phones
1  apple iphone 8 plus 64 gb spacegrau                Apple iPhone 8 Plus 64GB  Mobile Phones
2  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...  Apple iPhone 8 Plus 64GB  Mobile Phones
3  apple iphone 8 plus 64gb space grey                Apple iPhone 8 Plus 64GB  Mobile Phones
4  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...  Apple iPhone 8 Plus 64GB  Mobile Phones

Label the correct predictions

Since these data have already been matched, we can now create a match column and set it to 1 for every record. That gives us 35,311 pairs in the positive class, which is a decent number. As shown by df_correct.internal_name.nunique(), these span 12,849 different products.

df_correct['match'] = 1
df_correct.head()
   external_name                                      internal_name             category_label  match
0  apple iphone 8 plus 64gb silver                    Apple iPhone 8 Plus 64GB  Mobile Phones   1
1  apple iphone 8 plus 64 gb spacegrau                Apple iPhone 8 Plus 64GB  Mobile Phones   1
2  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...  Apple iPhone 8 Plus 64GB  Mobile Phones   1
3  apple iphone 8 plus 64gb space grey                Apple iPhone 8 Plus 64GB  Mobile Phones   1
4  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...  Apple iPhone 8 Plus 64GB  Mobile Phones   1
df_correct.internal_name.nunique()
12849

Create incorrect data

To create synthetic data, we can take the correctly matched data above and randomly reshuffle the internal_name values so that most pairings become incorrect. If we perform the reshuffling within each product category, the results will be a bit more realistic, so a “phone” won’t be paired with a fridge. The function below does this for us. It creates the requested number of shuffled copies of the data (the iterations argument), assigns each row a new internal_name, and sets the match column accordingly.

def create_synthetic_data(df, iterations):
    """Creates synthetic training data from the correctly matched
    data by grouping on the category_label column and reshuffling
    the internal_name to create data that contain incorrect matches.
    """

    # Start with the original, correctly matched rows
    frames = [df]

    for _ in range(iterations):

        # Copy the core columns so the original data are not modified
        df_s = df[['external_name','internal_name','category_label']].copy()

        # Shuffle the internal names within each product category
        shuffled = df_s.groupby('category_label')['internal_name'].transform(np.random.permutation)

        # A shuffled name that still equals the original is a correct match
        df_s['match'] = np.where(df_s['internal_name'] == shuffled, 1, 0)

        # Replace the internal name with the shuffled value
        df_s['internal_name'] = shuffled

        frames.append(df_s)

    # DataFrame.append() was removed in pandas 2.0, so use pd.concat()
    return pd.concat(frames)
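
To see what the within-category shuffle does on its own, here is a minimal sketch on a toy frame. It swaps np.random.permutation for a seeded np.random.default_rng generator so the result is repeatable; the seed, the `df_demo` name and the toy rows are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Seeded generator so the shuffle is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(42)

# Toy data: three phones and two fridges (illustrative rows only)
df_demo = pd.DataFrame({
    'external_name': ['iphone 8 plus 64gb silver', 'galaxy s8 64gb black',
                      'pixel 2 64gb white', 'smeg fab28 fridge blue',
                      'bosch integrated fridge'],
    'internal_name': ['Apple iPhone 8 Plus 64GB', 'Samsung Galaxy S8 64GB',
                      'Google Pixel 2 64GB', 'Smeg FAB28RAZ1 Blue',
                      'Bosch KIL20V60 Integrated'],
    'category_label': ['Mobile Phones', 'Mobile Phones', 'Mobile Phones',
                       'Fridges', 'Fridges'],
})

# Shuffle the internal names inside each category only
shuffled = df_demo.groupby('category_label')['internal_name'].transform(
    lambda s: rng.permutation(s.to_numpy())
)

# Rows whose shuffled name equals the original are still correct matches
df_demo['match'] = (df_demo['internal_name'] == shuffled).astype(int)
df_demo['internal_name'] = shuffled

print(df_demo)
```

Phone titles only ever receive phone names and fridge titles only receive fridge names, and any row that happens to keep its own name stays labelled as a match.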

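Finally, we can run the function to create 10 synthetic sets of shuffled data and append them to the original, correctly matched data. This gives us a new dataset containing 388,421 rows (35,311 × 11), of which 35,834 are correct matches according to the Pandas value_counts() function: the 35,311 originals plus 523 shuffled rows that happened to land on the right internal name. Because the shuffle is random and unseeded, the exact split will differ slightly between runs. Changing the number of iterations will make the class imbalance smaller or larger.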

df_output = create_synthetic_data(df_correct, 10)
df_output.tail()
       external_name                                      internal_name                  category_label  match
35306  smeg fab28 60cm retro style right hand hinge f...  Miele K 12020 S-1 White White  Fridges         0
35307  smeg fab28 60cm retro style left hand hinge fr...  Bosch KIL20V60 Integrated      Fridges         0
35308  smeg fab28 60cm retro style left hand hinge fr...  Smeg FAB28YUJ1 Retro           Fridges         0
35309  candy 60cm built under larder fridge cru160nek     Smeg FAB28RAZ1 Blue            Fridges         0
35310  neff k4316x7gb built under larder fridge           Bosch KIL42VS30G Integrated    Fridges         0
df_output.shape
(388421, 4)
df_output.match.value_counts()
0    352587
1     35834
Name: match, dtype: int64
df_output.to_csv('product_matching_synthetic.csv', index=False)
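
The row counts above follow from simple arithmetic: each iteration appends one shuffled copy of the original rows, so the total is the original row count multiplied by iterations + 1. The sketch below reproduces the figures reported in this article; `expected_rows` is a hypothetical helper name, and the positive count is the one from this particular run.

```python
def expected_rows(n_original, iterations):
    """Total rows after appending `iterations` shuffled copies to the original data."""
    return n_original * (iterations + 1)

n_original = 35311   # rows in the correctly matched data
iterations = 10      # shuffled copies appended
n_positive = 35834   # match == 1 count reported by value_counts() in this run

total = expected_rows(n_original, iterations)
chance_matches = n_positive - n_original   # shuffled rows that kept the right name
positive_share = n_positive / total        # fraction of rows in the positive class

print(total)            # 388421
print(chance_matches)   # 523
print(f"{positive_share:.1%}")
```

The same helper gives a rough feel for how the imbalance shifts with other iteration counts, since the positive class stays close to the original row count while the total keeps growing.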

Further reading

  • L. Akritidis, A. Fevgas, P. Bozanis, “Effective Product Categorization with Importance Scores and Morphological Analysis of the Titles”, In Proceedings of the 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 213-220, 2018.

  • L. Akritidis, A. Fevgas, P. Bozanis, C. Makris, “A Self-Verifying Clustering Approach to Unsupervised Matching of Product Titles”, Artificial Intelligence Review (Springer), pp. 1-44, 2020.

Matt Clarke, Sunday, March 07, 2021
