How to create a dataset for product matching models

Picture by Oscar Dario, Unsplash.


Product matching (or data matching) is a computational technique employing Natural Language Processing, machine learning, or deep learning, which aims to identify identical products being sold on different websites, where product names might not always be a perfect match.

It’s most commonly used in ecommerce by retailers who are scraping competitor prices and want to avoid the old-school and laborious technique of manually matching products. However, it’s also commonplace on online marketplaces and price comparison sites, such as Google Shopping and PriceRunner.

While the raw data for building product matching models exists, it’s not formatted in a way that lends itself to machine learning, where product matching can be treated as a binary classification problem: each pair of product names is either matching or non-matching.

To resolve this, we’ll create a synthetic dataset for product matching based on the PriceRunner Product Classification and Clustering dataset, which was created by researcher Leonidas Akritidis. Here’s how it’s done.

Load the original data

Leonidas’ original dataset includes products from ShopMania, PriceRunner, and Skroutz. However, we’re just using the PriceRunner one here. This includes 35,311 product names from various third-party sellers listed on the PriceRunner platform, which are mapped to both a category (e.g. Mobile Phones) and a cluster representing the product variant (e.g. Apple iPhone 8 Plus 64GB). For a fuller model, you’d ordinarily bring in additional features, such as price and attributes, but they’re missing from this dataset.

import numpy as np
import pandas as pd

df_original = pd.read_csv('pricerunner_aggregate.csv',
                         names=['product_id','product_title','vendor_id','cluster_id',
                         'cluster_label','category_id','category_label'])
df_original.head()

	product_id	product_title	vendor_id	cluster_id	cluster_label	category_id	category_label
0	1	apple iphone 8 plus 64gb silver	1	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
1	2	apple iphone 8 plus 64 gb spacegrau	2	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
2	3	apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...	3	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
3	4	apple iphone 8 plus 64gb space grey	4	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones
4	5	apple iphone 8 plus gold 5.5 64gb 4g unlocked ...	5	1	Apple iPhone 8 Plus 64GB	2612	Mobile Phones

df_original.shape

(35311, 7)
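As a quick sanity check on the category and cluster structure described above, it can help to count how many vendor titles map to each cluster and how many clusters sit in each category. The sketch below uses a small hand-made dataframe (df_toy, with made-up rows) that borrows the column names from the real file, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_original: illustrative rows, not taken from the real file
df_toy = pd.DataFrame({
    'product_title': ['apple iphone 8 plus 64gb silver',
                      'apple iphone 8 plus 64 gb spacegrau',
                      'smeg fab28 retro fridge'],
    'cluster_label': ['Apple iPhone 8 Plus 64GB',
                      'Apple iPhone 8 Plus 64GB',
                      'Smeg FAB28YUJ1 Retro'],
    'category_label': ['Mobile Phones', 'Mobile Phones', 'Fridges'],
})

# Number of vendor titles that map to each cluster (product variant)
titles_per_cluster = df_toy.groupby('cluster_label')['product_title'].count()

# Number of distinct clusters within each category
clusters_per_category = df_toy.groupby('category_label')['cluster_label'].nunique()

print(titles_per_cluster)
print(clusters_per_category)
```

Running the same two groupby calls on df_original should show how unevenly titles are spread across clusters, which matters later because the shuffle works within each category.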

Separate the core fields

Next we’ll define the column names. We’ll refer to the product_title that vendors have used for the product as the external_name and we’ll call cluster_label the internal_name.

df_correct = df_original[['product_title','cluster_label','category_label']].copy()
df_correct.rename(columns={'product_title': 'external_name', 
                   'cluster_label': 'internal_name'}, inplace=True)
df_correct.head()

	external_name	internal_name	category_label
0	apple iphone 8 plus 64gb silver	Apple iPhone 8 Plus 64GB	Mobile Phones
1	apple iphone 8 plus 64 gb spacegrau	Apple iPhone 8 Plus 64GB	Mobile Phones
2	apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...	Apple iPhone 8 Plus 64GB	Mobile Phones
3	apple iphone 8 plus 64gb space grey	Apple iPhone 8 Plus 64GB	Mobile Phones
4	apple iphone 8 plus gold 5.5 64gb 4g unlocked ...	Apple iPhone 8 Plus 64GB	Mobile Phones

Label the correct predictions

Since these data have already been matched, we can now create a match column and set this to 1 for all of the records. That gives us 35,311 products in the positive class, which is a decent number. As shown by df_correct.internal_name.nunique(), these span 12,849 different products.

df_correct['match'] = 1
df_correct.head()

	external_name	internal_name	category_label	match
0	apple iphone 8 plus 64gb silver	Apple iPhone 8 Plus 64GB	Mobile Phones	1
1	apple iphone 8 plus 64 gb spacegrau	Apple iPhone 8 Plus 64GB	Mobile Phones	1
2	apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...	Apple iPhone 8 Plus 64GB	Mobile Phones	1
3	apple iphone 8 plus 64gb space grey	Apple iPhone 8 Plus 64GB	Mobile Phones	1
4	apple iphone 8 plus gold 5.5 64gb 4g unlocked ...	Apple iPhone 8 Plus 64GB	Mobile Phones	1

df_correct.internal_name.nunique()

12849

Create incorrect data

To create synthetic data we can take the correctly matched data above and randomly reshuffle the internal_name values so that most pairs become incorrect. If we perform that reshuffling within the product category, the results will be a bit more realistic, so mentions of “phone” won’t be paired with fridges. The function below does this for us. On each iteration it creates a new shuffled copy of the data, assigns a new internal_name, and sets the match column accordingly.

def create_synthetic_data(df, iterations):
    """Creates synthetic training data from the correctly matched
    data by grouping on the category_label column and reshuffling
    the internal_name to create data that contain incorrect matches.
    """

    # Start with the correctly matched rows, then add one shuffled copy per iteration
    frames = [df]

    for _ in range(iterations):

        # Shuffle internal_name within each category using a groupby
        df_s = df[['external_name','internal_name','category_label']].copy()
        shuffled = df_s.groupby('category_label')['internal_name'].transform(np.random.permutation)

        # Label the row 1 only if the shuffle happened to keep the correct name
        df_s['match'] = np.where(df_s['internal_name'] == shuffled, 1, 0)

        # Replace the internal name with the shuffled one
        df_s['internal_name'] = shuffled

        frames.append(df_s)

    # DataFrame.append() was removed in pandas 2.0, so use pd.concat() instead
    return pd.concat(frames)
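To see what the within-category shuffle does, here is a minimal sketch on a four-row toy dataframe. The rows, the df_toy and df_shuffled names, and the seed value are all assumptions made for illustration. It uses a seeded NumPy generator rather than np.random.permutation so the result is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator so the shuffle is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)

# Toy correctly matched data: illustrative rows only
df_toy = pd.DataFrame({
    'external_name': ['iphone 8 plus 64gb', 'galaxy s9 64gb',
                      'smeg fab28 fridge', 'bosch kil20 fridge'],
    'internal_name': ['Apple iPhone 8 Plus 64GB', 'Samsung Galaxy S9 64GB',
                      'Smeg FAB28', 'Bosch KIL20V60'],
    'category_label': ['Mobile Phones', 'Mobile Phones', 'Fridges', 'Fridges'],
})

# Permute internal_name separately inside each category
shuffled = df_toy.groupby('category_label')['internal_name'].transform(
    lambda s: rng.permutation(s.to_numpy()))

# match is 1 only where the shuffle left the correct name in place
df_shuffled = df_toy.assign(
    match=(df_toy['internal_name'] == shuffled).astype(int),
    internal_name=shuffled)

print(df_shuffled)
```

Because the permutation is applied per group, a phone title can only ever be paired with another phone name, and some rows will keep their correct name by chance, which is why the match column has to be recomputed rather than set to 0.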

Finally, we can run the function to create 10 new synthetic sets of shuffled data and append them to the original correctly mapped data. This gives us a new dataset containing 388,421 rows (11 × 35,311), of which 35,834 are labelled as matches according to the Pandas value_counts() function: the 35,311 original rows plus 523 shuffled rows that happened to land on the correct name. Changing the number of iterations will make the class imbalance smaller or larger.

df_output = create_synthetic_data(df_correct, 10)
df_output.tail()

	external_name	internal_name	category_label
35306	smeg fab28 60cm retro style right hand hinge f...	Miele K 12020 S-1 White White	Fridges
35307	smeg fab28 60cm retro style left hand hinge fr...	Bosch KIL20V60 Integrated	Fridges
35308	smeg fab28 60cm retro style left hand hinge fr...	Smeg FAB28YUJ1 Retro	Fridges
35309	candy 60cm built under larder fridge cru160nek	Smeg FAB28RAZ1 Blue	Fridges
35310	neff k4316x7gb built under larder fridge	Bosch KIL42VS30G Integrated	Fridges

df_output.shape

(388421, 4)

df_output.match.value_counts()

0    352587
1     35834
Name: match, dtype: int64

df_output.to_csv('product_matching_synthetic.csv', index=False)
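Two things may be worth checking before training on this file: how imbalanced the classes are, and whether repeated shuffles have produced identical pairs. In the output above the positive class is 35,834 of 388,421 rows, or roughly 9.2%. The sketch below shows both checks on a tiny made-up dataframe. The df_pairs and df_unique names are hypothetical, and whether to drop duplicates is a judgment call rather than part of the original method.

```python
import pandas as pd

# Toy output of the shuffling step: illustrative rows only
df_pairs = pd.DataFrame({
    'external_name': ['a', 'a', 'a', 'b', 'b'],
    'internal_name': ['A', 'A', 'B', 'B', 'A'],
    'category_label': ['X'] * 5,
    'match': [1, 1, 0, 1, 0],
})

# Remove repeated (external_name, internal_name) pairs created by chance
df_unique = df_pairs.drop_duplicates(subset=['external_name', 'internal_name'])

# Share of rows in the positive class after de-duplication
positive_share = df_unique['match'].mean()

print(len(df_pairs), len(df_unique), positive_share)
```

Applied to df_output, the same two lines would show how many pairs are repeated and what the class balance looks like once they are removed.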
