If you read research papers on machine learning, you’ll notice that many researchers use the same standard datasets so that other data scientists can reproduce their work or try to improve upon it.
There are various places you can track these down (see below). However, the datasets aren’t always curated, aren’t always free of charge, vary in quality, and might not include the right features for the model you’re aiming to build.
Sifting through all the dataset repositories to find the right one for the job can therefore be quite time-consuming. To make the dataset hunting process a little easier, I’ve compiled a selection of the most useful, interesting, or widely used datasets for creating models in the ecommerce, retail, and marketing sectors.
Transactional item datasets include a row for each item a customer purchased, so they can be used to construct secondary datasets on customers and products. Transactional item datasets can be used in a variety of models, including customer churn models, time series forecasting models, customer segmentation models, customer clustering models, product recommender systems, Market Basket Analysis, and Next-Product-To-Buy (NPTB) models.
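To illustrate how a secondary customer dataset can be built from transactional items, here is a minimal sketch using only the Python standard library. The column names (`order_id`, `customer_id`, `sku`, `quantity`, `unit_price`, `order_date`), the sample rows, and the `build_customer_dataset` helper are all hypothetical and not taken from any particular dataset; real datasets will need their own mapping.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactional item rows: one row per item purchased.
transactions = [
    {"order_id": "A1", "customer_id": "C1", "sku": "S1",
     "quantity": 2, "unit_price": 5.0, "order_date": date(2021, 1, 5)},
    {"order_id": "A1", "customer_id": "C1", "sku": "S2",
     "quantity": 1, "unit_price": 20.0, "order_date": date(2021, 1, 5)},
    {"order_id": "A2", "customer_id": "C1", "sku": "S1",
     "quantity": 1, "unit_price": 5.0, "order_date": date(2021, 2, 10)},
    {"order_id": "B1", "customer_id": "C2", "sku": "S3",
     "quantity": 3, "unit_price": 2.5, "order_date": date(2021, 1, 20)},
]


def build_customer_dataset(rows):
    """Aggregate item-level rows into one summary record per customer."""
    customers = defaultdict(
        lambda: {"order_ids": set(), "items": 0, "revenue": 0.0, "last_order": None}
    )
    for row in rows:
        record = customers[row["customer_id"]]
        record["order_ids"].add(row["order_id"])  # distinct orders, not rows
        record["items"] += row["quantity"]
        record["revenue"] += row["quantity"] * row["unit_price"]
        if record["last_order"] is None or row["order_date"] > record["last_order"]:
            record["last_order"] = row["order_date"]

    # Replace the set of order IDs with a simple order count.
    return {
        customer_id: {
            "orders": len(record["order_ids"]),
            "items": record["items"],
            "revenue": record["revenue"],
            "last_order": record["last_order"],
        }
        for customer_id, record in customers.items()
    }


customer_dataset = build_customer_dataset(transactions)
```

The resulting order count, revenue, and last order date per customer are the kind of features that segmentation or churn models typically start from. Grouping by `sku` instead of `customer_id` would give a product-level dataset in the same way.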
Marketing response datasets contain details on whether customers responded to a marketing campaign or not. Usually they’re based on the response to a single campaign, but sometimes they come in the form of a dataset with a built-in marketing test and control, which can be really useful in modeling.
Marketing response datasets can be used for: marketing response models, customer propensity models, uplift models, and marketing targeting models, such as those to identify Marketing Qualified Leads (MQLs).
Click-stream datasets include details on the actions customers performed when visiting a website, such as the pages they viewed, the items they searched for, or the products they added to their basket.
Click-stream datasets can be used in several ways, including predicting whether customers will purchase during their visit, Learning to Rank models which seek to examine search relevance and define the optimum order for search results, response models to predict who will purchase on a subsequent visit based on recent online activity, and in-visit recommender systems.
Product review datasets are great for improving your Natural Language Processing skills, particularly sentiment analysis. They can be either reviews from your own business or those from your competitors, as both can reveal interesting information to help you better understand customers and their expectations.
Product datasets take several forms and can comprise data on either an individual retailer’s products or products across multiple retailers. These datasets are used for product matching models, which aim to identify the same products sold at different retailers, and product attribute extraction models, which attempt to use NLP to extract useful information from product content to aid the customer experience.
When created from transactional item datasets, product datasets including sales figures can also be used to perform ABC inventory analysis and XYZ inventory analysis to aid operations managers in the control of stock.
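As a rough illustration of ABC inventory analysis, the sketch below ranks products by revenue and assigns a class based on each product's cumulative share of total revenue. The 80% and 95% cutoffs are a commonly used convention rather than a fixed rule, and the SKU names, revenue figures, and `abc_classify` helper are hypothetical. XYZ analysis, which looks at demand variability rather than revenue share, is not covered here.

```python
def abc_classify(revenue_by_sku, a_cutoff=0.80, b_cutoff=0.95):
    """Assign each SKU to class A, B, or C by cumulative revenue share.

    The cutoffs are assumptions: A covers SKUs up to 80% of cumulative
    revenue, B up to 95%, and C the remainder.
    """
    total = sum(revenue_by_sku.values())
    if total <= 0:
        raise ValueError("Total revenue must be positive")

    # Rank SKUs from highest to lowest revenue.
    ranked = sorted(revenue_by_sku.items(), key=lambda item: item[1], reverse=True)

    classes = {}
    cumulative = 0.0
    for sku, revenue in ranked:
        cumulative += revenue
        share = cumulative / total
        if share <= a_cutoff:
            classes[sku] = "A"
        elif share <= b_cutoff:
            classes[sku] = "B"
        else:
            classes[sku] = "C"
    return classes


# Hypothetical revenue per SKU, e.g. summed from a transactional item dataset.
revenue_by_sku = {"S1": 500.0, "S2": 250.0, "S3": 150.0, "S4": 60.0, "S5": 40.0}
abc_classes = abc_classify(revenue_by_sku)
```

With these made-up figures, S1 and S2 account for 75% of revenue and land in class A, S3 takes the cumulative share to 90% and lands in class B, and S4 and S5 fall into class C.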
Contractual churn datasets typically come from telecommunications providers and those retailers or SaaS businesses who offer subscriptions. The churn in these businesses is totally different to the churn in non-contractual ecommerce, as the time of a customer’s “death” (in CLV terms) is known for contractual businesses, while it can only be guessed for non-contractual businesses.
While most ecommerce retailers don’t create their own models to detect credit card fraud, seeing as the banks tackle it fairly well, credit card fraud datasets are very interesting nonetheless. They tend to be extremely imbalanced, with fraudulent transactions making up a tiny proportion of the overall volume, so they’re superb for learning imbalanced classification techniques such as SMOTE.
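To give a feel for what SMOTE does, here is a simplified sketch of its core idea: each synthetic minority sample is created by interpolating between an existing minority sample and one of its nearest minority neighbours. This is an illustration only, not the implementation found in any library; the `smote_sketch` function, its parameters, and the sample points are hypothetical, and a maintained implementation would be the better choice for real work.

```python
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating between neighbours.

    minority: list of equal-length numeric tuples (the minority class only).
    n_synthetic: number of synthetic samples to generate.
    k: number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("Need at least two minority samples to interpolate")

    rng = random.Random(seed)

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]

        # Find its k nearest neighbours among the other minority samples.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]

        # Place the new sample at a random point on the line between them.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class points (e.g. scaled features of fraud cases).
fraud_samples = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
synthetic_samples = smote_sketch(fraud_samples, n_synthetic=4)
```

Because every synthetic point lies on a line segment between two real minority samples, the new samples stay within the region the minority class already occupies instead of simply duplicating existing rows.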
Matt Clarke, Thursday, March 04, 2021