How to auto-generate product summaries using deep learning

Learn how to use Transformer models to automatically generate summaries from ecommerce product descriptions and save your copywriters days of time.

Picture by Anastase Maragos, Unsplash.

Several years ago, in one of my first Ecommerce Director roles, I worked with the ex-Myprotein founder to launch sports nutrition brand GoNutrition. As a “bootstrapped” startup, we were low on numbers, so we all handled various tasks, with me taking on the copywriting.

Keen to improve the site’s conversion rate after launch, I ran an A/B test based on my hypothesis that simpler, more accessible product copy might lead to more sales. To run the test, I had to rewrite dozens of product descriptions. This took me days, partly because the gym is not my natural habitat, and protein powders are not my normal sustenance…

On a larger site, this sort of rewriting or text summarising task could easily take weeks or months. I did this rewriting by hand, but if I were tackling the same task today, I’d consider using a deep learning Transformer model to semi-automate the process instead.

In this project, I’ll show you how effective this can be by using the BART model to automatically generate short product summaries from some of the original product copy I created.

Load the packages

Open a new Jupyter notebook and import the pandas and transformers packages. You’ll likely need to install Transformers first, which you can do from PyPI by entering !pip3 install transformers in a notebook cell (or pip3 install transformers, without the exclamation mark, in your terminal). To see more of the text in our dataframe, I’ve also set max_colwidth to 150.

import pandas as pd
from transformers import pipeline
pd.set_option('max_colwidth', 150)

Load the Transformer pipeline

Next, we’ll load the summarization pipeline from Hugging Face. This downloads a large pre-trained text summarisation model of around 1.3 GB, based on BART, a “denoising autoencoder” model developed in 2019 by Mike Lewis and co-authors.

This one line will download and set up our BART transformer so it’s ready to handle text summarisation out of the box. Building a model like this yourself would take an enormous dataset, a supercomputer, loads of powerful GPUs, and months of your time.

summarizer = pipeline("summarization")

Load the data

Now the model is ready, we can import our data. I’ve created a miniature dataset based on some of the original product descriptions I wrote for GoNutrition when I worked there. These aim to explain the features and benefits of a range of protein powders and pre-workout supplements to gym goers with various levels of sports nutrition expertise.

df = pd.read_csv('gonutrition.csv')
df.head()
product_name product_description
0 Whey Protein Isolate 90 What is Whey Protein Isolate? Whey Protein Isolate 90 is our highest quality whey protein powder and provides 23g of protein per 25g serving. This...
1 Whey Protein 80 What is Whey Protein 80? Whey Protein 80 is an ultra premium quality 80% whey protein powder exclusively from free range, grass fed cows providing...
2 Volt Preworkout™ What is Volt™? Our Volt pre workout formula includes 12 advanced active ingredients that work together to increase energy, mental focus and muscul...

Extract a product description

To see what we’re giving the model to work with, let’s take a look at the product description for Whey Protein Isolate 90. This is quite a long and technical description (something required in this market), so we need to truncate it. BART accepts at most 1,024 tokens of input, and a token usually spans several characters, so slicing to the first 1,024 characters is a crude but safe way to stay well within that limit.

text = df['product_description'][0]
truncated_text = text[:1024]
truncated_text
"What is Whey Protein Isolate? Whey Protein Isolate 90 is our highest quality whey protein powder and provides 23g of protein per 25g serving. This whey protein isolate powder is 90% protein and extremely low in fat and carbohydrates, with only 0.17g of carbs and 0.25g of fat per serving, making it ideal for those looking to lose fat and develop a more toned physique. It's a purer, higher end whey protein than our standard whey protein concentrate product GN Whey Protein 80, which contains more fat and carbs than isolate.  What's so special about your whey protein isolate? Whey Protein Isolate 90™ is one of the the highest quality whey's on the market today and is blended, packed and sealed for freshness in our state of the art production facility. This is what's known as an un-denatured whey protein isolate, so the protein within hasn't been damaged by heat and can be readily used by the body. It's a traceable product from free range, grass fed cows and is made from vegetarian sweet cheese. It's packed with a"

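The plain slice above stops mid-sentence ("It's packed with a"), and the model will try to summarise that fragment too. One alternative is to cut at the last sentence boundary that fits inside the limit. The sketch below is a minimal version of that idea, assuming sentences end with a full stop, question mark or exclamation mark followed by whitespace. The helper name truncate_at_sentence is hypothetical and is not part of pandas or transformers.

```python
import re

def truncate_at_sentence(text, limit=1024):
    """Cut text to at most `limit` characters, ending on a full sentence where possible."""
    if len(text) <= limit:
        return text
    # Look one character past the limit so a sentence ending exactly at the limit is found.
    window = text[:limit + 1]
    # Sentence-ending punctuation followed by whitespace; decimals like 0.17g are skipped.
    ends = [m.end() for m in re.finditer(r'[.?!](?=\s)', window)]
    if not ends:
        # No sentence boundary found: fall back to the plain character slice.
        return text[:limit]
    return text[:ends[-1]]

sample = "First sentence here. Second one has 0.17g of carbs. Third sentence is cut"
print(truncate_at_sentence(sample, limit=60))
```

In the notebook, this could replace the slice with truncated_text = truncate_at_sentence(text), at the cost of sometimes passing slightly less text to the model.
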
Generate a summary of the product description

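Next, we’ll generate a couple of product summaries. In the first one, I’ve set do_sample to True, so the model chooses each next token by sampling from its predicted probabilities, which means the output can vary from run to run. In this run, the summary stays close to the source wording and stitches together near-verbatim sentences from the description. To tidy up the text and remove some formatting issues I’ve used strip() and replace().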

summary = summarizer(truncated_text, min_length=50, max_length=100, do_sample=True)
summary[0]['summary_text'].strip().replace(' .', '.')
"Whey Protein Isolate 90 is our highest quality whey protein powder and provides 23g of protein per 25g serving. Whey protein isolate powder is 90% protein and extremely low in fat and carbohydrates, with only 0.17g of carbs and 0.25g of fat per serving. This is what's known as an un-denatured wheyprotein isolate, so the protein within hasn't been damaged by heat and can be readily used by the body."

The other approach we can use with the BART transformer model is to set do_sample to False. This switches to deterministic decoding (greedy or beam search, depending on the model’s configuration), so the same input always produces the same summary. In this run, BART picks out the key points and condenses them into a shorter summary.

summary = summarizer(truncated_text, min_length=50, max_length=100, do_sample=False)
summary[0]['summary_text'].strip().replace(' .', '.')
'Whey Protein Isolate 90 is 90% protein and extremely low in fat and carbohydrates, with only 0.17g of carbs and 0.25g of fat per serving. Ideal for those looking to lose fat and develop a more toned physique.'
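
The do_sample flag changes how each next token is chosen, not whether the model copies or paraphrases. The toy sketch below illustrates the difference using a made-up next-word probability table (the words and probabilities are invented for illustration and are not BART outputs). Greedy selection, the simplest deterministic strategy, always returns the most likely word, while sampling draws words in proportion to their probabilities and so can differ between runs unless the random generator is seeded.

```python
import random

# Hypothetical next-word probabilities, invented for illustration only.
next_word_probs = {"protein": 0.6, "powder": 0.3, "shake": 0.1}

def pick_greedy(probs):
    """Deterministic: always choose the single most likely word."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Stochastic: draw one word in proportion to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so the draws are repeatable
greedy_picks = {pick_greedy(next_word_probs) for _ in range(20)}
sampled_picks = [pick_sampled(next_word_probs, rng) for _ in range(20)]

print(greedy_picks)   # the same single word every time
print(sampled_picks)  # draws weighted towards the likelier words
```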

Generate summaries for all products

Finally, we can put this all together and create a function to run the model on each of the product descriptions in our original dataframe and generate a bespoke product summary. This takes just a second or two per product!

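def get_summary(text, min_length=50, max_length=100, do_sample=False):
    """Summarise the first 1,024 characters of text and tidy the result."""
    summary = summarizer(text[:1024],
                         min_length=min_length,
                         max_length=max_length,
                         do_sample=do_sample)
    # Remove stray whitespace and the space the model leaves before full stops.
    return summary[0]['summary_text'].strip().replace(' .', '.')

# Run the function on every description and store the result in a new column.
df['product_summary'] = df['product_description'].apply(get_summary)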
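
If the CSV contains blank or missing descriptions, pandas loads them as NaN, and slicing a float with text[:1024] raises a TypeError. Here is a minimal sketch of a guard, assuming an empty summary is acceptable for such rows. The names summarise_safely and fake_summarizer are hypothetical, and the stub stands in for the real model so the example runs without downloading anything.

```python
def summarise_safely(text, summarize_fn, max_chars=1024):
    """Return a summary for usable text, or '' for missing or blank input."""
    if not isinstance(text, str) or not text.strip():
        return ""
    return summarize_fn(text[:max_chars])

def fake_summarizer(text):
    """Stand-in for the real model: returns the first sentence only."""
    return text.split(". ")[0].rstrip(".") + "."

# One usable description, one missing value (NaN) and one blank string.
descriptions = ["Volt is a pre workout formula. It contains caffeine.", float("nan"), "   "]
summaries = [summarise_safely(d, fake_summarizer) for d in descriptions]
print(summaries)
```

With the function defined above, the same guard could be applied as df['product_description'].apply(lambda t: summarise_safely(t, get_summary)).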

Examine the results

After the model has run, we can inspect the product_summary we stored back in the original dataframe. These are slightly truncated in the dataframe, so we’ll also look at them individually to see how well the model compares to a human copywriter.

df = df[['product_name', 'product_summary']]
df.head()
product_name product_summary
0 Whey Protein Isolate 90 Whey Protein Isolate 90 is 90% protein and extremely low in fat and carbohydrates, with only 0.17g of carbs and 0.25g of fat per serving. Ideal fo...
1 Whey Protein 80 GN Whey Protein 80 is an 80% whey protein powder exclusively from free range, grass fed cows. Contains 20g of premium grade protein per 25g servin...
2 Volt Preworkout™ Volt enables you to achieve the ultimate workout so you can maximise your lean muscle, power and strength gains by training harder. Volt is ideal ...
Whey Protein Isolate 90
df['product_summary'][0]
'Whey Protein Isolate 90 is 90% protein and extremely low in fat and carbohydrates, with only 0.17g of carbs and 0.25g of fat per serving. Ideal for those looking to lose fat and develop a more toned physique.'
GN Whey Protein 80
df['product_summary'][1]
'GN Whey Protein 80 is an 80% whey protein powder exclusively from free range, grass fed cows. Contains 20g of premium grade protein per 25g serving and delivers an outstanding amino acid profile. Produced from the world renowned Müller dairy based in Germany.'
Volt Preworkout
df['product_summary'][2]
"Volt enables you to achieve the ultimate workout so you can maximise your lean muscle, power and strength gains by training harder. Volt is ideal for those who want to push themselves further when they're working out. However, it's potent and has a high caffeine content."

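Because the model generates text rather than copying it, a figure in a summary can occasionally differ from the source, which matters for nutritional claims. The sketch below is a rough pre-review check, assuming figures are written as digits with an optional unit suffix such as 23g or 90%. It flags any figure in a summary that never appears in the original description. The function name is hypothetical, and the example strings are shortened, made-up inputs rather than model output.

```python
import re

# A number, optionally with a decimal part and a unit (mg, kg, g or %).
FIGURE = re.compile(r'\d+(?:\.\d+)?(?:(?:mg|kg|g)\b|%)?')

def unsupported_figures(source, summary):
    """Return figures found in the summary that never occur in the source text."""
    source_figures = set(FIGURE.findall(source))
    return [fig for fig in FIGURE.findall(summary) if fig not in source_figures]

# Made-up strings for illustration only.
source = "Provides 23g of protein per 25g serving and is 90% protein."
faithful = "Provides 23g of protein per 25g serving."
altered = "Provides 32g of protein per 25g serving."

print(unsupported_figures(source, faithful))
print(unsupported_figures(source, altered))
```

Anything this returns is only a prompt for a human to check the summary against the original copy. It does not prove a summary is accurate.
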
The verdict

The BART transformer approach is pretty incredible. It has written product summaries that pick out the key points about these products. While I’m not suggesting it’s capable of replacing human copywriters, this approach could still be a huge time saver for ecommerce teams.

A little human editing could have these looking spot-on in a fraction of the time it would usually take. Most of them could use a little tweaking for SEO, but they’re really hard to tell apart from something the average ecommerce copywriter might produce.

The performance you get, of course, depends entirely on the quality and depth of information in your original copy. If there isn’t enough information to summarise, BART will have its work cut out, which means copywriters have nothing to fear about their jobs, for now.

Further reading

  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. and Zettlemoyer, L., 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Matt Clarke, Sunday, March 14, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
