How to find spelling and grammar issues on product pages

Spelling and grammar issues on product detail pages can make your site look unprofessional. Here’s how you can use Pandas and LanguageTool to find them.

Ecommerce copywriters are busy people and, unlike magazine writers, rarely have eagle-eyed sub-editors to check their copy for spelling mistakes or grammatical issues. As a result, every online retailer will have the odd typo here and there.

While spelling and grammar issues may go unnoticed by most people, they can make your site look unprofessional, especially if there are lots of them. They’re also really hard to find, as you’d need to proofread or spell check every page.

Thankfully, there’s a data science technique you can use to automate the proofreading and spell checking of your product page content. It can even correct the mistakes it finds. Here’s how it works.

Installation

For this project we’ll be using a Python package called language_tool_python. This is a wrapper for the Java-based LanguageTool application, an open source grammar and spell checker that is also available as an extension for OpenOffice and LibreOffice. You can install it via pip.

!pip3 install language_tool_python

By default, the language_tool_python package downloads the LanguageTool server .jar file and runs it locally, which requires Java on your machine. However, there’s also a public API you can use if you don’t want to run the server yourself.

Load the packages

We only need pandas for manipulating our text data, and the language_tool_python package. After importing these, you’ll need to define the tool you use. I’m using the public API rather than the Java application below, and I’ve set the language to en-GB.

import pandas as pd
import language_tool_python
tool = language_tool_python.LanguageToolPublicAPI('en-GB')

Load the data

Next, load up a Pandas dataframe containing the product content from your product detail pages. I’ve grabbed a tiny selection from the John Lewis website to use for test purposes and have loaded the product descriptions in a column called content.

df = pd.read_csv('products.csv')
df.head()
content
0 The portable Google Pixelbook Go laptop has be...
1 The ASUS Chromebook has a 14” Full HD touch di...
2 The Lenovo S345 Chromebook laptop uses Google’...
3 The Acer 314 laptop has been created with a ba...

Check a page for spelling and grammar issues

First, we’ll grab one of the product descriptions from our dataframe using iloc to select the specific row at index 0, and we’ll assign the content column to a variable called text.

text = df.iloc[0]['content']
text
'The portable Google Pixelbook Go laptop has been created with a battery that will 
last up to 12 hours, an Intel Core M3 processor, a 13.3” Full HD touch display, a 2MP 
web cam, along with microphones that offer improved noise cancellation and dual 
front-firing speakers for immersive sound. This device uses Google’s Chrome OS 
(Operating System), which means access to your favourite Android apps via the 
Google Play Store.'

Next we’ll use the check() method and pass in our text variable containing the product description. This returns a list of Match objects, each containing the details of an issue found, its exact position within the text, and some recommended replacements.

matches = tool.check(text)
matches
[Match({'ruleId': 'EN_COMPOUNDS', 'message': 'This word is normally spelled as one.', 
'replacements': ['webcam'], 'offsetInContext': 43, 'context': '...r, a 13.3” Full HD touch 
display, a 2MP web cam, along with microphones that offer impr...', 'offset': 168, 'errorLength': 
7, 'category': 'MISC', 'ruleIssueType': 'misspelling', 'sentence': 'The portable Google Pixelbook 
Go laptop has been created with a battery that will last up to 12 hours, an Intel Core M3 
processor, a 13.3” Full HD touch display, a 2MP web cam, along with microphones that offer 
improved noise cancellation and dual front-firing speakers for immersive sound.'})]

We can loop through each match in the list using a for loop. Here, LanguageTool has identified that “web cam” has been written as two words, rather than as the single word “webcam”.

for match in matches:
    print(match)
Offset 168, length 7, Rule ID: EN_COMPOUNDS
Message: This word is normally spelled as one.
Suggestion: webcam
...r, a 13.3” Full HD touch display, a 2MP web cam, along with microphones that offer impr...
                                           ^^^^^^^
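
The offset and length reported for a match are character positions in the original string, so they can be used to slice out the flagged text directly. Here is a minimal sketch of that idea. To keep it runnable without LanguageTool, it uses a plain dict as a stand-in for the match (an assumption for illustration; the values are copied from the output above) and a shortened copy of the product description.

```python
# The product description from above, shortened to end just after the flagged words.
text = (
    "The portable Google Pixelbook Go laptop has been created with a battery "
    "that will last up to 12 hours, an Intel Core M3 processor, a 13.3” Full HD "
    "touch display, a 2MP web cam, along with microphones"
)

# Stand-in for one match: a plain dict holding values copied from the output above.
match = {"offset": 168, "errorLength": 7, "replacements": ["webcam"]}

start = match["offset"]
end = start + match["errorLength"]

# The exact characters the rule flagged.
flagged = text[start:end]

# Swap the flagged span for the first suggested replacement.
suggestion = match["replacements"][0]
fixed = text[:start] + suggestion + text[end:]

print(repr(flagged))
print(fixed)
```

With a real Match object you would read the same values from the object itself. Check which attribute names your installed version of language_tool_python exposes, as they may differ from the keys printed here.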

Correcting spelling errors

LanguageTool can also apply its suggestions automatically using the correct() method. This returns a copy of the text in which each flagged issue has been replaced with its first suggested replacement. It doesn’t change the original string or the dataframe - it’s just showing us how it thinks the text should be corrected.

tool.correct(text)
'The portable Google Pixelbook Go laptop has been created with a battery that will last up to 12 hours, an Intel Core M3 processor, a 13.3” Full HD touch display, a 2MP webcam, along with microphones that offer improved noise cancellation and dual front-firing speakers for immersive sound. This device uses Google’s Chrome OS (Operating System), which means access to your favourite Android apps via the Google Play Store.'
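
To make it clearer what an automatic correction involves, here is a rough sketch of the general idea: replace each flagged span with its first suggestion, working from the end of the text backwards so earlier offsets stay valid. This is not the library’s own implementation. The function name apply_first_suggestions, the sample sentence and the two hand-written matches are all hypothetical, and the matches are plain dicts rather than real Match objects.

```python
def apply_first_suggestions(text, matches):
    """Replace each flagged span with its first suggested replacement."""
    # Work from the end of the string backwards so that earlier offsets
    # stay valid after a replacement changes the length of the text.
    for m in sorted(matches, key=lambda m: m["offset"], reverse=True):
        if not m["replacements"]:
            continue  # nothing suggested, so leave this span alone
        start = m["offset"]
        end = start + m["errorLength"]
        text = text[:start] + m["replacements"][0] + text[end:]
    return text


# Hand-written sample data for illustration, not real LanguageTool output.
sample = "a 2MP web cam and a note book"
sample_matches = [
    {"offset": 6, "errorLength": 7, "replacements": ["webcam"]},
    {"offset": 20, "errorLength": 9, "replacements": ["notebook"]},
]

corrected = apply_first_suggestions(sample, sample_matches)
print(corrected)
```

Because the first suggestion is applied without review, it’s worth reading the output of check() before accepting corrections in bulk, particularly for brand names and product codes.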

To apply the recommended corrections back to our parent dataframe, we can use Pandas to overwrite the content by assigning the output of correct() to the content cell of that row. Use a single loc indexer for this, because chained indexing such as df.iloc[0]['content'] = ... assigns to a temporary copy and may leave the dataframe unchanged. Re-running check() shows us we have no remaining issues for that product.

df.loc[0, 'content'] = tool.correct(text)
matches = tool.check(df.loc[0, 'content'])
matches
[]
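
The same check can be run across every product description rather than one row at a time. Here is a minimal sketch, assuming a hypothetical helper called issues_per_row that accepts any iterable of strings and any checking function. The stub_check function is a made-up stand-in for tool.check so that the example runs without LanguageTool.

```python
def issues_per_row(texts, check):
    """Return one list of issues per text, in the same order as the input."""
    return [check(text) for text in texts]


def stub_check(text):
    # Hypothetical stand-in for tool.check: flags only the phrase "web cam".
    return ["web cam"] if "web cam" in text else []


descriptions = [
    "A laptop with a 2MP web cam.",
    "A laptop with a 14” Full HD touch display.",
]

results = issues_per_row(descriptions, stub_check)
counts = [len(found) for found in results]
print(counts)
```

With the objects used in this article, you could pass df['content'] and tool.check instead, store the result in a new column, and then filter for rows where the list is not empty. The public API is rate limited, so for a large catalogue the local server is likely to be the better option.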

Matt Clarke, Saturday, March 13, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
