Ecommerce copywriters are busy people and, unlike magazine writers, rarely have eagle-eyed sub-editors to check their copy for spelling mistakes or grammatical issues. As a result, most online retailers will have the odd typo here and there.
While spelling and grammar issues may go unnoticed by most people, they can make your site look unprofessional, especially if there are lots of them. However, they’re also really hard to find, as you’d need to proofread or spell check every page.
Thankfully, there’s a data science technique you can utilise to automate the process of proofreading and spell checking your product page content for you. It can even correct any mistakes found. Here’s how it works.
For this project we’ll be using a Python package called language_tool_python. This is a wrapper for the Java-based LanguageTool application that powers the spell check feature in OpenOffice. You can install it via pip.
!pip3 install language_tool_python
By default, the language_tool_python package will download the LanguageTool server .jar file and run it on your machine. However, there’s also a public API you can use if you don’t want to download and run the server yourself.
We only need pandas for manipulating our text data, and the language_tool_python package. After importing these, you’ll need to define the tool you use. I’m using the public API rather than the Java application below, and I’ve set the language to en-GB.
import pandas as pd
import language_tool_python
tool = language_tool_python.LanguageToolPublicAPI('en-GB')
Next, load up a Pandas dataframe containing the product content from your product detail pages. I’ve grabbed a tiny selection from the John Lewis website via web scraping to use for test purposes and have loaded the product descriptions in a column called content.
df = pd.read_csv('products.csv')
df.head()
| | content |
|---|---|
| 0 | The portable Google Pixelbook Go laptop has be... |
| 1 | The ASUS Chromebook has a 14” Full HD touch di... |
| 2 | The Lenovo S345 Chromebook laptop uses Google’... |
| 3 | The Acer 314 laptop has been created with a ba... |
First, we’ll grab one of the product descriptions from our dataframe using iloc to select the row at index 0, and we’ll assign the value of its content column to a variable called text.
text = df.iloc[0]['content']
text
'The portable Google Pixelbook Go laptop has been created with a battery that will
last up to 12 hours, an Intel Core M3 processor, a 13.3” Full HD touch display, a 2MP
web cam, along with microphones that offer improved noise cancellation and dual
front-firing speakers for immersive sound. This device uses Google’s Chrome OS
(Operating System), which means access to your favourite Android apps via the
Google Play Store.'
Next we’ll use the check() method and pass in our text variable containing the product description. This returns a list of Match objects containing details of the issues found, their exact positions within the text, and some recommended replacements.
matches = tool.check(text)
matches
[Match({'ruleId': 'EN_COMPOUNDS', 'message': 'This word is normally spelled as one.',
'replacements': ['webcam'], 'offsetInContext': 43, 'context': '...r, a 13.3” Full HD touch
display, a 2MP web cam, along with microphones that offer impr...', 'offset': 168, 'errorLength':
7, 'category': 'MISC', 'ruleIssueType': 'misspelling', 'sentence': 'The portable Google Pixelbook
Go laptop has been created with a battery that will last up to 12 hours, an Intel Core M3
processor, a 13.3” Full HD touch display, a 2MP web cam, along with microphones that offer
improved noise cancellation and dual front-firing speakers for immersive sound.'})]
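To make the offset and errorLength fields concrete, here is a minimal sketch that applies a suggested replacement by hand using plain string slicing. The helper name apply_replacement is hypothetical (it is not part of language_tool_python), and the sample string is a shortened version of the description above, so the offset is looked up rather than hard-coded.

```python
def apply_replacement(text, offset, length, replacement):
    """Return text with the span [offset, offset + length) swapped for replacement."""
    return text[:offset] + replacement + text[offset + length:]


# Shortened sample; in the full description above the span starts at offset 168.
sample = "a 13.3” Full HD touch display, a 2MP web cam, along with microphones"
offset = sample.index("web cam")   # stand-in for the match's 'offset' value
length = len("web cam")            # stand-in for the match's 'errorLength' value (7)

fixed = apply_replacement(sample, offset, length, "webcam")
print(fixed)
```

With a real match, the call would look something like apply_replacement(text, match.offset, match.errorLength, match.replacements[0]), assuming the attribute names follow the keys shown in the output above. If you apply several matches this way, work from the last match to the first so that earlier offsets stay valid.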
We can loop through each match in the list using a for loop. This identifies that “web cam” has been written as two words, rather than as the single word “webcam”.
for match in matches:
    print(match)
Offset 168, length 7, Rule ID: EN_COMPOUNDS
Message: This word is normally spelled as one.
Suggestion: webcam
...r, a 13.3” Full HD touch display, a 2MP web cam, along with microphones that offer impr...
^^^^^^^
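Printing each match is fine for one description, but for reporting it can help to flatten the matches into plain rows. The sketch below is one way to do that. The function name summarise_matches is hypothetical, and the attribute names (ruleId, ruleIssueType, errorLength and so on) are assumed to mirror the keys in the output shown earlier; other releases of the package may name them differently. A stand-in object is used so the example runs without calling LanguageTool.

```python
from types import SimpleNamespace


def summarise_matches(matches):
    """Turn match objects into plain dicts that are easy to tabulate or filter."""
    rows = []
    for match in matches:
        rows.append({
            "rule_id": match.ruleId,
            "issue_type": match.ruleIssueType,
            "message": match.message,
            # A match may come with no suggestions, so fall back to None.
            "suggestion": match.replacements[0] if match.replacements else None,
            "offset": match.offset,
            "length": match.errorLength,
        })
    return rows


# Stand-in mirroring the fields of the match printed above (an assumption,
# used here so the sketch does not need a LanguageTool server).
example = SimpleNamespace(
    ruleId="EN_COMPOUNDS",
    ruleIssueType="misspelling",
    message="This word is normally spelled as one.",
    replacements=["webcam"],
    offset=168,
    errorLength=7,
)

rows = summarise_matches([example])
print(rows[0]["rule_id"], "->", rows[0]["suggestion"])
```

Passing the result to pd.DataFrame(...) should give one row per issue, which can then be filtered by issue_type or rule_id.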
LanguageTool also allows you to automatically correct any spelling errors found. You can do this using the correct() method. This doesn’t modify the original text; it returns a corrected copy showing how LanguageTool thinks the text should read.
tool.correct(text)
'The portable Google Pixelbook Go laptop has been created with a battery that will last up to 12 hours, an Intel Core M3 processor, a 13.3” Full HD touch display, a 2MP webcam, along with microphones that offer improved noise cancellation and dual front-firing speakers for immersive sound. This device uses Google’s Chrome OS (Operating System), which means access to your favourite Android apps via the Google Play Store.'
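Since the corrected text is only a suggestion at this point, it can be worth reviewing what changed before saving anything. The following is a rough sketch using Python’s standard difflib module to list the word-level differences between the original and corrected strings. The function name changed_phrases is hypothetical, and the two sample strings are shortened stand-ins for the text and the output of correct().

```python
import difflib


def changed_phrases(original, corrected):
    """Return (before, after) pairs for each run of words that differs."""
    before_words = original.split()
    after_words = corrected.split()
    # autojunk=False stops common words being ignored in long descriptions.
    matcher = difflib.SequenceMatcher(a=before_words, b=after_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(before_words[i1:i2]), " ".join(after_words[j1:j2])))
    return changes


original = "a 2MP web cam, along with microphones"
corrected = "a 2MP webcam, along with microphones"
print(changed_phrases(original, corrected))
```

Because the comparison splits on whitespace, punctuation stays attached to the neighbouring word, which is usually good enough for a quick manual review.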
To apply the recommended spelling corrections back to our parent dataframe, we can use Pandas to overwrite the content by assigning the output of correct() back to the correct row. Re-running check() shows us we have no spelling errors for that product.
# Assign with .loc in one step; chained indexing may write to a temporary copy
df.loc[0, 'content'] = tool.correct(text)
matches = tool.check(df.loc[0, 'content'])
matches
[]
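The same idea extends from one row to every description. Below is a minimal sketch, not the only way to do it: the helper correct_texts is hypothetical and takes the correction function as an argument, so tool.correct can be passed in. It caches results because duplicate descriptions are common in product data and each call to the public API is a network request. A stand-in correction function is used so the example runs without LanguageTool.

```python
def correct_texts(texts, correct):
    """Apply a correction function to each text, reusing results for duplicates."""
    cache = {}
    corrected = []
    for text in texts:
        if text not in cache:
            cache[text] = correct(text)
        corrected.append(cache[text])
    return corrected


# Stand-in for tool.correct (an assumption for illustration only).
def fake_correct(text):
    return text.replace("web cam", "webcam")


descriptions = ["a 2MP web cam", "a 14” touch display", "a 2MP web cam"]
print(correct_texts(descriptions, fake_correct))
```

With the dataframe from earlier, this could be used as df['content'] = correct_texts(df['content'], tool.correct), assuming the column has no missing values. For small datasets, df['content'].apply(tool.correct) does the same job without the cache.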
Matt Clarke, Saturday, March 13, 2021