How to identify near-duplicate content using LMS

Learn how to detect near-duplicate content using the Longest Matching Subsequence (LMS) technique in Python and boost your SEO performance.


In ecommerce businesses where relatively few products are launched and each has a long lifecycle, copywriters tend to be targeted on producing unique content that sells the benefits of the product, answers customers’ questions, and is more likely to rank in organic search results. Quality is typically considered better than quantity.

Ecommerce directors generally despise the use of cut-and-pasted product copy taken from supplier websites, and want writers to produce bespoke copy that is on-brand, tailored for the business, and designed to help improve conversion. Having the same copy on multiple pages is generally considered poor form and can be viewed as a sign of laziness or poor management.

However, on faster moving sites with shorter product lifecycles, the focus and KPIs often shift towards quantity over quality, and bespoke, on-brand copy often gets dropped because of the extra time it takes to produce. Unfortunately, this creates duplicate or near-duplicate content, which can be an issue for search engine optimisation.

How to identify near-duplicate content

Near-duplicate content comes in a few forms, but it generally comprises content that is very similar to that of another page on the same site. A related form can also happen when the content is very similar to that on external sites, which is a particular issue when ecommerce copywriters just cut-and-paste supplier product copy.

Even if we confine our search for near-duplicates to our own site, identifying them is still fairly tricky, since you need a similarity metric that compares the content of each page to the content of every other page.

There are actually a wide range of techniques you can use to measure content similarity, such as shingling, MinHash, Levenshtein distance, and cosine similarity. I’ve covered quite a few in my post on product matching models, where they’re heavily utilised.
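To give a flavour of the alternatives, here’s a minimal sketch of shingling combined with Jaccard similarity in plain Python. The shingle size of 5 is an arbitrary choice for illustration.

def shingle_similarity(a, b, k=5):
    """Jaccard similarity between the k-character shingle sets of two strings."""
    shingles_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    shingles_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    # Jaccard similarity: size of the intersection over size of the union
    return len(shingles_a & shingles_b) / len(shingles_a | shingles_b)

shingle_similarity('red fish tank', 'blue fish tank')  # ≈ 0.46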

However, in this project we’ll be using LMS, or the Longest Matching Subsequence approach. As the name suggests, this finds the longest string of text that is identical between two documents, so it’s ideal for spotting those cases where a copywriter has cut-and-pasted content between pages instead of rewriting from scratch. Here’s how to use it.

Load the packages

First, open up a Jupyter notebook and import pandas and difflib. The difflib module is built into Python’s standard library and provides a range of classes and functions for comparing files or sequences. We’ll specifically be using its SequenceMatcher class.

import pandas as pd
import difflib
from difflib import SequenceMatcher

Load your data

Next, load up your product descriptions into a Pandas dataframe. I’ve used a bunch of genuine product descriptions here. Two of them are near-duplicates because the two size variants within a product range have been split over two pages, while the third is a related product, but is separate and different from the other two.

df = pd.read_csv('aquariums.csv')
df['description_length'] = df['long_description'].str.len().astype(int)
df.head()
id title long_description description_length
0 1 Product 1 \nProduct Information for Product 1 [redacted].... 1491
1 2 Product 2 Product Information for Product 2 [redacted]... 1500
2 3 Product 3 Product Information for Product 3 [redacted] 1869

Split off the data for testing

For demonstration purposes, we’ll assign the product description in each row of the dataframe to its own variable so we can use them for string similarity testing. One product is distinct and two are variants of the same product in different sizes, with near-duplicate content.

product1 = df['long_description'][0]
product2 = df['long_description'][1]
product3 = df['long_description'][2]

Using the difflib SequenceMatcher

The difflib module has loads of features, but we’re only interested in the SequenceMatcher class for now, which we can use to find the longest matching sequence. This takes up to four arguments: isjunk (which is usually set to None), string a, string b, and an optional argument called autojunk, which takes a boolean True or False value.

The autojunk argument controls a heuristic that automatically treats certain items as junk by counting how many times each one appears in the sequence. If an item accounts for more than 1% of the sequence and the sequence is at least 200 characters long, the item is considered “popular” and gets ignored during matching. In product descriptions of this length that threshold is easily crossed by spaces and common letters, so large chunks of ordinary prose can be skipped, and it can make a big difference to the result.

difflib.SequenceMatcher(None, product1, product2)
<difflib.SequenceMatcher at 0x7f4d41b50550>

SequenceMatcher() returns an object that we need to parse using the find_longest_match() method. This takes four arguments: a 0 representing the start of string a, an integer representing the last character in string a (which we can calculate using len()), and the same two values for string b. With autojunk=True we get back a size of 335 for the two near-duplicates, but setting this to False returns a size of 958.

s = SequenceMatcher(None, product1, product2, autojunk=True)
s.find_longest_match(0, len(product1), 
                     0, len(product2))
Match(a=1, b=0, size=335)
s = SequenceMatcher(None, product1, product2, autojunk=False)
s.find_longest_match(0, len(product1), 
                     0, len(product2))
Match(a=337, b=336, size=958)

For the descriptions that aren’t near-duplicates, we get back a size of just 6 with autojunk=True and 24 with autojunk=False, so setting this to False seems the more useful option for our purposes. Given that the two near-duplicate descriptions are 1491 and 1500 characters long, their 958-character match suggests they’re extremely similar, while these much smaller matches indicate genuinely distinct copy.

s = SequenceMatcher(None, product1, product3, autojunk=True)
s.find_longest_match(0, len(product1), 
                     0, len(product3))
Match(a=156, b=561, size=6)
s = SequenceMatcher(None, product1, product3, autojunk=False)
s.find_longest_match(0, len(product1), 
                     0, len(product3))
Match(a=809, b=1083, size=24)
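To put the raw match sizes into context, you can express the longest match as a proportion of each description’s length. As an aside, SequenceMatcher also provides a ratio() method, which returns an overall similarity score between 0 and 1 and makes a handy cross-check.

print(958 / 1491)  # ≈ 0.64
print(958 / 1500)  # ≈ 0.64

# ratio() returns a float between 0 and 1 measuring overall similarity
s = SequenceMatcher(None, product1, product2, autojunk=False)
s.ratio()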

Extracting the longest matching sequence

The find_longest_match() method returns a named tuple in which the last value, element [2], contains the size of the longest matching subsequence, so we can assign its output to a variable and then select that element.

s = SequenceMatcher(None, product2, product3, autojunk=False)
result = s.find_longest_match(0, len(product2), 
                              0, len(product3))
result[2]
24
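Because the return value is a named tuple, the same figure is also available via the size attribute, which reads a little more clearly:

result.size
24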

Finding near-duplicates across a dataframe

This is where things get inefficient remarkably quickly. To use this approach to identify near-duplicates in an entire dataframe, we’d need to examine one product description at a time and compare it to each of the other products in the dataframe, so the number of comparisons grows quadratically with the size of the catalogue.

Obviously, that dataframe could be filtered to include similar items, such as others from the same category, subcategory, or brand, rather than absolutely every product.
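For example, a pre-filter might look like the line below. This is hypothetical: it assumes a category column, which our small sample dataset doesn’t include.

# Hypothetical: only compare products within the same category
df_subset = df[df['category'] == 'fish tanks']

With or without a pre-filter, the comparison function itself looks like this: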

def find_near_duplicates(df, target):
    """Compare a target description to every description in the dataframe
    and return the longest matching subsequence for each product."""

    results = []

    for index, row in df.iterrows():

        s = SequenceMatcher(None, row['long_description'], target, autojunk=False)
        match = s.find_longest_match(0, len(row['long_description']), 0, len(target))

        # Collect the rows in a list and build the dataframe once at the end
        # (DataFrame.append() was removed in pandas 2.0)
        results.append({
            'title': row['title'],
            'longest_matching_subsequence': match.size,
            'identical': (match.size / int(row['description_length'])) * 100,
            'description_length': int(row['description_length']),
        })

    df_output = pd.DataFrame(results)
    return df_output.sort_values(by='longest_matching_subsequence', ascending=False)

Running the function on a bunch of product pages from the site returns some interesting data. The top two SKUs are 64% identical, with the longest matching substring coming in at 958 characters, so this is clearly a cut-and-paste job and will cause keyword cannibalisation that likely harms rankings.

Similarly, although the products at index 3 and 4 are not identical to the target product, they do appear to share identical content. Given that this site has products with long lifecycles, this is clearly an issue that needs to be fixed.

output = find_near_duplicates(df, product1)
output.head()
title longest_matching_subsequence identical description_length
0 Product 1 1491 100.000000 1491.0
1 Product 2 958 63.866667 1500.0
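The function above compares a single target against the rest of the dataframe. To check every description against every other one in a single pass, a minimal sketch using itertools.combinations could look like this; the quadratic cost means it’s only practical on a filtered subset of a large catalogue.

from itertools import combinations

pairs = []
for (i, row_a), (j, row_b) in combinations(df.iterrows(), 2):
    s = SequenceMatcher(None, row_a['long_description'],
                        row_b['long_description'], autojunk=False)
    match = s.find_longest_match(0, len(row_a['long_description']),
                                 0, len(row_b['long_description']))
    pairs.append((row_a['title'], row_b['title'], match.size))

pd.DataFrame(pairs, columns=['title_a', 'title_b', 'longest_matching_subsequence'])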

Matt Clarke, Sunday, March 14, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science in his work. Matt has a Master’s degree in Internet Retailing (plus two other Master’s degrees in different fields) and specialises in the technical side of ecommerce and marketing.