How to identify visually similar images using hashing

Learn how to use image hashing or image fingerprinting to find visually similar images or duplicate images via hashes and the Hamming distance metric.

Picture by Ben Weber, Unsplash.

Image hashing (or image fingerprinting) is a technique that is used to convert an image to an alphanumeric string. While this might sound somewhat pointless, it actually has a number of practical applications and it can make a useful feature in certain machine learning models.

Image hashing has two main uses: it lets you detect identical or visually similar images, and it lets you store the image fingerprint so you don’t need to reload each image to check it.

It’s commonly applied to the detection of copyright infringement in images and video stills, by matching the image hashes of a copyright owner’s content against those found on other websites. However, it can also power content-based image retrieval (or CBIR) queries in reverse image search engines, product matching algorithms, and other applications.

I recently applied image hashing to create a feature in a model for identifying potentially fraudulent listings on eBay Motors. I’d spotted that some fraudsters were stealing images from legitimate sellers and posting fraudulent listings for the same cars. Detecting when the same images were posted by different sellers gave the model a feature that helped it separate fraudulent listings from legitimate ones.

How does image hashing work?

Hashing functions convert images to short alphanumeric strings that represent the uniqueness of the image. As the hashes are small and text-based strings, like e7643c330f0f0f0f, they can be stored without taking up lots of room and they can be searched and compared at speed.

However, unlike the commonly used MD5 or SHA hashes we apply to text strings, image hashes are designed to handle images that have been resized, rotated, scaled, recoloured, or have had noise or watermarks added. So while MD5 or SHA would give you a completely different hash for each altered copy, image hashing gives you similar or identical hashes for images that have been manipulated slightly.
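To see why cryptographic hashes can’t do this job, note their avalanche effect: changing a single byte of the input produces a completely different digest. A quick sketch using Python’s built-in hashlib, on stand-in byte strings rather than real image files, illustrates this:

```python
import hashlib

# Stand-in byte strings representing two almost identical files; with
# a real image, recompressing or resizing it changes the bytes too.
original = b"image bytes..."
tweaked = b"image bytes..!"  # a single-byte difference

# The two MD5 digests share no meaningful similarity, so a
# cryptographic hash can only find byte-for-byte identical files.
print(hashlib.md5(original).hexdigest())
print(hashlib.md5(tweaked).hexdigest())
```

An image hash, by contrast, is built from the picture’s visual content, so small manipulations move the hash only slightly.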

There are a number of different image hashing functions you can use for detecting duplicate or similar images, including average hashing (aHash), perceptual hashing (pHash), difference hashing (dHash), Haar wavelet hashing (wHash), and HSV color hashing.

They work in slightly different ways, but most of them convert the image to greyscale first, as removing the colours makes the process quicker and means that images with adjusted colours won’t throw the algorithm off. After that they’ll rescale the image to a very small size, typically ignoring the aspect ratio, and then calculate the hash from the pixels in the simplified image grid.
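The average hash (aHash) step described above can be sketched in a few lines of pure Python. This is a simplified illustration, not the imagehash implementation: it assumes the greyscale conversion and 8×8 resize have already been done, so the input is just a small grid of pixel brightness values.

```python
def average_hash_sketch(pixels):
    """Simplified average hash (aHash) of an 8x8 grid of greyscale
    pixel values (0-255). The real imagehash library uses Pillow to
    handle the greyscale conversion and resizing first."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)  # average brightness of the grid

    # Each pixel contributes one bit: 1 if brighter than the mean.
    bits = ''.join('1' if p > mean else '0' for p in flat)

    # Pack the 64 bits into a 16-character hex string.
    return f'{int(bits, 2):016x}'

# A toy "image": a bright top half over a dark bottom half.
grid = [[200] * 8] * 4 + [[50] * 8] * 4
print(average_hash_sketch(grid))  # ffffffff00000000
```

Because the hash depends only on which pixels sit above the average brightness, uniform changes to colour, scale or compression leave it largely unchanged.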

Load the packages

For this project we’re using five packages. The Requests package is used for fetching remote images, the Pillow (PIL) package is used for opening the image files, the ImageHash package is used for calculating image hashes, Pandas is used for displaying the results in dataframes, and iPyPlot is used for displaying the sample images.

import requests
from PIL import Image
import imagehash
import pandas as pd
import ipyplot

Define the images to hash

Next, create a Python list of image URLs to hash. For demonstration purposes, I’ve included a selection of Land Rover Defender images from an eBay listing. Most of these are different but visually similar, while a couple are the same image served at different sizes.

images = ['https://i.ebayimg.com/images/g/vOUAAOSwVHle64yO/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/jN8AAOSwxMle64yY/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/3p8AAOSwk2Je64ym/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/qqYAAOSweNle64zN/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/cnkAAOSw~n9e64za/s-l1600.jpg',
         'https://i.ebayimg.com/images/g/3p8AAOSwk2Je64ym/s-l64.jpg',
         'https://i.ebayimg.com/images/g/qqYAAOSweNle64zN/s-l64.jpg']
ipyplot.plot_images(images)

Create the image hashes

To create the image hashes and assess the performance of the different hashing algorithms, we’ll open each of the images in the list and hash the image using the average hash, perceptual hash, difference hash, Haar wavelet hash, and HSV color hash algorithms. We’ll store the results in a Pandas dataframe and print it out.

rows = []

for url in images:
    file = Image.open(requests.get(url, stream=True).raw)

    rows.append({
        'image': url,
        'ahash': imagehash.average_hash(file),
        'phash': imagehash.phash(file),
        'dhash': imagehash.dhash(file),
        'whash': imagehash.whash(file),
        'colorhash': imagehash.colorhash(file),
    })

df = pd.DataFrame(rows, columns=['image', 'ahash', 'phash', 'dhash', 'whash', 'colorhash'])

If you check out the results below, you’ll see that images 2 and 5 have identical hashes across the average, perceptual, difference, and Haar wavelet hashing algorithms, which is expected because they’re the same image served at different sizes. For images 3 and 6, however, the rescaling caused every hash except the average hash to differ very slightly.

df.head(10)
ahash phash dhash whash colorhash
0 333e9f8981c3c3e3 ac9ec7216b61b076 e7643c330f0f0f0f 033e9f9981c3c3e3 072c0040000
1 0f0783ce0f8b8b05 ba052a9d13f87f21 fcef273c6a3b3359 0f17039f0f8f8f05 0e400008002
2 fe86c38181c1c3c7 f4959690dbc385a5 f436060337978f0e fe86c381c1c1c3e7 06280040001
3 2700017303030f9f af41943dc186ad3d d6cac3c3c36f7f3f 7702037323838fbf 06080009000
4 78053f3f839383c3 add3406cd3cdcd21 c18d73721f27379e 380d3f3f839383c3 0e2c0001000
5 fe86c38181c1c3c7 f4959690dbc385a5 f436060337978f0e fe86c381c1c1c3e7 06280080000
6 2700017303030f9f af41943dc086ad7d d6dac3c3d36f7f3f 3706037323838fbf 06280000000

Running the Pandas duplicated() function will show us when each image has been duplicated within the dataframe (though it doesn’t flag the first occurrence). Image 5 was a rescaled version of image 2, so both share the aHash fe86c38181c1c3c7, while image 6 was a rescaled version of image 3, so both share the hash 2700017303030f9f.

df.ahash.duplicated()
0    False
1    False
2    False
3    False
4    False
5     True
6     True
Name: ahash, dtype: bool

Perceptual hashing gave an identical hash of f4959690dbc385a5 for images 5 and 2, but images 3 and 6 had slightly different hashes, with af41943dc186ad3d and af41943dc086ad7d respectively.

df.phash.duplicated()
0    False
1    False
2    False
3    False
4    False
5     True
6    False
Name: phash, dtype: bool

Difference hashing performed similarly to perceptual hashing, with a match on images 5 and 2, but slightly different hashes for images 3 and 6, with d6cac3c3c36f7f3f and d6dac3c3d36f7f3f respectively.

df.dhash.duplicated()
0    False
1    False
2    False
3    False
4    False
5     True
6    False
Name: dhash, dtype: bool

The Haar wavelet hash also performed similarly to the perceptual and difference hashing algorithms, giving an identical hash on images 5 and 2 but slightly different hashes for images 3 and 6.

df.whash.duplicated()
0    False
1    False
2    False
3    False
4    False
5     True
6    False
Name: whash, dtype: bool

The colourhash algorithm didn’t detect any duplications in the images, but did return similar hashes for the rescaled images.

df.colorhash.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
Name: colorhash, dtype: bool

Identifying duplicate images

Now we’ll create a function to identify whether a new image is a duplicate of another image already in our list. The function takes the URL of a new image, calculates its hash, and compares it to the hashes we’ve already calculated to see if it duplicates an existing image. To do this we’ll load the remote image, create an average hash, then compare that hash to the average hashes stored for the known images. Running the function on an image that’s already in our dataset returns True.

def find_duplicate_images(df, ahash_column, image_url):
    """Determine whether a new image is a duplicate of 
    an existing image using average hashing.
    
    :param df: Pandas dataframe containing image hashes
    :param ahash_column: Name of column containing average hashes
    :param image_url: URL of new image
    
    :return: True if the image is a duplicate, or False if unique
    """
    
    file = Image.open(requests.get(image_url, stream=True).raw)
    ahash = imagehash.average_hash(file)

    matches = df[ahash_column].astype(str).eq(str(ahash))

    return bool(matches.any())
find_duplicate_images(df, 'ahash', 'https://i.ebayimg.com/images/g/vOUAAOSwVHle64yO/s-l1600.jpg')
True

Identifying visually similar images

We can use a similar approach to create a reverse image search algorithm that finds visually similar images. We’ll make a function which uses the Hamming distance to do this. The Hamming distance is a metric for comparing two strings of equal length: it counts the number of positions at which they differ, effectively letting us see how similar one image is to the others in the dataset.
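The function below uses the third-party distance package, but for hashes stored as equal-length hex strings a character-level Hamming distance is simple to sketch in pure Python (the hash values here are taken from the results table above):

```python
def hamming(a, b):
    """Character-level Hamming distance between two equal-length
    hash strings: the number of positions at which they differ.
    This mirrors distance.hamming applied to hex strings."""
    if len(a) != len(b):
        raise ValueError('Hashes must be the same length')
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# Identical hashes (images 2 and 5 in the results) score zero.
print(hamming('fe86c38181c1c3c7', 'fe86c38181c1c3c7'))  # 0
# Very different images differ at almost every position.
print(hamming('fe86c38181c1c3c7', '2700017303030f9f'))  # 16
```

Note that this compares hex characters rather than the underlying 64 bits; a bit-level distance, as the imagehash library computes when you subtract one hash object from another, is finer-grained, but the character-level score is enough to rank images by similarity.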

By passing in the URL of an unseen image, we can calculate its average hash or ahash, convert it to a string, and then use a lambda function to calculate its Hamming distance from each of the previously seen images in our dataframe. By sorting the values in the dataframe, we get a list of images ranked by their visual similarity to the new image. A score of 0 means the images match exactly.

import distance

def find_similar_images(df, ahash_column, image_url):
    """Compare an unseen image to previously seen images and return
    a list of images ranked by their similarity according to the 
    Hamming distance of their average hash or ahash.
    
    :param df: Pandas dataframe containing image and ahash columns
    :param ahash_column: Name of ahash column
    :param image_url: URL of the unseen image to hash and compare
   
    :return
        Pandas dataframe containing the most similar images
    """
    
    file = Image.open(requests.get(image_url, stream=True).raw)
    ahash = str(imagehash.average_hash(file))
        
    df['hamming_distance'] = df[ahash_column].astype(str).apply(
        lambda x: distance.hamming(x, ahash))

    df = df[['image', 'ahash', 'hamming_distance']].sort_values(
        by='hamming_distance', ascending=True)
    
    return df
df = find_similar_images(df, 'ahash', 'https://i.ebayimg.com/images/g/3p8AAOSwk2Je64ym/s-l1600.jpg')
df.head(10)
image ahash hamming_distance
2 https://i.ebayimg.com/images/g/3p8AAOSwk2Je64y... fe86c38181c1c3c7 0
5 https://i.ebayimg.com/images/g/3p8AAOSwk2Je64y... fe86c38181c1c3c7 0
0 https://i.ebayimg.com/images/g/vOUAAOSwVHle64y... 333e9f8981c3c3e3 10
4 https://i.ebayimg.com/images/g/cnkAAOSw~n9e64z... 78053f3f839383c3 13
1 https://i.ebayimg.com/images/g/jN8AAOSwxMle64y... 0f0783ce0f8b8b05 15
3 https://i.ebayimg.com/images/g/qqYAAOSweNle64z... 2700017303030f9f 16
6 https://i.ebayimg.com/images/g/qqYAAOSweNle64z... 2700017303030f9f 16

By using the image hashing approach we can store a unique fingerprint for each of the images in our database, then identify identical or visually similar images by comparing the hash of a new image with the hashes we’ve already calculated. The hashes are small, quick to search, and the technique is really effective.
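Since each fingerprint is just a short string, one practical storage pattern is a dictionary keyed by hash, which gives constant-time exact-duplicate checks without scanning a dataframe. A minimal sketch, using aHash values from the results above and hypothetical filenames:

```python
# Stored aHash values mapped to (hypothetical) image identifiers;
# in practice these would come from your image database.
known_hashes = {
    '333e9f8981c3c3e3': 'image_0.jpg',
    'fe86c38181c1c3c7': 'image_2.jpg',
    '2700017303030f9f': 'image_3.jpg',
}

def is_duplicate(new_hash, known):
    """Constant-time membership test: an exact aHash match means the
    new image is almost certainly a duplicate of a stored one."""
    return new_hash in known

print(is_duplicate('fe86c38181c1c3c7', known_hashes))  # True
print(is_duplicate('78053f3f839383c3', known_hashes))  # False
```

For near-duplicates you’d still fall back to a Hamming distance comparison, but the exact-match lookup cheaply catches rescaled copies whenever the algorithm, like aHash here, produces identical hashes for them.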

Matt Clarke, Friday, March 05, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.