How to create image datasets for machine learning models

Learn how to create image datasets for machine learning image classification models using Python and web scraping.

Aglais io, by Charles J Sharp, Creative Commons.

While many models are now pre-trained to identify certain objects, in most cases you will need to train them further on your own classes. This requires an image classification dataset: a set of labeled images, grouped by class, that resemble the images your final model will be asked to classify.

When constructing an image classifier dataset, you’ll usually create a list of search terms representing your target labels, run an image search for each term on Google or Bing, and scrape the first batch of results. Here’s how you can quickly construct an image classifier dataset using Python.

Install your packages

While you can write a scraper using Scrapy, Selenium or Beautiful Soup, there are already pre-built Python packages on PyPI that save you time and stop you reinventing the wheel. I’ve used Bing Image Downloader, which you can install from the Python Package Index.

pip3 install bing-image-downloader

Create the download function

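Next, we’ll create a Python script and import the downloader module from the bing_image_downloader package. Then we’ll create a function called get_images(), which takes a single query term and downloads the first 50 results into a directory named after that term. The function doesn’t return anything; the images are written to disk.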

from bing_image_downloader import downloader

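def get_images(query):
    """Download the first 50 Bing image results for a query.

    Images are saved to a sub-directory of 'data' named after the query.

    :param query: Search term to look up on Bing
    :return: None
    """
    print(query)

    downloader.download(
        query,
        limit=50,                # maximum number of images to download
        output_dir='data',       # images are saved under data/<query>/
        adult_filter_off=False,  # keep Bing's adult content filter switched on
        force_replace=False,     # don't delete an existing directory for this query
        timeout=60,              # seconds to wait on each image request
    )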

Prepare a list of images to find

The classifier I’m building identifies British butterfly species from their images, so my list of target search terms comprises the scientific names of all British butterfly species. Take your chosen search terms and put them into a list.

butterflies = [
'Aglais io',
'Aglais urticae',
'Anthocharis cardamines',
'Apatura iris',
'Aphantopus hyperantus',
'Argynnis paphia',
'Aricia agestis',
'Aricia artaxerxes',
'Boloria euphrosyne',
'Boloria selene',
'Callophrys rubi',
'Carterocephalus palaemon',
'Celastrina argiolus',
'Coenonympha pamphilus',
'Coenonympha tullia',
'Colias croceus',
'Cupido minimus',
'Danaus plexippus',
'Erebia aethiops',
'Erebia epiphron',
'Erynnis tages',
'Euphydryas aurinia',
'Fabriciana adippe',
'Favonius quercus',
'Gonepteryx rhamni',
'Hamearis lucina',
'Hesperia comma',
'Hipparchia semele',
'Lasiommata megera',
'Leptidea juvernica',
'Leptidea sinapis',
'Limenitis camilla',
'Lycaena phlaeas',
'Maniola jurtina',
'Melanargia galathea',
'Melitaea athalia',
'Melitaea cinxia',
'Nymphalis polychloros',
'Ochlodes sylvanus',
'Papilio machaon',
'Pararge aegeria',
'Phengaris arion',
'Pieris brassicae',
'Pieris napi',
'Pieris rapae',
'Plebejus argus',
'Polygonia c-album',
'Polyommatus bellargus',
'Polyommatus coridon',
'Polyommatus icarus',
'Pyrgus malvae',
'Pyronia tithonus',
'Satyrium pruni',
'Satyrium w-album',
'Speyeria aglaja',
'Thecla betulae',
'Thymelicus acteon',
'Thymelicus lineola',
'Thymelicus sylvestris',
'Vanessa atalanta',
'Vanessa cardui']
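
Since each search term also becomes a directory name, it can be worth tidying the list before you use it. The snippet below is a minimal sketch of that idea, not part of Bing Image Downloader: clean_terms() is a hypothetical helper that collapses stray whitespace and drops blank entries and case-insensitive duplicates while keeping the original order.

```python
def clean_terms(terms):
    """Return the terms with whitespace normalised and duplicates removed.

    Duplicates are detected case-insensitively and the first spelling wins.
    """
    seen = set()
    cleaned = []
    for term in terms:
        # Collapse runs of whitespace and trim both ends.
        normalised = " ".join(term.split())
        key = normalised.casefold()
        if normalised and key not in seen:
            seen.add(key)
            cleaned.append(normalised)
    return cleaned


example = ['Aglais io', ' Aglais urticae ', 'aglais io', '']
print(clean_terms(example))  # ['Aglais io', 'Aglais urticae']
```

You could then run butterflies = clean_terms(butterflies) before starting the download.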

Scrape the images

Finally, all we need to do is loop over the list of butterflies and call get_images() with each species name as the search term. For each species, the function searches Bing, downloads the first 50 matches, and stores them in a directory bearing the name of the species inside the data directory.

for butterfly in butterflies:
    print('Fetching images of', butterfly)
    get_images(butterfly)
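
With a long list of search terms, a single failed query or a dropped connection can stop the whole run. The snippet below is a rough sketch of one way to make the loop resumable, assuming the images for each term end up in data/<term>/ as described above. fetch_all() is a hypothetical helper, not part of the package. It takes the download function as an argument, skips any term whose directory already contains files, and records errors instead of stopping.

```python
from pathlib import Path


def fetch_all(terms, fetch, output_dir='data'):
    """Call fetch(term) for every term that has no downloaded files yet.

    Returns a tuple (skipped, failed): the list of terms that were skipped
    and a dict mapping each failed term to its error message.
    """
    skipped = []
    failed = {}
    for term in terms:
        target = Path(output_dir) / term
        # Assumption: a non-empty directory means this term was already fetched.
        if target.is_dir() and any(target.iterdir()):
            skipped.append(term)
            continue
        try:
            fetch(term)
        except Exception as error:
            # Record the problem and carry on with the next term.
            failed[term] = str(error)
    return skipped, failed
```

You could call it as skipped, failed = fetch_all(butterflies, get_images) and then retry whatever is in failed. Note that a directory left half-filled by an interrupted download would be skipped under this assumption, so you may prefer to check for a minimum number of files instead.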

Once you’ve got your images, I’d recommend going through them to double-check they’re a good match for the search term used. Once you’re happy with them, your next steps would be to scale them, augment them, and split them into the training and test datasets so they’re ready for your model.
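
As a starting point for the splitting step, here is a minimal sketch that uses only the standard library. It assumes the layout described above, with one directory per species inside data. The split_dataset() function, the 20% test fraction, and the destination directory name are all illustrative choices of mine, so adjust them to suit your model. The function copies files rather than moving them, so the originals are left untouched.

```python
import random
import shutil
from pathlib import Path


def split_dataset(source_dir, dest_dir, test_fraction=0.2, seed=42):
    """Copy source_dir/<label>/* into dest_dir/train/<label>/ and dest_dir/test/<label>/.

    Returns a dict mapping each label to its train and test file counts.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    counts = {}
    for label_dir in sorted(Path(source_dir).iterdir()):
        if not label_dir.is_dir():
            continue
        files = sorted(p for p in label_dir.iterdir() if p.is_file())
        rng.shuffle(files)
        n_test = int(len(files) * test_fraction)
        for index, path in enumerate(files):
            # The first n_test shuffled files go to test, the rest to train.
            subset = 'test' if index < n_test else 'train'
            target = Path(dest_dir) / subset / label_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target / path.name)
        counts[label_dir.name] = {'train': len(files) - n_test, 'test': n_test}
    return counts
```

For example, split_dataset('data', 'dataset') would create dataset/train and dataset/test, each with one sub-directory per species. Keep the destination outside the source directory, otherwise it will be treated as another label on the next run.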

Matt Clarke, Thursday, March 04, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.