Very large datasets are often compressed on servers to save storage space and bandwidth, and to let end users download them more quickly. Python includes some really useful modules for decompressing or unzipping compressed files, so you can access the files within. Here's how it's done.
First, open a new script or Jupyter notebook and import the glob and zipfile modules. The glob module gets its name from the Unix glob pattern-matching utility and is used for matching file paths, while the zipfile module handles reading and extracting zip archives.
import glob
import zipfile
Next, use the glob() function in glob to check the directory containing your downloaded files. I've added *.zip so glob will search for all files in the data directory that have a .zip suffix. This will return a list containing the paths to all the zip files found.
files = glob.glob('data/*.zip')
files
['data/BasicCompanyData-part1.zip',
'data/BasicCompanyData-part2.zip',
'data/BasicCompanyData-part4.zip',
'data/BasicCompanyData-part3.zip',
'data/BasicCompanyData-part5.zip',
'data/BasicCompanyData-part6.zip']
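Note that glob() returns matches in arbitrary filesystem order, which is why part4 appears before part3 in the output above. If the processing order matters, wrapping the call in sorted() is an easy fix. A minimal, self-contained sketch (the scratch directory and file names here are illustrative, not the article's data directory):

```python
import glob
import os
import tempfile

# Create a scratch directory with files named out of order to
# demonstrate sorting (illustrative names, not the article's data)
tmp = tempfile.mkdtemp()
for name in ('part2.zip', 'part1.zip'):
    open(os.path.join(tmp, name), 'w').close()

# sorted() gives a stable, alphabetical ordering regardless of the
# order the filesystem returns the matches in
files = sorted(glob.glob(os.path.join(tmp, '*.zip')))
print([os.path.basename(f) for f in files])  # part1 before part2
```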
Finally, create a for loop and iterate over each file in the files list returned by glob. Then, use the ZipFile() function to open each archive and the extractall() function to decompress it, saving the contents to a directory called data/raw.
for file in files:
    print('Unzipping:', file)
    with zipfile.ZipFile(file, 'r') as zip_ref:
        zip_ref.extractall('data/raw')
Unzipping: data/BasicCompanyData-part1.zip
Unzipping: data/BasicCompanyData-part2.zip
Unzipping: data/BasicCompanyData-part4.zip
Unzipping: data/BasicCompanyData-part3.zip
Unzipping: data/BasicCompanyData-part5.zip
Unzipping: data/BasicCompanyData-part6.zip
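If you want to check an archive before unzipping it, for example to confirm it really is a valid zip and to see what it contains, the zipfile module's is_zipfile() and namelist() helpers cover both. A minimal, self-contained sketch (the archive and member names here are illustrative, built in a scratch directory rather than taken from the article's data):

```python
import os
import tempfile
import zipfile

# Build a small zip in a scratch directory so the example is
# self-contained (file names are illustrative)
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, 'example.zip')
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('report.csv', 'id,value\n1,10\n')

# Confirm the file really is a zip, then peek inside before extracting
assert zipfile.is_zipfile(archive)
with zipfile.ZipFile(archive, 'r') as zf:
    names = zf.namelist()                 # list members without extracting
    zf.extractall(os.path.join(tmp, 'raw'))

print(names)  # ['report.csv']
```

Checking is_zipfile() first is a cheap way to skip partially downloaded or corrupt files before extractall() raises an exception halfway through a batch.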
Matt Clarke, Friday, March 12, 2021