The Pandas package is one of the main reasons why so many data scientists favour Python over Microsoft Excel. Pandas is incredibly powerful and versatile and can handle a wide range of file formats without the limitations that can cripple Excel.
As you might imagine, Pandas also lets you export data stored in a Pandas dataframe to a wide range of different file formats and systems. We’ll be covering how to export Pandas data to CSV, TSV, JSON, HTML, XML, Feather, Parquet, Stata, LaTeX, Pickle, and the clipboard in this project. However, you can also export to SQL and Google Sheets with some additional packages.
To get started we’ll first import the Pandas package and then import data into a dataframe that we can export to a variety of different file formats using Pandas’ export functions.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/marketing_mix.csv')
df.head()
The to_csv() function allows you to export or save the contents of a Pandas dataframe by writing the output to a CSV, or comma separated value, file. CSV files can be read by virtually any application, so they’re convenient and widely used. By default, you only need to pass a single argument to to_csv(): a string containing the filename or filepath, for example, data1.csv. When executed, Pandas will write the data to that file.
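The export call itself can be sketched as follows; a small stand-in dataframe is used here so the example is self-contained (the article’s own data is loaded from marketing_mix.csv above, and the column names here are just illustrative):

```python
import pandas as pd

# Small stand-in dataframe mirroring a couple of the marketing columns
df_demo = pd.DataFrame({'week': [0, 1, 2], 'sales': [22.1, 10.4, 9.3]})

# Write the dataframe to a CSV file; by default the index is written too
df_demo.to_csv('data1.csv')
```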
To understand why this default option may not be what you want, let’s load the CSV file we just created using
read_csv() to read it back into a Pandas dataframe. As you’ll see, there’s an extra column called Unnamed: 0 in the dataframe. This is because to_csv() writes the dataframe’s index to the file by default, and read_csv() reads it back in as an unnamed column.
df_data1 = pd.read_csv('data1.csv')
df_data1.head()
If you want to keep the index that’s added by default, you can pass a value to the index_label argument when calling to_csv(). Instead of the index column coming back as Unnamed: 0, it will use the name you’ve defined.
df.to_csv('data1.csv', index_label='row')  # 'row' is an example index name
df_data1 = pd.read_csv('data1.csv')
df_data1.head()
If you don’t want the index to be added, you can simply pass in the
index=False argument. This tends to be the standard approach for most dataframes.
df.to_csv('data2.csv', index=False)
df_data2 = pd.read_csv('data2.csv')
df_data2.head()
Despite the name, the
to_csv() function also lets you export Pandas dataframes to similar character-separated data formats, such as tab separated value (TSV) or pipe separated value (PSV) files used by some legacy platforms.
To do this, you simply need to define the field delimiter or separator character by passing it as a string to the
sep argument. For example,
sep='|' will create a pipe separated value file, and
sep='\t' will create a tab separated value or TSV file. By default, the
to_csv() function uses
sep=',', so you don’t need to enter anything if you want a CSV file back.
df.to_csv('data3.tsv', index=False, sep='\t')
Note that if you attempt to reload a file that uses a separator other than the default comma, the
read_csv() function won’t parse the dataframe correctly. Here’s what happens when we load the TSV file we created above using the default settings:
df_data3 = pd.read_csv('data3.tsv')
df_data3.head()
To load a TSV file in Pandas you simply need to define the separator character using
sep='\t' in the read_csv() function:
df_data3 = pd.read_csv('data3.tsv', sep='\t')
df_data3.head()
You can export a Pandas dataframe to JSON using the to_json() function. Like
to_csv(), you simply append the function name to your dataframe name using a dot and define the output filename as the argument. The
to_json() function also includes a range of other arguments that allow you to control the date format, index, compression, and orientation, and it can output regular JSON or JSON Lines.
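A minimal sketch, again using a small stand-in dataframe so it runs on its own (the filename and orient value are just illustrative):

```python
import pandas as pd

# Small stand-in dataframe for illustration
df_demo = pd.DataFrame({'week': [0, 1, 2], 'sales': [22.1, 10.4, 9.3]})

# orient='records' writes a list of row objects, a common format for
# web applications; the default orient for a dataframe is 'columns'
df_demo.to_json('data.json', orient='records')

# Read it back to confirm the round trip
df_json = pd.read_json('data.json')
```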
One really useful feature, if you write blog posts or want to export Pandas data to a web application, is the ability to export a Pandas dataframe to an HTML table. You can do this via the to_html() function.
When appended to the name of your Pandas dataframe, the
to_html() function returns a chunk of HTML defining the table layout for the Pandas dataframe, so you can copy it to your web page or insert it into a blog post. The
to_html() function also includes a range of extra arguments that you can use to style the table, such as emboldening values, adding class names, and justifying the content. It’s really powerful.
html = df.head().to_html()
'<table border="1" class="dataframe">\n <thead>\n <tr style="text-align: right;">\n <th></th>\n <th>week</th>\n <th>tv</th>\n <th>radio</th>\n <th>newspaper</th>\n <th>sales</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>230.1</td>\n <td>37.8</td>\n <td>69.2</td>\n <td>22.1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>44.5</td>\n <td>39.3</td>\n <td>45.1</td>\n <td>10.4</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>17.2</td>\n <td>45.9</td>\n <td>69.3</td>\n <td>9.3</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>151.5</td>\n <td>41.3</td>\n <td>58.5</td>\n <td>18.5</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>180.8</td>\n <td>10.8</td>\n <td>58.4</td>\n <td>12.9</td>\n </tr>\n </tbody>\n</table>'
If you want to export a Pandas dataframe to the clipboard, so you can paste it into a document, you can use the
to_clipboard() function. The
to_clipboard() function works out of the box on Windows and Mac OS, but you’ll need to install either xclip or xsel with the PyQt4 modules to run this on Linux. Once you’ve run the code the dataframe will be stored in your clipboard and you can paste it into another document, such as a Google Sheet, using Ctrl + v.
If you’re running Pandas version 1.3.0 or greater, you can now also export a Pandas dataframe to XML using the
to_xml() function. As with the other export functions, the first argument accepts a filename or filepath to which to write the data, e.g.
/data/file.xml. It also includes a load of additional arguments that can be used to modify the XML exported.
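A minimal sketch using a small stand-in dataframe; note that the default parser requires the lxml package, so this example passes parser='etree' to use Python’s standard library instead:

```python
import pandas as pd

# Small stand-in dataframe for illustration
df_demo = pd.DataFrame({'week': [0, 1], 'sales': [22.1, 10.4]})

# parser='etree' uses the built-in ElementTree parser; the default
# (parser='lxml') requires the lxml package to be installed
df_demo.to_xml('data.xml', parser='etree', index=False)
```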
The to_parquet() function allows you to export a Pandas dataframe to a Parquet file. In case you haven’t heard of it, Parquet is an open source column-oriented data file format designed for efficient data storage and retrieval.
The Parquet file format is supported by a range of enterprise data platforms, such as Hadoop, Pig, Spark, and Hive. Parquet files have the
.parquet suffix. By default you just need to provide a filepath or filename, e.g.
data.parquet. However, you can also define the Parquet engine used with
engine, control the index using
index, define partition columns with
partition_cols, and define storage options with storage_options.
The to_stata() function allows you to export a Pandas dataframe to a Stata file. Stata files have a
.dta file suffix and are a proprietary binary format used by the Stata statistics application from StataCorp.
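A minimal sketch of a Stata round trip, using a small stand-in dataframe (the filename is illustrative):

```python
import pandas as pd

# Small stand-in dataframe for illustration
df_demo = pd.DataFrame({'week': [0, 1, 2], 'sales': [22.1, 10.4, 9.3]})

# write_index=False skips writing the dataframe index to the .dta file
df_demo.to_stata('data.dta', write_index=False)

# Read it back with read_stata()
df_stata = pd.read_stata('data.dta')
```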
The to_feather() function lets you export a Pandas dataframe to a Feather file. Feather files are used by Apache Arrow for storing Arrow tables and use a
.feather (or sometimes .fea) file suffix. There are fewer additional arguments for
to_feather(); you mostly just need to define the file path.
If you’re one of those odd people who still uses LaTeX for typesetting, you can export the contents of your dataframe to LaTeX using the
to_latex() function. As with the other Pandas export functions, you only need to define a filepath or filename by default, but there are loads of other arguments you can use to control the LaTeX output generated. LaTeX files have a .tex suffix.
The to_pickle() function lets you export a Pandas dataframe to a Pickle file. Pickle is a Python module that uses object serialization to store data in a structured form. Data is serialized (or pickled) for storage, then deserialized (or unpickled) on retrieval. Many people prefer it over saving to CSV because it preserves dtypes and the index exactly. Pickle files usually have a
.pkl suffix, but
.db is also sometimes used.
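A minimal sketch of the pickle round trip, using a small stand-in dataframe:

```python
import pandas as pd

# Small stand-in dataframe for illustration
df_demo = pd.DataFrame({'week': [0, 1, 2], 'sales': [22.1, 10.4, 9.3]})

# Pickling serializes the dataframe object itself, so dtypes and
# the index survive the round trip exactly
df_demo.to_pickle('data.pkl')

# Read it back with read_pickle()
df_pickle = pd.read_pickle('data.pkl')
```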
Matt Clarke, Thursday, August 18, 2022