Pandas allows you to import data from a wide range of data sources directly into a dataframe. These can be static files, such as CSV, TSV, fixed width files, Microsoft Excel, JSON, SAS and SPSS files, as well as a range of popular databases, such as MySQL, PostgreSQL and Google BigQuery. You can even scrape data directly from web pages into Pandas dataframes.
When importing data into Pandas dataframes, you can also save time and write less code by defining which columns to import, renaming the columns, setting their data types, defining the index, and many other things. Here are some handy tips to help you.
To get started, create a new Jupyter notebook and load the Pandas library. You'll also need the Datetime package. When importing the Pandas package the convention is to use the command `import pandas as pd`, which allows you to call Pandas functions by prefixing them with `pd.` instead of `pandas.`.
import pandas as pd
import datetime as dt
Comma Separated Value or CSV files are likely to be the file format you encounter most commonly in data science. As the name suggests, these are simple text files in which the values are separated (usually) by commas. The data looks like this:
order,sku,quantity,price
1234,ABC1,10,9.99
1235,ABC2,3,19.99
1236,ABC3,2,29.99
To import a CSV file and put the contents into a Pandas dataframe we use the `read_csv()` function, which we call on the `pd` object we created when we imported Pandas. The `read_csv()` function can take several arguments, but by default you just need to provide the path to the file you wish to read. Here, `pd.read_csv('data.csv')` will read in the file called `data.csv` from the current working directory. If the file was stored at `/home/matt/data/data.csv` you would just replace the path.
df = pd.read_csv('data.csv')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 1234 | ABC1 | 10 | 9.99 |
1 | 1235 | ABC2 | 3 | 19.99 |
2 | 1236 | ABC3 | 2 | 29.99 |
You can also use `read_csv()` to read remote CSV files. Instead of passing in the path to a local file, you provide the full URL to the CSV file. Here, I'm loading a CSV file from my GitHub account.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head()
longitude | latitude | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
Tab separated value (TSV) files are just like CSVs, but the values are separated by a tab instead of a comma. They can be read using the same `read_csv()` function; you just need to specify the separator character used via the `sep` argument, which for tabs is `\t`.
order sku quantity price
3843 BCG1 3 9.99
3738 HGX2 1 73.92
3827 HBX3 7 19.87
df = pd.read_csv('data.tsv', sep='\t')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 3843 | BCG1 | 3 | 9.99 |
1 | 3738 | HGX2 | 1 | 73.92 |
2 | 3827 | HBX3 | 7 | 19.87 |
Pandas includes built-in functionality for reading Microsoft Excel spreadsheet files via its `read_excel()` function. This works in the same way as `read_csv()`, so it can be used on local Excel documents as well as remote files. However, as Excel files are a bit more bloated than CSV files, reading them can be a bit slower.
df = pd.read_excel('data.xlsx')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 1234 | ABC1 | 10 | 9.99 |
1 | 1235 | ABC2 | 3 | 19.99 |
2 | 1236 | ABC3 | 2 | 29.99 |
To avoid the speed issues when handling Excel files in Pandas, I think it's beneficial to read the file into Pandas as an Excel document once and then write it back out as a CSV file to change the document format. To save the Excel data to a CSV you can use the `to_csv()` function. Adding `index=False` ensures that a redundant column called `Unnamed: 0` isn't added. The new CSV file can then be read back in.
df.to_csv('excel.csv', index=False)
df = pd.read_csv('excel.csv')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 3843 | BCG1 | 3 | 9.99 |
1 | 3738 | HGX2 | 1 | 73.92 |
2 | 3827 | HBX3 | 7 | 19.87 |
One other handy feature of Pandas is the `read_html()` function. This allows you to parse HTML markup from remote web pages or local HTML documents and extract any tables present. In the example below I've extracted an HTML table from a Wikipedia page. The `read_html()` function returns any tables it finds in a list, so if more than one is present, you'll need to select which one to display in your dataframe using its list index, which starts from zero.
data = pd.read_html('https://en.wikipedia.org/wiki/Epyc')
data[0]
0 | 1 | |
---|---|---|
0 | NaN | NaN |
1 | General information | General information |
2 | Launched | June 2017 |
3 | Marketed by | AMD |
4 | Designed by | AMD |
5 | Common manufacturer(s) | GlobalFoundries (14 nm)TSMC (7 nm) |
6 | Performance | Performance |
7 | Max. CPU clock rate | 2.7 GHz to 3.9 GHz |
8 | Architecture and classification | Architecture and classification |
9 | Min. feature size | 14 nm to 7 nm |
10 | Microarchitecture | ZenZen 2 |
11 | Instruction set | x86-64MMX(+), SSE1, SSE2, SSE3, SSSE3, SSE4a, ... |
12 | Physical specifications | Physical specifications |
13 | Cores | up to 128 cores/256 threads on dual-socket sys... |
14 | Socket(s) | SP3 |
15 | Products, models, variants | Products, models, variants |
16 | Core name(s) | NaplesRome |
17 | Brand name(s) | Epyc |
18 | History | History |
19 | Predecessor | Opteron |
Yaml (or yml) files are used in a variety of systems, including the Jekyll static website builder I used to create this site. To read a Yaml file into Pandas you can use the `yaml.load()` function from the PyYAML package along with the Pandas `json_normalize()` function.
import pandas as pd
import yaml
with open('courses.yml', 'r') as f:
    df = pd.json_normalize(yaml.load(f, Loader=yaml.FullLoader))
df.head()
Pandas allows you to read in data from a wide range of other file types and data sources; the table below lists some of them. Some of these, such as SQL and JSON, are regularly used but much more complex, so I'll cover them separately.
Function | Purpose |
---|---|
read_csv() | Reads local or remote CSV (comma separated value) files. |
read_excel() | Reads local or remote Microsoft Excel spreadsheet files, and OpenDocument spreadsheets via engine='odf'. |
read_clipboard() | Reads the local clipboard. |
read_html() | Reads tables from local or remote HTML files or web pages. |
read_fwf() | Reads local or remote fixed width text files. |
read_hdf() | Reads HDFStore HDF5 PyTables files. |
read_feather() | Reads Apache Arrow Feather format files. |
read_parquet() | Reads Apache Parquet columnar format files. |
read_orc() | Reads Optimized Row Columnar (ORC) format files. |
read_msgpack() | Reads MessagePack format files (removed in Pandas 1.0). |
read_stata() | Reads files from the Stata statistics software package. |
read_sas() | Reads files from the SAS statistics software package. |
read_spss() | Reads files from the SPSS statistics software package. |
read_pickle() | Reads files in the Python Pickle format. |
read_sql() | Reads data in a variety of SQL dialects via SQLAlchemy. |
read_gbq() | Reads data from Google BigQuery. |
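Fixed width files appear in the table above but aren't demonstrated elsewhere in this post, so here's a minimal sketch of `read_fwf()`. The example data is made up, and I'm reading from an in-memory buffer for illustration; the `widths` argument gives the character width of each column, though `read_fwf()` can also infer the column boundaries if you omit it.

```python
import io
import pandas as pd

# A small fixed width file: each column occupies a fixed number of characters
fwf_data = (
    "order sku   quantity\n"
    "1234  ABC1  10\n"
    "1235  ABC2  3\n"
)

# widths=[6, 6, 8] maps to the order, sku and quantity columns
df = pd.read_fwf(io.StringIO(fwf_data), widths=[6, 6, 8])
print(df)
```

In real use you'd pass a file path instead of the `io.StringIO` buffer, just as with `read_csv()`.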
If your data set includes hundreds of columns and you only need a specific subset of them, you can use the `usecols` argument to define the list of column names to import. This saves the hassle of importing all of the columns and then dropping the ones you don't need.
df = pd.read_csv('data.csv', usecols=['order','sku'])
df.head()
order | sku | |
---|---|---|
0 | 1234 | ABC1 |
1 | 1235 | ABC2 |
2 | 1236 | ABC3 |
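The `usecols` argument also accepts a callable, which is handy when you want every column except a few. A quick sketch, using an in-memory CSV for illustration:

```python
import io
import pandas as pd

csv_data = "order,sku,quantity,price\n1234,ABC1,10,9.99\n1235,ABC2,3,19.99\n"

# Keep every column except price: the callable is applied to each column name
df = pd.read_csv(io.StringIO(csv_data), usecols=lambda c: c != 'price')
print(df.columns.tolist())  # ['order', 'sku', 'quantity']
```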
Another common problem with importing third party data into Pandas is the column header names. While the Pandas `rename()` function lets you define new names for each column after you've imported the data, the quickest and neatest way to rename columns is to define the new names as you're importing the data. In the example below the CSV data includes the column names `Order`, `Sku`, `Quantity`, `Price` and `Date`. To rename the columns, we simply use `read_csv()` to load the file, pass a list of the new names to the `names` argument, and use `skiprows=1` to skip the first row of the file, which contains the old column names.
Order,Sku,Quantity,Price,Date
1234,ABC1,10.0,9.99,2020-11-01
1235,ABC2,3.0,19.99,2020-11-02
1236,ABC3,2.0,29.99,2020-11-03
df = pd.read_csv('data.csv', names=['order_id', 'code', 'quantity', 'price', 'date'], skiprows=1)
df.head()
order_id | code | quantity | price | date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | 2020-11-03 |
Another really common issue when dealing with data in Pandas is that the data you're importing isn't being identified with the correct data type for each column. For example, let's load up the `orders.csv` file below and use `df.dtypes` to examine the data types.
df = pd.read_csv('orders.csv')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | 2020-11-03 |
df.dtypes
order int64
sku object
quantity float64
price float64
order_date object
dtype: object
As you can see from the output of `df.dtypes`, the `quantity` field is being identified as a `float64` when it should be an `int64`, as you can't purchase part of a product. Rather than using `df['quantity'] = df['quantity'].astype(int)` after you've imported the data to change the data type, you can pass a dictionary of data type reassignments to the `dtype` argument of `read_csv()`.
df = pd.read_csv('orders.csv', dtype={"quantity": int})
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2 | 29.99 | 2020-11-03 |
df.dtypes
order int64
sku object
quantity int64
price float64
order_date object
dtype: object
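The `dtype` argument isn't limited to numeric types. For a repetitive text column like `sku`, one optional memory saving (a sketch with made-up data, not something the original file requires) is to import it as the `category` dtype, so each unique value is stored only once:

```python
import io
import pandas as pd

# Hypothetical order data with a low-cardinality sku column
csv_data = "order,sku\n1234,ABC1\n1235,ABC1\n1236,ABC2\n"

# Importing sku as categorical stores the codes plus one copy of each label
df = pd.read_csv(io.StringIO(csv_data), dtype={'sku': 'category'})
print(df['sku'].dtype)  # category
```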
The other issue with the data set above is that the `order_date` column isn't being parsed as a date. Instead, Pandas thinks it's an object. You can tell Pandas to parse specific columns as dates by passing a list of the affected date columns to the `parse_dates` argument. Here, we'll define `order_date` as the date column and then run `df.dtypes` to confirm Pandas recognised it as `datetime64[ns]`.
df = pd.read_csv('orders.csv', parse_dates=['order_date'])
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | 2020-11-03 |
df.dtypes
order int64
sku object
quantity float64
price float64
order_date datetime64[ns]
dtype: object
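If your dates are in a non-standard format that `parse_dates` doesn't handle cleanly, one robust option (a sketch with a hypothetical `dd/mm/yyyy` column, not data from the file above) is to import the column as a string and convert it yourself with `pd.to_datetime()` and an explicit format:

```python
import io
import pandas as pd

# Hypothetical CSV with dates in day/month/year format
csv_data = "order,order_date\n1234,01/11/2020\n1235,02/11/2020\n"

df = pd.read_csv(io.StringIO(csv_data))

# Convert the string column to datetime64[ns] using an explicit format string
df['order_date'] = pd.to_datetime(df['order_date'], format='%d/%m/%Y')
print(df.dtypes)
```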
Sometimes you may want to have one of your columns, such as the order ID, set as the index on your dataframe. Again, it's easy enough to do this after you've read the data, but it's much neater and quicker to do it during import, by passing a list of index columns to the `index_col` argument.
df = pd.read_csv('data.csv', index_col=['order'])
df.head()
sku | quantity | price | |
---|---|---|---|
order | |||
1234 | ABC1 | 10 | 9.99 |
1235 | ABC2 | 3 | 19.99 |
1236 | ABC3 | 2 | 29.99 |
When you export data from some platforms, numbers in the thousands, millions, and billions have their digit groups separated by commas, and each value gets wrapped in double quote marks to stop the commas breaking the structure of the CSV file. Since Pandas won't interpret those values as floats, you'll either need to strip the commas out of the numbers using `replace()` and then re-cast them to `float64`, or simply tell Pandas which thousands separator is used via the `thousands=','` argument.
order,sku,quantity,price,order_date
1234,ABC1,10.0,"1,999,999.99",2020-11-01
1235,ABC2,3.0,"11,999.99",2020-11-02
1236,ABC3,2.0,"2,999.99",2020-11-03
df = pd.read_csv('thousands.csv', thousands=',')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 1999999.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 11999.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 2999.99 | 2020-11-03 |
If you're dealing with massive datasets you may not always want to load the entire file. To restrict the number of rows that are read in, you can pass an integer representing the number of rows to the `nrows` argument of `read_csv()`.
df = pd.read_csv('thousands.csv', nrows=1)
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 1,999,999.99 | 2020-11-01 |
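Related to `nrows`, and useful for files too big to fit in memory at all, is the `chunksize` argument, which makes `read_csv()` return an iterator of smaller dataframes instead of one large one. A sketch, using a small in-memory CSV purely for illustration:

```python
import io
import pandas as pd

csv_data = "order,quantity\n1,10\n2,3\n3,2\n4,7\n"

# chunksize=2 yields dataframes of up to two rows at a time,
# so only one chunk needs to be held in memory while processing
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    total += chunk['quantity'].sum()

print(total)  # 22
```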
Depending on the data source, missing values in a data set can be shown in a variety of ways. By default, Pandas recognises the presence of certain common missing value identifiers and replaces them with `NaN`. These values are: `''`, `'#N/A'`, `'#N/A N/A'`, `'#NA'`, `'-1.#IND'`, `'-1.#QNAN'`, `'-NaN'`, `'-nan'`, `'1.#IND'`, `'1.#QNAN'`, `'<NA>'`, `'N/A'`, `'NA'`, `'NULL'`, `'NaN'`, `'n/a'`, `'nan'` and `'null'`. If missing values in your data set take some other form, you can explicitly tell Pandas to interpret them as `NaN` values. In the example below, missing values are represented by `###`.
df = pd.read_csv('missing.csv')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | ### | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | ### |
To tell Pandas to interpret `###` as `NaN` you simply pass `###` to the `na_values` argument. If you reload the dataframe you'll see that `###` has now been recognised as a `NaN` value.
df = pd.read_csv('missing.csv', na_values='###')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | NaN | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | NaN |
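The `na_values` argument also accepts a dictionary, so different columns can use different missing value markers. A sketch with hypothetical markers (`missing` and `###`) and an in-memory CSV:

```python
import io
import pandas as pd

csv_data = "sku,quantity,price\nABC1,missing,9.99\nABC2,3,###\n"

# Treat 'missing' as NaN only in quantity, and '###' only in price
df = pd.read_csv(
    io.StringIO(csv_data),
    na_values={'quantity': ['missing'], 'price': ['###']},
)
print(df)
```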
Sometimes when importing data into Pandas things do not go to plan and Pandas will throw an error. There are two main reasons for this, at least in the files I regularly deal with. Firstly, the file encoding may not be utf-8, which causes Pandas to throw an error stating `UnicodeDecodeError: 'utf-8' codec can't decode byte`. These errors are quite common when importing data from Google Analytics. You can usually resolve them by specifying the correct file encoding, i.e. `encoding='utf-16'`.
df = pd.read_csv('sessions.csv', encoding='utf-16', sep='\t')
The other common Pandas error I encounter is `ParserError: Error tokenizing data. C error`. This can occur for a few reasons. Sometimes it's because there are "bad lines" in the data which are incorrectly formatted, so you can use `error_bad_lines=False` to skip those lines and suppress the error. Using the wrong separator value can also cause this; you can fix it by either defining the right separator in the `sep` argument or by ignoring the bad lines.
df = pd.read_csv('data.csv', error_bad_lines=False)
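If you're not sure which separator a file uses at all, you can also ask Pandas to sniff it by passing `sep=None` together with the slower `engine='python'`. A sketch with a hypothetical semicolon separated file held in memory:

```python
import io
import pandas as pd

# Semicolon separated data, as exported by some European locales
data = "order;sku;quantity\n1234;ABC1;10\n1235;ABC2;3\n"

# sep=None tells the Python parsing engine to infer the delimiter
df = pd.read_csv(io.StringIO(data), sep=None, engine='python')
print(df.columns.tolist())  # ['order', 'sku', 'quantity']
```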
Matt Clarke, Saturday, March 06, 2021