Pandas allows you to import data from a wide range of data sources directly into a dataframe. These can be static files, such as CSV, TSV, fixed width files, Microsoft Excel, JSON, SAS and SPSS files, as well as a range of popular databases, such as MySQL, PostgreSQL and Google BigQuery. You can even scrape data directly from web pages into Pandas dataframes.
When importing data into Pandas dataframes, you can also save time and write less code by defining which columns to import, renaming the columns, setting their data types, defining the index, and many other things. Here are some handy tips to help you.
To get started, create a new Jupyter notebook and load the Pandas library. You'll also need the Datetime package. When importing the Pandas package the convention is to use the command `import pandas as pd`, which allows you to call Pandas functions by prefixing them with `pd.` instead of `pandas.`.
import pandas as pd
import datetime as dt
Comma Separated Value or CSV files are likely to be the file format you encounter most commonly in data science. As the name suggests, these are simple text files in which the values are separated (usually) by commas. The data looks like this:
order,sku,quantity,price
1234,ABC1,10,9.99
1235,ABC2,3,19.99
1236,ABC3,2,29.99
To import a CSV file and put the contents into a Pandas dataframe we use the `read_csv()` function, which we call on the `pd` object we created when we imported Pandas. The `read_csv()` function can take several arguments, but by default you just need to provide the path to the file you wish to read. Here, `pd.read_csv('data.csv')` will read in the file called `data.csv` from the current working directory. If the file was stored at `/home/matt/data/data.csv` you would just replace the path.
df = pd.read_csv('data.csv')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 1234 | ABC1 | 10 | 9.99 |
1 | 1235 | ABC2 | 3 | 19.99 |
2 | 1236 | ABC3 | 2 | 29.99 |
You can also use `read_csv()` to read remote CSV files. Instead of passing in the path to a local file, you provide the full URL to the CSV file. Here, I'm loading a CSV file from my GitHub account.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head()
longitude | latitude | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
Tab separated value (TSV) files are just like CSVs, but the values are separated by a tab instead of a comma. They can be read using the same `read_csv()` function; you just need to specify the separator character used via the `sep` argument, which for tabs is `\t`.
order sku quantity price
3843 BCG1 3 9.99
3738 HGX2 1 73.92
3827 HBX3 7 19.87
df = pd.read_csv('data.tsv', sep='\t')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 3843 | BCG1 | 3 | 9.99 |
1 | 3738 | HGX2 | 1 | 73.92 |
2 | 3827 | HBX3 | 7 | 19.87 |
Pandas includes built-in functionality for reading Microsoft Excel spreadsheet files via its `read_excel()` function. This works in the same way as `read_csv()`, so it can be used on local Excel documents as well as remote files. However, as Excel files are a bit more bloated than CSV files, reading them can be a bit slower.
df = pd.read_excel('data.xlsx')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 1234 | ABC1 | 10 | 9.99 |
1 | 1235 | ABC2 | 3 | 19.99 |
2 | 1236 | ABC3 | 2 | 29.99 |
To avoid the speed issues when handling Excel files in Pandas, I think it's beneficial to read the file into Pandas as an Excel document once and then write it back out as a CSV file to change the document format. To save the Excel data to a CSV you can use the `to_csv()` function. Adding `index=False` ensures that a redundant column called `Unnamed: 0` isn't added. The new CSV file can then be read back in.
df.to_csv('excel.csv', index=False)
df = pd.read_csv('excel.csv')
df.head()
order | sku | quantity | price | |
---|---|---|---|---|
0 | 3843 | BCG1 | 3 | 9.99 |
1 | 3738 | HGX2 | 1 | 73.92 |
2 | 3827 | HBX3 | 7 | 19.87 |
One other handy feature of Pandas is the `read_html()` function. This allows you to parse HTML markup from remote web pages or local HTML documents and extract any tables present. In the example below I've extracted an HTML table from a Wikipedia page. The `read_html()` function returns any tables it finds in a list, so if more than one is present, you'll need to select which one to display in your dataframe using its list index, which starts from zero.
data = pd.read_html('https://en.wikipedia.org/wiki/Epyc')
data[0]
0 | 1 | |
---|---|---|
0 | NaN | NaN |
1 | General information | General information |
2 | Launched | June 2017 |
3 | Marketed by | AMD |
4 | Designed by | AMD |
5 | Common manufacturer(s) | GlobalFoundries (14 nm)TSMC (7 nm) |
6 | Performance | Performance |
7 | Max. CPU clock rate | 2.7 GHz to 3.9 GHz |
8 | Architecture and classification | Architecture and classification |
9 | Min. feature size | 14 nm to 7 nm |
10 | Microarchitecture | ZenZen 2 |
11 | Instruction set | x86-64MMX(+), SSE1, SSE2, SSE3, SSSE3, SSE4a, ... |
12 | Physical specifications | Physical specifications |
13 | Cores | up to 128 cores/256 threads on dual-socket sys... |
14 | Socket(s) | SP3 |
15 | Products, models, variants | Products, models, variants |
16 | Core name(s) | NaplesRome |
17 | Brand name(s) | Epyc |
18 | History | History |
19 | Predecessor | Opteron |
Yaml (or yml) files are used in a variety of systems, including the Jekyll static website builder I used to create this site. To read a Yaml file into Pandas you can use the `yaml.load()` function from the PyYAML package along with the Pandas `json_normalize()` function.
import pandas as pd
import yaml
with open('courses.yml', 'r') as f:
    df = pd.json_normalize(yaml.load(f, Loader=yaml.FullLoader))
df.head()
Pandas allows you to read in data from a wide range of other file types and data sources; the table below lists some of them. Some of these, such as SQL and JSON, are regularly used but much more complex, so I'll cover them separately.
Function | Purpose |
---|---|
read_csv() | Reads local or remote CSV (comma separated value) files. |
read_excel() | Reads local or remote Microsoft Excel spreadsheet files, and OpenDocument spreadsheets via engine='odf'. |
read_clipboard() | Reads the local clipboard. |
read_html() | Reads tables from local or remote HTML files or web pages. |
read_fwf() | Reads local or remote fixed width text files. |
read_hdf() | Reads HDFStore HDF5 PyTables files. |
read_feather() | Reads Apache Arrow Feather format files. |
read_parquet() | Reads Apache Parquet columnar format files. |
read_orc() | Reads Optimized Row Columnar (ORC) format files. |
read_msgpack() | Reads MessagePack format files (removed in Pandas 1.0). |
read_stata() | Reads files from the Stata statistics software package. |
read_sas() | Reads files from the SAS statistics software package. |
read_spss() | Reads files from the SPSS statistics software package. |
read_pickle() | Reads files in the Python Pickle format. |
read_sql() | Reads data in a variety of SQL dialects via SQLAlchemy. |
read_gbq() | Reads data from Google BigQuery. |
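Fixed width files appear in the table above but aren't demonstrated elsewhere in this post, so here's a minimal sketch of `read_fwf()`. The example data is made up, and I'm reading from an in-memory buffer for illustration; the `widths` argument gives the character width of each column, though `read_fwf()` can also infer the column boundaries if you omit it.

```python
import io
import pandas as pd

# A small fixed width file: each column occupies a fixed number of characters
fwf_data = (
    "order sku   quantity\n"
    "1234  ABC1  10\n"
    "1235  ABC2  3\n"
)

# widths=[6, 6, 8] maps to the order, sku and quantity columns
df = pd.read_fwf(io.StringIO(fwf_data), widths=[6, 6, 8])
print(df)
```

In real use you'd pass a file path instead of the `io.StringIO` buffer, just as with `read_csv()`.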
If your data set includes hundreds of columns and you only need a specific subset of them, you can use the `usecols` argument to define the list of column names to import. This saves the hassle of importing all of the columns and then dropping the ones you don't need.
df = pd.read_csv('data.csv', usecols=['order','sku'])
df.head()
order | sku | |
---|---|---|
0 | 1234 | ABC1 |
1 | 1235 | ABC2 |
2 | 1236 | ABC3 |
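The `usecols` argument also accepts a callable, which is handy when you want every column except a few. A quick sketch, using an in-memory CSV for illustration:

```python
import io
import pandas as pd

csv_data = "order,sku,quantity,price\n1234,ABC1,10,9.99\n1235,ABC2,3,19.99\n"

# Keep every column except price: the callable is applied to each column name
df = pd.read_csv(io.StringIO(csv_data), usecols=lambda c: c != 'price')
print(df.columns.tolist())  # ['order', 'sku', 'quantity']
```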
Another common problem with importing third party data into Pandas is the column header names. While the Pandas `rename()` function lets you define new names for each column after you've imported the data, the quickest and neatest way to rename columns is to define the new names as you're importing the data. In the example below the CSV data includes the column names `Order`, `Sku`, `Quantity`, `Price` and `Date`. To rename the columns, we simply use `read_csv()` to load the file, pass a list of the new names to the `names` argument, and use `skiprows=1` to skip the first row of the file, which contains the old column names.
Order,Sku,Quantity,Price,Date
1234,ABC1,10.0,9.99,2020-11-01
1235,ABC2,3.0,19.99,2020-11-02
1236,ABC3,2.0,29.99,2020-11-03
df = pd.read_csv('data.csv', names=['order_id', 'code', 'quantity', 'price', 'date'], skiprows=1)
df.head()
order_id | code | quantity | price | date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | 2020-11-03 |
Another really common issue when dealing with data in Pandas is that the data you're importing isn't being identified with the correct data type for each column. For example, let's load up the `orders.csv` file below and use `df.dtypes` to examine the data types.
df = pd.read_csv('orders.csv')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | 2020-11-03 |
df.dtypes
order int64
sku object
quantity float64
price float64
order_date object
dtype: object
As you can see from the output of `df.dtypes`, the `quantity` field is being identified as a `float64` when it should be an `int64`, as you can't purchase part of a product. Rather than using `df['quantity'] = df['quantity'].astype(int)` after you've imported the data to change the data type, you can pass a dictionary of data type reassignments to the `dtype` argument of `read_csv()`.
df = pd.read_csv('orders.csv', dtype={"quantity": int})
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2 | 29.99 | 2020-11-03 |
df.dtypes
order int64
sku object
quantity int64
price float64
order_date object
dtype: object
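The `dtype` argument isn't limited to numeric types. For a repetitive text column like `sku`, one optional memory saving (a sketch with made-up data, not something the original file requires) is to import it as the `category` dtype, so each unique value is stored only once:

```python
import io
import pandas as pd

# Hypothetical order data with a low-cardinality sku column
csv_data = "order,sku\n1234,ABC1\n1235,ABC1\n1236,ABC2\n"

# Importing sku as categorical stores the codes plus one copy of each label
df = pd.read_csv(io.StringIO(csv_data), dtype={'sku': 'category'})
print(df['sku'].dtype)  # category
```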
The other issue with the data set above is that the `order_date` column isn't being parsed as a date. Instead, Pandas thinks it's an object. You can tell Pandas to parse specific columns as dates by passing a list of the affected date columns to the `parse_dates` argument. Here, we'll define `order_date` as the date column and then run `df.dtypes` to confirm Pandas recognised it as `datetime64[ns]`.
df = pd.read_csv('orders.csv', parse_dates=['order_date'])
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 9.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | 2020-11-03 |
df.dtypes
order int64
sku object
quantity float64
price float64
order_date datetime64[ns]
dtype: object
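If your dates are in a non-standard format that `parse_dates` doesn't handle cleanly, one robust option (a sketch with a hypothetical `dd/mm/yyyy` column, not data from the file above) is to import the column as a string and convert it yourself with `pd.to_datetime()` and an explicit format:

```python
import io
import pandas as pd

# Hypothetical CSV with dates in day/month/year format
csv_data = "order,order_date\n1234,01/11/2020\n1235,02/11/2020\n"

df = pd.read_csv(io.StringIO(csv_data))

# Convert the string column to datetime64[ns] using an explicit format string
df['order_date'] = pd.to_datetime(df['order_date'], format='%d/%m/%Y')
print(df.dtypes)
```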
Sometimes you may want to have one of your columns, such as the order ID, set as the index on your dataframe. Again, it's easy enough to do this after you've read the data, but it's much neater and quicker to do it during import, by passing a list of index columns to the `index_col` argument.
df = pd.read_csv('data.csv', index_col=['order'])
df.head()
sku | quantity | price | |
---|---|---|---|
order | |||
1234 | ABC1 | 10 | 9.99 |
1235 | ABC2 | 3 | 19.99 |
1236 | ABC3 | 2 | 29.99 |
When you export data from some platforms, numbers in the thousands, millions, and billions have their digit groups separated by commas, and each value gets wrapped in double quote marks to stop the commas breaking the structure of the CSV file. Since Pandas won't interpret those values as floats, you'll either need to strip the commas out of the numbers using `replace()` and then re-cast them to `float64`, or simply tell Pandas which thousands separator is used via the `thousands=','` argument.
order,sku,quantity,price,order_date
1234,ABC1,10.0,"1,999,999.99",2020-11-01
1235,ABC2,3.0,"11,999.99",2020-11-02
1236,ABC3,2.0,"2,999.99",2020-11-03
df = pd.read_csv('thousands.csv', thousands=',')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 1999999.99 | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 11999.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 2999.99 | 2020-11-03 |
If you're dealing with massive datasets you may not always want to load the entire file. To restrict the number of rows that are read in, you can pass an integer representing the number of rows to the `nrows` argument of `read_csv()`.
df = pd.read_csv('thousands.csv', nrows=1)
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | 1,999,999.99 | 2020-11-01 |
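Related to `nrows`, and useful for files too big to fit in memory at all, is the `chunksize` argument, which makes `read_csv()` return an iterator of smaller dataframes instead of one large one. A sketch, using a small in-memory CSV purely for illustration:

```python
import io
import pandas as pd

csv_data = "order,quantity\n1,10\n2,3\n3,2\n4,7\n"

# chunksize=2 yields dataframes of up to two rows at a time,
# so only one chunk needs to be held in memory while processing
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    total += chunk['quantity'].sum()

print(total)  # 22
```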
Depending on the data source, missing values in a data set can be shown in a variety of ways. By default, Pandas recognises the presence of certain common missing value identifiers and replaces them with `NaN`. These values are: `''`, `'#N/A'`, `'#N/A N/A'`, `'#NA'`, `'-1.#IND'`, `'-1.#QNAN'`, `'-NaN'`, `'-nan'`, `'1.#IND'`, `'1.#QNAN'`, `'<NA>'`, `'N/A'`, `'NA'`, `'NULL'`, `'NaN'`, `'n/a'`, `'nan'` and `'null'`. If missing values in your data set take some other form, you can explicitly tell Pandas to interpret them as `NaN` values. In the example below, missing values are represented by `###`.
df = pd.read_csv('missing.csv')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | ### | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | ### |
To tell Pandas to interpret `###` as `NaN` you simply pass `###` to the `na_values` argument. If you reload the dataframe you'll see that `###` has now been recognised as a `NaN` value.
df = pd.read_csv('missing.csv', na_values='###')
df.head()
order | sku | quantity | price | order_date | |
---|---|---|---|---|---|
0 | 1234 | ABC1 | 10.0 | NaN | 2020-11-01 |
1 | 1235 | ABC2 | 3.0 | 19.99 | 2020-11-02 |
2 | 1236 | ABC3 | 2.0 | 29.99 | NaN |
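The `na_values` argument also accepts a dictionary, so different columns can use different missing value markers. A sketch with hypothetical markers (`missing` and `###`) and an in-memory CSV:

```python
import io
import pandas as pd

csv_data = "sku,quantity,price\nABC1,missing,9.99\nABC2,3,###\n"

# Treat 'missing' as NaN only in quantity, and '###' only in price
df = pd.read_csv(
    io.StringIO(csv_data),
    na_values={'quantity': ['missing'], 'price': ['###']},
)
print(df)
```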
Sometimes when importing data into Pandas things do not go to plan and Pandas will throw an error. There are two main reasons for this, at least in the files I regularly deal with. Firstly, the file encoding may not be utf-8, which causes Pandas to throw an error stating `UnicodeDecodeError: 'utf-8' codec can't decode byte`. These errors are quite common when importing data from Google Analytics. You can usually resolve them by specifying the correct file encoding, i.e. `encoding='utf-16'`.
df = pd.read_csv('sessions.csv', encoding='utf-16', sep='\t')
The other common Pandas error I encounter is `ParserError: Error tokenizing data. C error`. This can occur for a few reasons. Sometimes it's because there are "bad lines" in the data which are incorrectly formatted, so you can use `error_bad_lines=False` to skip those lines and suppress the error. Using the wrong separator value can also cause this; you can fix it by either defining the right separator in the `sep` argument or by ignoring the bad lines.
df = pd.read_csv('data.csv', error_bad_lines=False)
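If you're not sure which separator a file uses at all, you can also ask Pandas to sniff it by passing `sep=None` together with the slower `engine='python'`. A sketch with a hypothetical semicolon separated file held in memory:

```python
import io
import pandas as pd

# Semicolon separated data, as exported by some European locales
data = "order;sku;quantity\n1234;ABC1;10\n1235;ABC2;3\n"

# sep=None tells the Python parsing engine to infer the delimiter
df = pd.read_csv(io.StringIO(data), sep=None, engine='python')
print(df.columns.tolist())  # ['order', 'sku', 'quantity']
```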
Matt Clarke, Saturday, March 06, 2021