The massive versatility of Pandas means that you can create dataframes from almost any type of raw data. Whether you have a list, a list of lists, a dictionary, a dictionary of lists, a list of dictionaries, some tuples, a NumPy array, or something else, you can turn your data into a Pandas dataframe. Here’s how it’s done.
There are numerous ways to create a Pandas dataframe from scratch. The most commonly used is to create a dictionary containing a list of values for each column (or series) you want to add, then pass the dictionary to pd.DataFrame(), along with a list of the corresponding column names in the columns argument.
import pandas as pd
data = {'Model': ['Jaguar XE', 'Jaguar XF', 'Jaguar XJ'],
        'Price from': [29635, 32585, 56020]}
df = pd.DataFrame(data, columns=['Model', 'Price from'])
df
| | Model | Price from |
---|---|---|
0 | Jaguar XE | 29635 |
1 | Jaguar XF | 32585 |
2 | Jaguar XJ | 56020 |
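One detail worth knowing: when the data is a dictionary, the columns argument selects and orders the dictionary keys rather than renaming them. The sketch below reuses the Jaguar data and adds a hypothetical 'Colour' name that is not in the dictionary, purely to illustrate what happens in that case.

```python
import pandas as pd

data = {'Model': ['Jaguar XE', 'Jaguar XF', 'Jaguar XJ'],
        'Price from': [29635, 32585, 56020]}

# Listing the names in a different order reorders the columns.
reordered = pd.DataFrame(data, columns=['Price from', 'Model'])

# 'Colour' is a hypothetical name with no matching key,
# so the column is created and filled with NaN.
with_missing = pd.DataFrame(data, columns=['Model', 'Colour'])

print(list(reordered.columns))
print(with_missing['Colour'].isna().all())
```

If you only want the columns in the order the dictionary defines them, the columns argument can be left out.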
If you have a single list, you can pass it directly to pd.DataFrame(), along with a list containing the column name, and Pandas will turn it into a dataframe with a single column.
import pandas as pd
models = ['Jaguar XE', 'Jaguar XF', 'Jaguar XJ', 'Jaguar F-Type', 'Jaguar XK']
df = pd.DataFrame(models, columns=['Models'])
df
| | Models |
---|---|
0 | Jaguar XE |
1 | Jaguar XF |
2 | Jaguar XJ |
3 | Jaguar F-Type |
4 | Jaguar XK |
If you have two or more lists, you can use the list(zip()) technique to combine them into rows, pass the result to Pandas, and then define the column names in a list passed to the columns argument.
import pandas as pd
models = ['Jaguar XE', 'Jaguar XF', 'Jaguar XJ', 'Jaguar F-Type', 'Jaguar XK']
prices = [29635, 32585, 56020, 67300, 75392]
df = pd.DataFrame(list(zip(models, prices)), columns=['Models', 'Prices'])
df
| | Models | Prices |
---|---|---|
0 | Jaguar XE | 29635 |
1 | Jaguar XF | 32585 |
2 | Jaguar XJ | 56020 |
3 | Jaguar F-Type | 67300 |
4 | Jaguar XK | 75392 |
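One thing to watch with the zip() technique is that zip() stops at the shortest list, so a missing value silently drops a row. The following sketch uses a deliberately shortened prices list (an assumption made for illustration) and contrasts it with a dictionary of lists, which raises an error instead.

```python
import pandas as pd

models = ['Jaguar XE', 'Jaguar XF', 'Jaguar XJ']
prices = [29635, 32585]  # one price left out on purpose

# zip() stops at the shortest list, so the third model is dropped silently.
rows = list(zip(models, prices))
df = pd.DataFrame(rows, columns=['Models', 'Prices'])
print(len(df))

# A dictionary of lists with unequal lengths raises ValueError instead.
try:
    pd.DataFrame({'Models': models, 'Prices': prices})
    raised = False
except ValueError:
    raised = True
print(raised)
```

Checking that the lists have the same length before zipping them is a cheap way to avoid losing rows.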
If you have a multidimensional list (a list of lists) in which each inner list holds one row of data points, such as the model and torque figure in the example below, you can pass it as the first argument to pd.DataFrame() and pass the list of column names to the columns argument.
import pandas as pd
cars = [['BMW 435d', '465 ft lb'],
        ['BMW M3', '406 ft lb'],
        ['BMW M4', '406 ft lb'],
        ['BMW M5', '553 ft lb']]
df = pd.DataFrame(cars, columns=['Model', 'Torque'])
df
| | Model | Torque |
---|---|---|
0 | BMW 435d | 465 ft lb |
1 | BMW M3 | 406 ft lb |
2 | BMW M4 | 406 ft lb |
3 | BMW M5 | 553 ft lb |
Another quick way to create a Pandas dataframe from a dictionary is to use the from_dict() method. With this approach, Pandas takes the keys of the dictionary (i.e. Model and Wheelbase in this example) and uses them as the column names.
import pandas as pd
data = {'Model': ['Defender 90', 'Defender 110', 'Defender 130'],
        'Wheelbase': ['90 inches', '110 inches', '130 inches']}
df = pd.DataFrame.from_dict(data)
df
| | Model | Wheelbase |
---|---|---|
0 | Defender 90 | 90 inches |
1 | Defender 110 | 110 inches |
2 | Defender 130 | 130 inches |
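from_dict() also accepts an orient argument for dictionaries where each key represents a row rather than a column. The sketch below reshapes the Defender data into that layout (the reshaping is my own assumption, not part of the original example) and shows one way it might be loaded.

```python
import pandas as pd

# Hypothetical reshaping of the Defender data: one key per row.
data = {'Defender 90': ['90 inches'],
        'Defender 110': ['110 inches'],
        'Defender 130': ['130 inches']}

# orient='index' turns the keys into row labels;
# columns supplies the names for the values in each list.
df = pd.DataFrame.from_dict(data, orient='index', columns=['Wheelbase'])
print(df)
```

With the default orient='columns', the same keys would become column names instead.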
If you have a list containing one or more dictionaries with the same format, you can pass the list to the from_records() method. Like from_dict(), this is quite a time saver, because it automatically uses the dictionary keys as the column headers.
import pandas as pd
data = [{'Species': 'Esox lucius', 'Weight': 4272},
        {'Species': 'Perca fluviatilis', 'Weight': 1022},
        {'Species': 'Salmo trutta', 'Weight': 3832}]
df = pd.DataFrame.from_records(data)
df
| | Species | Weight |
---|---|---|
0 | Esox lucius | 4272 |
1 | Perca fluviatilis | 1022 |
2 | Salmo trutta | 3832 |
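A list of dictionaries can also be passed straight to pd.DataFrame(), and it is useful to know what happens when a record is missing a key. In the sketch below, the gap in the third record is made up for illustration.

```python
import pandas as pd

# The third record has no 'Weight' key (a made-up gap for illustration).
data = [{'Species': 'Esox lucius', 'Weight': 4272},
        {'Species': 'Perca fluviatilis', 'Weight': 1022},
        {'Species': 'Salmo trutta'}]

# The missing value is filled with NaN, which makes the column float.
df = pd.DataFrame(data)
print(df['Weight'].isna().tolist())
```

Because NaN is a float, the Weight column is no longer stored as integers once a value is missing.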
If your data is currently held in a number of separate lists, you can combine them in a dictionary and pass the dictionary to pd.DataFrame(). The key names assigned to the dictionary will be used to set the Pandas column names.
import pandas as pd
species = ["Salmo trutta", "Thymallus thymallus", "Phoxinus phoxinus"]
length = [91, 35, 6]
# Named data rather than dict to avoid shadowing Python's built-in dict type
data = {'Species': species, 'Length': length}
df = pd.DataFrame(data)
df
| | Species | Length |
---|---|---|
0 | Salmo trutta | 91 |
1 | Thymallus thymallus | 35 |
2 | Phoxinus phoxinus | 6 |
Tuples are slightly less common than dictionaries and lists. However, the approach to building a dataframe from tuples is much the same. You simply pass the list of tuples as the first argument of from_records() and pass a list of column names to the columns argument.
import pandas as pd
data = [('Esox lucius', 'Llyn Brenig'),
        ('Salmo trutta', 'River Dee'),
        ('Phoxinus phoxinus', 'River Ceiriog')]
df = pd.DataFrame.from_records(data, columns=['Species', 'Location'])
df
| | Species | Location |
---|---|---|
0 | Esox lucius | Llyn Brenig |
1 | Salmo trutta | River Dee |
2 | Phoxinus phoxinus | River Ceiriog |
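If your tuples are named tuples, the column names can come from the tuple fields instead of a separate list. The sketch below assumes a hypothetical Catch record type, which is not part of the original example.

```python
from collections import namedtuple

import pandas as pd

# Catch is a hypothetical record type used only for this example.
Catch = namedtuple('Catch', ['Species', 'Location'])
data = [Catch('Esox lucius', 'Llyn Brenig'),
        Catch('Salmo trutta', 'River Dee')]

# The field names become the column names, so no columns argument is needed.
df = pd.DataFrame(data)
print(list(df.columns))
```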
Perhaps the most common way to create a dataframe in Pandas is to create it by importing data from another source, such as a CSV file or Excel spreadsheet. Here’s a really simple example, but for more details on this technique please check out my other guide to importing data in Pandas.
import pandas as pd
df = pd.read_csv('../sitemap.csv')
df.head()
| | url |
---|---|
0 | http://flyandlure.org/ |
1 | http://flyandlure.org/about |
2 | http://flyandlure.org/terms |
3 | http://flyandlure.org/privacy |
4 | http://flyandlure.org/copyright |
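If you want to try read_csv() without a file on disk, it also accepts file-like objects. The sketch below uses io.StringIO as a stand-in for the CSV file, with contents that mirror the first rows of the sitemap example.

```python
import io

import pandas as pd

# Stand-in for a file on disk; the contents mirror the sitemap example.
csv_text = "url\nhttp://flyandlure.org/\nhttp://flyandlure.org/about\n"

# The first line of the text is treated as the header row.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```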
NumPy arrays are slightly more challenging to import. However, a structured array with named fields, like the one below, can also be handled by the from_records() method, which uses the field names as the column names.
import pandas as pd
import numpy as np
data = np.array([(43, 'a'), (35, 'b'), (27, 'c'), (13, 'd')],
                dtype=[('Score', 'i4'), ('Segment', 'U1')])
df = pd.DataFrame.from_records(data)
df
| | Score | Segment |
---|---|---|
0 | 43 | a |
1 | 35 | b |
2 | 27 | c |
3 | 13 | d |
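A plain two-dimensional NumPy array, without named fields, can be passed straight to pd.DataFrame() in the same way as a list of lists. In the sketch below, the Rank column and its values are hypothetical and only there to give the array a second column.

```python
import numpy as np
import pandas as pd

# A plain 2D array has no field names, so the column names are supplied here.
# 'Rank' and its values are hypothetical.
scores = np.array([[43, 1], [35, 2], [27, 3], [13, 4]])
df = pd.DataFrame(scores, columns=['Score', 'Rank'])
print(df.dtypes)
```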
Matt Clarke, Tuesday, March 02, 2021