When working with Pandas dataframes you’ll often need to drop, remove, or delete columns or rows from a dataframe to leave you with a clean dataframe containing tidier data you can use in your reports, visualisations, analyses, or models. As you’d expect, Pandas makes it very easy to drop data from dataframes and in this tutorial we’ll work through some simple practical examples to see how it’s done.
There are four main ways to drop columns or rows from a Pandas dataframe: the drop()
function is used to drop columns and rows by condition; the dropna()
function is used to drop rows or columns that contain NaN
values or missing values; the drop_duplicates()
function is used to drop rows or columns containing non-unique values when de-duping data; and the pop()
method is used to drop a column and return the dropped series.
These versatile functions let you delete, remove, or drop selected columns or rows, whatever your needs, and they’re all very easy to learn to use.
Method | Use |
---|---|
drop() |
The drop() function is used to drop columns or rows from a Pandas dataframe. Primarily, it's used to drop rows by column name or index, but can also be used to drop values that match certain criteria, such as rows that do or do not contain a certain value, or that are of a certain dtype. |
dropna() |
The dropna() function is used to drop rows or columns that contain missing values. It's commonly used in machine learning to remove NaN values prior to training a model. |
drop_duplicates() |
The drop_duplicates() function is used to drop rows that contain duplicate values. It's commonly used for de-duping or de-duplicating data. |
pop() |
The pop() function is used to drop a single column from a Pandas dataframe and can also be used to return the dropped series as a variable. |
In this tutorial, we’ll go over the commonly used drop()
function. You can find out how to use drop_duplicates()
in my other post on de-duping data with Pandas.
First, we’ll look at the drop()
method. This is the main Pandas function you’ll use for dropping columns and rows, either by name, index, or by condition. It’s really powerful.
The Pandas drop()
method is used by appending the method to the name of a dataframe, i.e. df.drop()
and can be used to delete or remove rows, columns or series from the original dataframe. There are some required arguments you must pass to the method. If you run df.drop()
without these you’ll get back an error that says ValueError: Need to specify at least one of 'labels', 'index' or 'columns'
.
Parameter | Description |
---|---|
labels |
The labels parameter is used to specify the row or column labels of the DataFrame to be dropped. It takes either a single value or a list (or list-like). |
axis |
The axis parameter is used to specify the axis along which the labels are to be dropped. It takes either 0 for index or 1 for columns. The default value is 0 , so when you call drop() on its own it will drop via index. |
index |
The index parameter is used to specify the row labels of the DataFrame to be dropped. It takes either a single value or a list (or list-like). It's an alternative to specifying both labels, axis=0 and does the same as columns=labels . |
columns |
The columns parameter is used to specify the column labels of the DataFrame to be dropped. It takes either a single value or a list (or list-like). It's an alternative to specifying both labels, axis=1 and does the same as columns=labels . |
inplace |
The inplace parameter is used to specify whether the operation is to be performed on the DataFrame itself or a copy of it. It takes either True or False . The default value is False , so when you call drop() on its own it will return a copy of the DataFrame with the specified rows or columns dropped. |
To get started working through these simple examples, first open a Jupyter notebook and import the Pandas and Numpy libraries.
import pandas as pd
import numpy as np
Next, either import data into a Pandas dataframe, or create a simple dataframe containing a few values to drop. If you use info()
to print the details of your dataframe, you’ll see that we have a mixture of Pandas data types or dtypes in here, as well as some missing values, and some duplicates.
data = {'model': ['Defender 90', 'Defender 110', 'Defender 130'],
'wheelbase': ['90 inches', '110 inches', '130 inches'],
'bhp': [140, 140, 140],
'reliability': ['High', 'High', 'High'],
'desirability': ['High', 'High', 'High'],
'looks': ['High', 'Medium', np.NaN],
'price': [35000.99, 45999.99, 59999.99],
'year': [2016, 2016, 0]
}
df_defenders = pd.DataFrame.from_dict(data)
df_defenders
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | 140 | High | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
df_defenders.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 3 non-null object
1 wheelbase 3 non-null object
2 bhp 3 non-null int64
3 reliability 3 non-null object
4 desirability 3 non-null object
5 looks 2 non-null object
6 price 3 non-null float64
7 year 3 non-null int64
dtypes: float64(1), int64(2), object(5)
memory usage: 320.0+ bytes
The labels
parameter can be used to define which column or columns you want to drop. To drop a single column using labels
we simply pass the column or series name as the first argument. We need to set axis=1
to tell drop()
to use columns and not the index.
df = df_defenders.copy()
df = df.drop('bhp', axis=1)
df
model | wheelbase | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | High | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | High | High | NaN | 59999.99 | 0 |
While you don’t need to specify the name of labels
in your call to drop
, since it’s the default argument, it makes your code more readable and easier to maintain.
df = df_defenders.copy()
df = df.drop(labels='bhp', axis=1)
df
model | wheelbase | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | High | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | High | High | NaN | 59999.99 | 0 |
To drop multiple columns from the dataframe using labels
you simply change the string value passed in the first argument to a list of column names, and again set axis=1
to tell the drop()
method to use columns and not the index.
df = df_defenders.copy()
df = df.drop(labels=['bhp', 'reliability'], axis=1)
df
model | wheelbase | desirability | looks | price | year | |
---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | High | NaN | 59999.99 | 0 |
The labels
method can also be used to drop a single specific row by passing in its index value. To do this, I’m passing in df.index[0]
to remove the row that contains the model
“Defender 90”.
df = df_defenders.copy()
df = df.drop(labels=df.index[0])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
1 | Defender 110 | 110 inches | 140 | High | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
To delete multiple rows from the dataframe using their index, we can pass a list of values to the first labels
argument and set axis=0
to tell drop()
to use the index and not the column name.
df = df_defenders.copy()
df = df.drop(labels=[0, 1], axis=0)
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
The inplace
parameter is used to modify the original dataframe and prevents the need to call the code and reassign the output back to the dataframe variable. For example, df.drop(labels='bhp', axis=1, inplace=True)
is exactly the same as df = df.drop(labels='bhp', axis=1)
. The inplace
parameter takes a True
or False
boolean value and defaults to False
which returns a copy of the original dataframe with the chosen columns or rows removed, to which you can reassign back to the original dataframe variable or a new one.
df = df_defenders.copy()
df.drop(labels='bhp', axis=1, inplace=True)
df
model | wheelbase | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | High | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | High | High | NaN | 59999.99 | 0 |
The columns
parameter is the one you’ll see data scientists use most often, because it’s the most concise. At first glance, it might look like it does the same thing as labels
, and it kind of does, but it’s not quite the same. Since the labels
method requires you to set axis=1
to tell the drop()
method you’re referring to columns not indices, you need to enter df1 = df1.drop('bhp', axis=1)
to drop a single column, but with columns you can reduce this to df = df.drop(columns='bhp')
.
df = df_defenders.copy()
df = df.drop(columns='bhp')
df
model | wheelbase | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | High | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | High | High | NaN | 59999.99 | 0 |
As with labels
, you can also pass in a list of column names to drop multiple columns at once. However, there’s no need to provide axis=1
, as Pandas knows you’re dropping columns by name.
df = df_defenders.copy()
df = df.drop(columns=['bhp', 'reliability'])
df
model | wheelbase | desirability | looks | price | year | |
---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | High | Medium | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | High | NaN | 59999.99 | 0 |
The Pandas select_dtypes()
function is also compatible with the drop()
function and can be used to drop columns based on their Pandas dtype or data type. To use select_dtypes()
to drop Pandas columns according to their dtype you need to pass the function to the columns
parameter of drop with the include
argument set to the dtype you wish to include, and then append the Pandas columns
method to the end.
For example, df.drop(columns=df.select_dtypes(include='int').columns)
will use select_dtypes()
to select all columns that have an int
or integer dtype and then return a list of their column names to the columns
parameter of drop()
.
df = df_defenders.copy()
df = df.drop(columns=df.select_dtypes(include='int').columns)
df
model | wheelbase | reliability | desirability | looks | price | |
---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | High | 35000.99 |
1 | Defender 110 | 110 inches | High | High | Medium | 45999.99 |
2 | Defender 130 | 130 inches | High | High | NaN | 59999.99 |
You can, of course, substitute the int
dtype with any Pandas dtype. For example, float
for floating point or decimal numbers, or object
for strings.
df = df_defenders.copy()
df = df.drop(columns=df.select_dtypes(include='float').columns)
df
model | wheelbase | bhp | reliability | desirability | looks | year | |
---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 2016 |
1 | Defender 110 | 110 inches | 140 | High | High | Medium | 2016 |
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 0 |
To drop the columns that contain strings or object
dtype values you can pass include='object'
to select_dtypes()
.
df = df_defenders.copy()
df = df.drop(columns=df.select_dtypes(include='object').columns)
df
bhp | price | year | |
---|---|---|---|
0 | 140 | 35000.99 | 2016 |
1 | 140 | 45999.99 | 2016 |
2 | 140 | 59999.99 | 0 |
If you want to drop all columns that contain numeric data you can use Numpy’s np.number
method to return all numeric data and then pass that to select_dtypes()
and pass that to drop()
.
df = df_defenders.copy()
df = df.drop(columns=df.select_dtypes(include=np.number).columns)
df
model | wheelbase | reliability | desirability | looks | |
---|---|---|---|---|---|
0 | Defender 90 | 90 inches | High | High | High |
1 | Defender 110 | 110 inches | High | High | Medium |
2 | Defender 130 | 130 inches | High | High | NaN |
To drop all non-numeric columns and return only the numeric columns, you can use the exclude
argument when calling select_dtypes()
. This gives us a dataframe containing only the float
or int
dtypes, with the object
columns dropped.
df = df_defenders.copy()
df = df.drop(columns=df.select_dtypes(exclude=np.number).columns)
df
bhp | price | year | |
---|---|---|---|
0 | 140 | 35000.99 | 2016 |
1 | 140 | 45999.99 | 2016 |
2 | 140 | 59999.99 | 0 |
Pandas also lets you drop columns by condition. There are loads of ways to do this. First, we’ll look at dropping column names that contain a given substring in the column name. We can do this by creating a for
loop and looping over the column names and return the name if they contain the substring, then return the list of columns containing the substring to drop()
to remove them.
df = df_defenders.copy()
df = df.drop(columns=[col for col in df.columns if 'y' in col])
df
model | wheelbase | bhp | looks | price | |
---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | 35000.99 |
1 | Defender 110 | 110 inches | 140 | Medium | 45999.99 |
2 | Defender 130 | 130 inches | 140 | NaN | 59999.99 |
You can also drop Pandas columns by condition where the column value contains a string or substring. This works just like the code above, but calls df[col].values
instead.
df = df_defenders.copy()
df = df.drop(columns=[col for col in df.columns if 'High' in df[col].values])
df
<ipython-input-18-1e45f1ff6166>:2: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
df = df.drop(columns=[col for col in df.columns if 'High' in df[col].values])
model | wheelbase | bhp | price | year | |
---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | 140 | 45999.99 | 2016 |
2 | Defender 130 | 130 inches | 140 | 59999.99 | 0 |
You can also drop Pandas dataframe rows by condition where a column value contains a string. To do this, you loop over the rows, look for the substring in the row values, and return the list of matches to the index
argument of drop()
and the matching rows will be removed.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 'Medium' in df.loc[row].values])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 35000.99 | 2016 |
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
If you’re working with numeric data, such as int
or float
dtypes, you can drop rows where a column value is greater than a target value by looping over the rows and returning the index of those matching rows to index
when calling drop()
.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 50000 > df.loc[row, 'price']])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
The same approach can be used to drop rows where the column value is greater than a given value.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 50000 < df.loc[row, 'price']])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | 140 | High | High | Medium | 45999.99 | 2016 |
To drop dataframe rows where a column value is equal to a specific value you can also use the for
loop approach to return the index of matching rows to index
when calling drop()
.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 45999.99 == df.loc[row, 'price']])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 35000.99 | 2016 |
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
To drop rows where a value is less than or equal to a specific value you use the <=
operator.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 45999.99 <= df.loc[row, 'price']])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 35000.99 | 2016 |
To drop rows where a value is greater than or equal to a threshold value you use the >=
operator.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 45999.99 >= df.loc[row, 'price']])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
Similarly, to drop rows where a column value is not equal to a given value you can simply switch from the ==
operator to the !=
operator, for “not equals”.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 45999.99 != df.loc[row, 'price']])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
1 | Defender 110 | 110 inches | 140 | High | High | Medium | 45999.99 | 2016 |
To drop those rows where a column value falls between two numeric values you can call the values separately, for example row for row in df.index if 35000 < df.loc[row, 'price'] < 50000
and then pass the indices of the matching rows to the drop()
function’s index
parameter.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 35000 < df.loc[row, 'price'] < 50000])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 | 0 |
If you want to drop any columns that contain a zero you can use col for col in df.columns if 0 in df[col].values
.
df = df_defenders.copy()
df = df.drop(columns=[col for col in df.columns if 0 in df[col].values])
df
model | wheelbase | bhp | reliability | desirability | looks | price | |
---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 35000.99 |
1 | Defender 110 | 110 inches | 140 | High | High | Medium | 45999.99 |
2 | Defender 130 | 130 inches | 140 | High | High | NaN | 59999.99 |
To drop any rows that contain a zero in the row values you can use row for row in df.index if 0 in df.loc[row].values
.
df = df_defenders.copy()
df = df.drop(index=[row for row in df.index if 0 in df.loc[row].values])
df
model | wheelbase | bhp | reliability | desirability | looks | price | year | |
---|---|---|---|---|---|---|---|---|
0 | Defender 90 | 90 inches | 140 | High | High | High | 35000.99 | 2016 |
1 | Defender 110 | 110 inches | 140 | High | High | Medium | 45999.99 | 2016 |
A commonly used snippet in machine learning is to drop all columns except those in a list. This can be done a couple of ways, but the not in
for loop is an easy one.
df = df_defenders.copy()
df = df.drop(columns=[col for col in df.columns if col not in ['model', 'price']])
df
model | price | |
---|---|---|
0 | Defender 90 | 35000.99 |
1 | Defender 110 | 45999.99 |
2 | Defender 130 | 59999.99 |
Matt Clarke, Thursday, November 24, 2022