One of the first things you’ll do whenever you load data into a Pandas dataframe is view it to check that it’s formatted correctly and see what you’re dealing with. It’s an important step since, by a commonly quoted estimate, about 80% of what we data scientists do is, unfortunately, just cleaning and reformatting data before we can do more interesting stuff with it.
Pandas includes a few very useful functions for viewing and checking data in dataframes. In this simple tutorial we’ll go over the head() function for showing the first rows, the tail() function for showing the last rows, the sample() function for showing random rows, and the T attribute for transposing, or flipping the orientation of, the dataframe.
To get started, open a Jupyter notebook, import Pandas and NumPy using the conventional import pandas as pd and import numpy as np aliases, and create a dummy dataset containing some random data. The Pandas shape attribute (note that it takes no parentheses) returns a tuple indicating how many rows and columns are present.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD'))
df.shape
(1000, 4)
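Because np.random.randn() draws fresh values on every run, the numbers you see will differ from the tables in this tutorial. If you’d prefer reproducible dummy data, a minimal sketch is to use a seeded NumPy generator instead. The seed value 42 and the rng variable name are arbitrary choices for illustration, not part of the original setup.

```python
import numpy as np
import pandas as pd

# Seeded generator so the dummy data is identical on every run.
# The seed (42) is an arbitrary value chosen for this sketch.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=list("ABCD"))

# shape is an attribute holding (number of rows, number of columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```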
The Pandas head() function returns the first rows of a dataframe. Its n parameter defaults to 5, so when you call head() without any arguments you’ll get 5 rows back. Display options changed with set_option() affect how many rows are shown on screen, not how many rows head() returns.
df.head()
| | A | B | C | D |
---|---|---|---|---|
0 | -1.695955 | 0.151810 | -1.304380 | 1.117109 |
1 | -0.483403 | 0.229203 | -0.490425 | 0.728589 |
2 | -0.091534 | -1.057842 | 0.325895 | 0.769804 |
3 | 0.962251 | 0.885115 | 0.078876 | -0.723674 |
4 | 0.303216 | -0.204397 | 0.150116 | 0.367043 |
To return a specific number of rows, such as just the first row, you can pass an integer value to the head() function, so df.head(1) will return the first row and df.head(10) will return the first 10 rows.
df.head(1)
| | A | B | C | D |
---|---|---|---|---|
0 | -1.695955 | 0.15181 | -1.30438 | 1.117109 |
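To make the behaviour of the n argument concrete, here is a small sketch that compares head() against an ordinary positional slice. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))

first_default = df.head()   # n defaults to 5
first_ten = df.head(10)     # explicit n

# head(n) selects by position, so it matches the slice df.iloc[:n].
same_as_slice = first_ten.equals(df.iloc[:10])

print(len(first_default), len(first_ten), same_as_slice)  # 5 10 True
```

Since head() returns a new dataframe, you can chain further calls onto it, such as df.head(10).describe().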
When you import Pandas it sets a number of default display options, such as the display.max_rows option, which defines the maximum number of rows displayed. You can find the value of this option by calling pd.get_option('display.max_rows'). Using the full option name avoids ambiguity with other options whose names also contain max_rows. By default, it’s set to 60 rows.
pd.get_option('display.max_rows')
60
If you attempt to print the whole df dataframe, or call the head() function with a value that exceeds the max_rows setting, e.g. df.head(100), Pandas will display the data in a truncated view in which the first five and last five rows are shown, with ellipses in the middle to denote that rows have been hidden. The underlying dataframe still contains every row, as the summary line beneath the table confirms.
df.head(100)
| | A | B | C | D |
---|---|---|---|---|
0 | 0.164807 | -0.286455 | 1.340928 | 0.115890 |
1 | -1.060355 | -0.644209 | -1.364114 | -2.747539 |
2 | 0.464657 | 0.478078 | 0.622145 | 0.294728 |
3 | 0.753968 | -0.275934 | -0.605848 | 1.109735 |
4 | -0.661911 | 0.092234 | 0.951647 | -1.525059 |
... | ... | ... | ... | ... |
95 | -0.653926 | 1.001738 | -0.118835 | -0.291664 |
96 | -1.303779 | -0.465464 | -1.025301 | -1.536585 |
97 | -0.598785 | -0.384292 | 1.649467 | 0.543135 |
98 | 0.389788 | 0.482393 | -0.544763 | -1.138746 |
99 | -1.838366 | -0.515860 | -0.615266 | -0.272381 |
100 rows × 4 columns
If you want to override the default display.max_rows value you can use the Pandas set_option() function to set it to a higher number. Once the limit is at least as large as the number of rows you request, Pandas shows every row instead of the truncated view. This is very useful when you just need to scan the data by eye to check that it looks right.
pd.set_option('display.max_rows', 100)
df.head(100)
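Note that set_option() changes the setting for the rest of the session. If you only want a bigger view for one cell, one option is the pd.option_context() context manager, which restores the previous value when the block ends. This is a sketch of that approach rather than something the steps above require.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))

before = pd.get_option("display.max_rows")

# The higher limit applies only inside the with block.
with pd.option_context("display.max_rows", 100):
    inside = pd.get_option("display.max_rows")
    print(df.head(100))  # printed in full, without the ellipsis row

# Outside the block the previous limit is back in force.
after = pd.get_option("display.max_rows")
print(before, inside, after)
```

If you have already changed the value with set_option(), calling pd.reset_option('display.max_rows') puts it back to the Pandas default.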
The Pandas tail() function works just like the head() function but instead shows the last rows. Calling the function with no arguments will return the last five rows in the dataframe.
df.tail()
| | A | B | C | D |
---|---|---|---|---|
995 | -1.781818 | 0.417116 | -1.995486 | 0.316706 |
996 | -0.738063 | 1.208763 | 1.200226 | 0.143066 |
997 | 0.758632 | -0.186649 | 1.618236 | 0.711830 |
998 | 0.256206 | -1.166524 | 0.709279 | 0.610565 |
999 | 0.681827 | 0.873835 | 1.829247 | -0.641025 |
As with head(), you can also pass an integer value to the tail() function to return a specific number of rows from the bottom of the dataframe, so df.tail(1) will return only the last row.
df.tail(1)
| | A | B | C | D |
---|---|---|---|---|
999 | 0.681827 | 0.873835 | 1.829247 | -0.641025 |
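As a quick check of how tail() counts from the bottom, the sketch below compares it with a negative positional slice. Variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))

last_default = df.tail()   # n defaults to 5
last_three = df.tail(3)    # explicit n

# tail(n) matches the positional slice df.iloc[-n:], and the rows
# keep their original index labels and order.
same_as_slice = last_three.equals(df.iloc[-3:])

print(len(last_default), list(last_three.index), same_as_slice)
```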
The Pandas T attribute, a shorthand for the transpose() function, can be used to flip the orientation of the dataframe, so the columns become rows and the rows become columns. Transposing a dataframe is an extremely useful technique for visually comparing or checking data, especially on dataframes that are wide due to the presence of lots of columns or long column values.
df.head(3).T
| | 0 | 1 | 2 |
---|---|---|---|
A | -1.695955 | -0.483403 | -0.091534 |
B | 0.151810 | 0.229203 | -1.057842 |
C | -1.304380 | -0.490425 | 0.325895 |
D | 1.117109 | 0.728589 | 0.769804 |
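The effect of transposing is easiest to see in the shape of the result. The sketch below, using illustrative variable names, shows the dimensions swapping and confirms that transposing twice gets you back to the original layout.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))

preview = df.head(3)
flipped = preview.T        # equivalent to preview.transpose()

print(preview.shape)       # (3, 4)
print(flipped.shape)       # (4, 3)

# The old column names become the index, and the old index
# labels become the column names.
print(list(flipped.index), list(flipped.columns))

# Transposing a second time restores the original orientation.
round_trip = flipped.T
```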
More rarely, you might also see a negative value being passed to the head() or tail() functions. For example, calling df.head(-10) returns all rows apart from the last 10, while df.tail(-10) returns all rows apart from the first 10. Likewise, df.tail(-990) skips the first 990 rows, which on this 1,000-row dataframe leaves only the last 10.
df.head(-10)
| | A | B | C | D |
---|---|---|---|---|
0 | 0.164807 | -0.286455 | 1.340928 | 0.115890 |
1 | -1.060355 | -0.644209 | -1.364114 | -2.747539 |
2 | 0.464657 | 0.478078 | 0.622145 | 0.294728 |
3 | 0.753968 | -0.275934 | -0.605848 | 1.109735 |
4 | -0.661911 | 0.092234 | 0.951647 | -1.525059 |
... | ... | ... | ... | ... |
985 | 0.505878 | 0.641314 | 0.848925 | 2.262935 |
986 | 0.148572 | -0.984472 | 1.963678 | -0.302820 |
987 | 0.155646 | 0.799404 | -0.867468 | 0.233681 |
988 | 1.879391 | -0.530778 | -0.906801 | -1.321481 |
989 | 0.333411 | -0.520215 | 0.180943 | -0.336810 |
990 rows × 4 columns
df.tail(-990)
| | A | B | C | D |
---|---|---|---|---|
990 | -0.064963 | 0.217709 | -0.767372 | 0.363326 |
991 | -1.285855 | -1.214493 | 0.542552 | -0.511454 |
992 | 0.500009 | 0.864383 | -1.350805 | 0.192343 |
993 | 1.150621 | -1.119834 | 0.054419 | -1.936994 |
994 | -0.182967 | 0.872534 | 0.841756 | -1.004139 |
995 | -0.842824 | -0.120532 | -0.190949 | -0.652673 |
996 | -1.805779 | 0.398528 | -1.638430 | -1.060032 |
997 | 1.519081 | -0.947822 | -1.514677 | 0.031164 |
998 | -0.905574 | 0.761248 | 0.219420 | -0.913892 |
999 | 1.812816 | -0.031498 | -0.910258 | 0.607475 |
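A quick way to confirm which end a negative value trims from is to compare the results against positional slices. This is a minimal sketch with illustrative variable names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))

# head(-n) keeps everything except the LAST n rows.
all_but_last_10 = df.head(-10)
# tail(-n) keeps everything except the FIRST n rows.
all_but_first_10 = df.tail(-10)

print(all_but_last_10.index[0], all_but_last_10.index[-1])    # 0 989
print(all_but_first_10.index[0], all_but_first_10.index[-1])  # 10 999

# Both are equivalent to ordinary positional slices.
head_matches = all_but_last_10.equals(df.iloc[:-10])
tail_matches = all_but_first_10.equals(df.iloc[10:])
```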
If you want to return a random sample of rows, you can use the Pandas sample() function. Like head() and tail(), sample() takes an optional n parameter, which specifies the number of rows to return. If you don’t specify n, it will return a single row. So, df.sample() returns a single random row, and df.sample(3) returns 3 random rows.
df.sample()
| | A | B | C | D |
---|---|---|---|---|
259 | -1.745995 | 0.407777 | 0.596915 | -0.540192 |
df.sample(3)
| | A | B | C | D |
---|---|---|---|---|
691 | 1.004318 | 1.131174 | 0.046987 | -0.391565 |
863 | 0.126247 | 0.584434 | -0.274276 | -0.386791 |
834 | 0.126157 | 0.636516 | 0.694208 | -0.555351 |
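Because sample() picks different rows each time, it can be handy to pin the result while you’re debugging. The sketch below uses two other sample() parameters not covered above: random_state, which makes the draw repeatable, and frac, which asks for a fraction of the rows instead of a fixed count. The seed value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("ABCD"))

one_row = df.sample()                        # n defaults to 1

# The same random_state gives the same rows on every call.
three_rows = df.sample(3, random_state=0)
three_again = df.sample(3, random_state=0)
repeatable = three_rows.equals(three_again)

# frac=0.1 returns 10% of the rows, i.e. 100 of the 1000 here.
ten_percent = df.sample(frac=0.1)

print(len(one_row), len(three_rows), repeatable, len(ten_percent))
```

By default sample() draws without replacement, so the same row won’t appear twice in one result.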
Matt Clarke, Saturday, November 26, 2022