The Pandas sample() function returns a random sample of rows (or columns) from a dataframe. It is useful for quickly checking the data in a dataframe, for example to confirm that the data has been read in correctly or to spot potential issues.
The Pandas head() and tail() functions do a similar job, but only show you the first and last rows of a dataframe. Since the contents of a dataset can vary from top to bottom, a random sample often gives you a better idea of data quality and consistency than head() and tail() alone.
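As a rough illustration of the difference, the sketch below builds a small made-up dataframe (the `df_demo` name and its columns are assumptions for this example only) that is sorted by device, so the first and last rows each show a single category, while a random sample can draw from any part of the data.

```python
import pandas as pd

# Hypothetical dummy dataframe, sorted by device so the top and bottom look uniform
df_demo = pd.DataFrame({
    "device": ["mobile"] * 40 + ["tablet"] * 20 + ["desktop"] * 40,
    "pageviews": range(100),
})

first_rows = df_demo.head(5)     # rows 0-4, all "mobile"
last_rows = df_demo.tail(5)      # rows 95-99, all "desktop"
random_rows = df_demo.sample(5)  # five rows drawn from anywhere in the dataframe

print(first_rows)
print(last_rows)
print(random_rows)
```

The random rows will differ on each run, which is the point: over a few calls you get to see parts of the data that head() and tail() never reach.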
The Pandas sample() method is very simple to use. The main parameter controls the number of rows to return, but there are also several other optional parameters that control how the sample is drawn. They are summarised in the table below, and we’ll go over how each one works next.
Parameter | Description |
---|---|
n | The optional n parameter controls the number of rows returned in the sample. It defaults to None, in which case a single row is returned unless frac is set, so calling df.sample() will return a single random row, and calling df.sample(5) will return five random rows. |
frac | The optional frac parameter controls the fraction of rows returned in the sample. This is set to None by default. Calling df.sample(frac=0.5) will return half of the rows in the dataframe. It cannot be used together with n. |
replace | The optional replace parameter controls whether the sample is drawn with or without replacement. This is set to False by default, so calling df.sample() will return a sample without replacement. If you want to sample with replacement, set replace=True. |
random_state | The optional random_state parameter controls the seed used by the random number generator. This is set to None by default, so calling df.sample() will return a different random sample each time. If you want a reproducible sample, set random_state to an integer. |
axis | The optional axis parameter controls whether to sample rows or columns. Rows (axis 0) are sampled by default, so calling df.sample() will return a random sample of rows. If you want to sample columns, set axis=1. |
weights | The optional weights parameter controls the probability distribution used when sampling. This is set to None by default, so every row is equally likely to be chosen. If you want to sample according to a specific probability distribution, set weights to a column name or to a list or Series of values, one per row. |
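
To make the table more concrete, here is a minimal sketch of the main parameters in action. The `df_demo` dataframe and its columns are made up for illustration and are not part of the dataset used in the rest of this post.

```python
import pandas as pd

# Hypothetical ten-row dataframe used only for this illustration
df_demo = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})

one_row = df_demo.sample()                         # neither n nor frac set: one row
three_rows = df_demo.sample(n=3, random_state=42)  # three rows, reproducible
half = df_demo.sample(frac=0.5, random_state=42)   # 5 of the 10 rows
one_col = df_demo.sample(n=1, axis=1)              # one random column, all rows

# n and frac are alternatives; passing both raises a ValueError
try:
    df_demo.sample(n=3, frac=0.5)
except ValueError as err:
    both_error = str(err)
    print(both_error)
```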
To get started, open a Jupyter notebook and import the Pandas library. Then, either import a dataset into Pandas, or create a dummy dataset of your own.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 0 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-31 | 3 |
| 1 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-14 | 1 |
| 2 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-07-14 | 1 |
| 3 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-07 | 1 |
| 4 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-12 | 1 |
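
If you would rather build a dummy dataset than download one, something along these lines should work. This is only a sketch: the column names loosely mirror the dataset above, but the values, the `df_dummy` name, and the row count are all invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the dummy data is the same on every run
n_rows = 1000

# Made-up values; only the column names echo the Google Analytics dataset
df_dummy = pd.DataFrame({
    "User Type": rng.choice(["New Visitor", "Returning Visitor"], size=n_rows),
    "Browser": rng.choice(["Chrome", "Safari", "Firefox"], size=n_rows),
    "Device Category": rng.choice(["mobile", "tablet", "desktop"], size=n_rows),
    # Random dates within a 30-day window starting 2020-07-14
    "Date": pd.Timestamp("2020-07-14")
            + pd.to_timedelta(rng.integers(0, 30, size=n_rows), unit="D"),
    "Pageviews": rng.integers(1, 10, size=n_rows),  # integers from 1 to 9
})

print(df_dummy.head())
```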
To use the sample() function, you simply call it on the Pandas dataframe object. Since the n parameter is optional, calling it with no arguments returns a single random row.
df.sample()
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 3500 | New Visitor | duckduckgo | organic | Safari | desktop | 2020-08-02 | 1 |
To return a larger random sample of rows from the dataframe, pass a number to the sample() method as the n parameter. This number represents the number of rows you want to return.
df.sample(5) does exactly the same thing as df.sample(n=5), so you don’t need to name the parameter, but you might want to do so to make your code clearer to read, especially when you’re adding other optional parameters.
df.sample(5)
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 1925 | New Visitor | (direct) | (none) | Safari | mobile | 2020-08-08 | 1 |
| 1013 | New Visitor | (direct) | (none) | Internet Explorer | desktop | 2020-08-10 | 1 |
| 2494 | New Visitor | bing | organic | Amazon Silk | tablet | 2020-08-10 | 1 |
| 8632 | New Visitor | | organic | Chrome | mobile | 2020-07-23 | 1 |
| 2246 | New Visitor | (direct) | (none) | Samsung Internet | mobile | 2020-07-26 | 1 |
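
One way to convince yourself that the positional and keyword forms are equivalent is to fix the seed and compare the results. The `df_demo` dataframe here is a made-up stand-in, not the dataset above.

```python
import pandas as pd

# Hypothetical 20-row dataframe for illustration
df_demo = pd.DataFrame({"Pageviews": range(1, 21)})

positional = df_demo.sample(5, random_state=0)   # n passed positionally
keyword = df_demo.sample(n=5, random_state=0)    # n passed by name

same = positional.equals(keyword)
print(same)  # True: both calls draw the same five rows
```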
By default, the sample() function will return a random sample of rows. To return a random sample of columns, use the axis parameter and pass in 1. Calling df.sample(axis=1) will return a single random column.
df.sample(axis=1)
| | Device Category |
|---|---|
| 0 | mobile |
| 1 | mobile |
| 2 | tablet |
| 3 | tablet |
| 4 | tablet |
| ... | ... |
| 9995 | mobile |
| 9996 | mobile |
| 9997 | mobile |
| 9998 | mobile |
| 9999 | mobile |
10000 rows × 1 columns
To return a random sample of several columns from the dataframe, pass a number to the n parameter and set axis to 1. For example, to return a random sample of 3 columns, you would use the following code:
df.sample(3, axis=1)
| | Medium | User Type | Device Category |
|---|---|---|---|
| 0 | (none) | New Visitor | mobile |
| 1 | (none) | New Visitor | mobile |
| 2 | (none) | New Visitor | tablet |
| 3 | (none) | New Visitor | tablet |
| 4 | (none) | New Visitor | tablet |
| ... | ... | ... | ... |
| 9995 | organic | New Visitor | mobile |
| 9996 | organic | New Visitor | mobile |
| 9997 | organic | New Visitor | mobile |
| 9998 | organic | New Visitor | mobile |
| 9999 | organic | New Visitor | mobile |
10000 rows × 3 columns
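As the output above suggests, sampled columns can come back in a different order from the original dataframe. The sketch below, which uses a made-up `df_demo` dataframe, shows column sampling and one possible way to restore the original column order afterwards.

```python
import pandas as pd

# Hypothetical dataframe with five columns and three rows
df_demo = pd.DataFrame({
    "User Type": ["New Visitor"] * 3,
    "Source": ["(direct)", "bing", "duckduckgo"],
    "Medium": ["(none)", "organic", "organic"],
    "Browser": ["Safari", "Chrome", "Safari"],
    "Pageviews": [1, 2, 3],
})

# Sample two columns; axis="columns" is an accepted alias for axis=1
two_cols = df_demo.sample(n=2, axis="columns", random_state=1)

# Optionally put the sampled columns back into their original order
ordered = df_demo[[col for col in df_demo.columns if col in two_cols.columns]]

print(two_cols)
print(ordered)
```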
The Pandas sample() function also supports a random_state parameter. This is much like the random_state parameter used when calling the scikit-learn train_test_split() function, in that it controls whether you get a consistent, reproducible sample each time the function runs.
First, let’s see what happens when you don’t set a random_state and call df.sample(1) twice. You’ll notice that we get a different random row each time, rather than the same row.
df.sample(1)
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 4247 | New Visitor | | organic | Chrome | desktop | 2020-07-30 | 1 |
df.sample(1)
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 1984 | New Visitor | (direct) | (none) | Safari | tablet | 2020-08-10 | 1 |
To get a consistent and reproducible random sample, where the same random row appears every time you run the code, you simply pass an integer seed value to the random_state parameter. Setting a different number will give you a different random row.
df.sample(1, random_state=1)
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 9953 | New Visitor | | organic | Chrome | mobile | 2020-07-24 | 1 |
df.sample(1, random_state=1)
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 9953 | New Visitor | | organic | Chrome | mobile | 2020-07-24 | 1 |
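
If you want to check reproducibility in code rather than by eye, you can compare two seeded samples directly. This is a small sketch on a made-up `df_demo` dataframe.

```python
import pandas as pd

# Hypothetical 100-row dataframe for illustration
df_demo = pd.DataFrame({"Pageviews": range(1, 101)})

first = df_demo.sample(3, random_state=1)
second = df_demo.sample(3, random_state=1)

# The same seed on the same data gives the same rows in the same order
is_reproducible = first.equals(second)
print(is_reproducible)  # True
```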
Another really useful parameter of the sample() function is frac. This allows you to specify the fraction of the data you want to sample. For example, if you want to sample 30% of the data, you can use frac=0.3. It’s a great way to quickly create a random sample of data.
When using frac you don’t specify the n parameter, since the two cannot be used together. To get reproducible results between code runs, you can also optionally pass in the random_state.
df_validation = df.sample(frac=0.3)
df_validation.head()
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 1625 | New Visitor | (direct) | (none) | Safari | mobile | 2020-08-10 | 1 |
| 7173 | New Visitor | | organic | Chrome | mobile | 2020-07-18 | 2 |
| 7395 | New Visitor | | organic | Chrome | mobile | 2020-08-08 | 4 |
| 5619 | New Visitor | | organic | Chrome | desktop | 2020-08-10 | 1 |
| 1272 | New Visitor | (direct) | (none) | Safari | desktop | 2020-07-18 | 1 |
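
A common follow-up, sketched below on a made-up `df_demo` dataframe, is to keep the remaining 70% as a training set by dropping the sampled rows. This assumes the dataframe index is unique; the `df_train` name is just for illustration. Passing frac=1 is also a handy way to shuffle every row.

```python
import pandas as pd

# Hypothetical 100-row dataframe with a default, unique integer index
df_demo = pd.DataFrame({"Pageviews": range(1, 101)})

# 30% of the rows, reproducible thanks to the fixed seed
df_validation = df_demo.sample(frac=0.3, random_state=42)

# Everything that was not sampled (relies on the index being unique)
df_train = df_demo.drop(df_validation.index)

# frac=1 returns all rows in random order, i.e. a shuffled copy
df_shuffled = df_demo.sample(frac=1, random_state=42)

print(len(df_validation), len(df_train), len(df_shuffled))  # 30 70 100
```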
By default, when you call sample() on a DataFrame, it samples without replacement, so each row can appear only once within the random sample returned. If you want to allow sample() to return the same row more than once, you can use the replace parameter.
For example, to get back five random rows, where there’s a chance that the same row appears multiple times in the sample, you’d call df.sample(n=5, replace=True). It doesn’t guarantee that any row will appear more than once, but it doesn’t prevent rows from reappearing as the default of replace=False does.
df.sample(n=5, replace=True)
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 4697 | New Visitor | | organic | Chrome | desktop | 2020-08-01 | 1 |
| 4151 | New Visitor | | organic | Chrome | desktop | 2020-08-02 | 1 |
| 9626 | New Visitor | | organic | Chrome | mobile | 2020-08-02 | 1 |
| 1100 | New Visitor | (direct) | (none) | Safari | desktop | 2020-07-19 | 1 |
| 1308 | New Visitor | (direct) | (none) | Safari | desktop | 2020-07-19 | 1 |
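
With 10,000 rows and a sample of five, repeats are unlikely, so the effect is easier to see on a tiny dataframe. The sketch below uses a made-up three-row `df_small` and asks for ten rows, which is only possible with replacement.

```python
import pandas as pd

# Hypothetical three-row dataframe
df_small = pd.DataFrame({"value": [10, 20, 30]})

# Ten draws from three rows: repeats are unavoidable
with_repeats = df_small.sample(n=10, replace=True, random_state=0)
has_repeats = with_repeats.index.duplicated().any()
print(has_repeats)  # True

# Without replacement you cannot sample more rows than exist
try:
    df_small.sample(n=10)
except ValueError as err:
    too_large_error = str(err)
    print(too_large_error)
```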
The weights parameter can be used to set the probability of sampling each row. Basically, this means you can make some rows more likely to appear in the sample than others.
This works slightly differently depending on whether the column you want to weight by has a categorical or a numeric dtype. On a numeric column, such as a hypothetical orders column, you’d use code like df.sample(n=3, weights='orders') and sample() will favour rows with a higher value in the orders column.
If you want to weight by a categorical column, the easiest way to do it is to use a Pandas groupby() with the transform() function to calculate a numeric value for each row. For example, we can use df.groupby('Browser')['Browser'].transform('count') to calculate the number of occurrences of each browser, and then pass that to weights. Rows for the most common browsers are then the most likely to be sampled.
df2 = df.sample(n=5, weights=df.groupby('Browser')['Browser'].transform('count'))
df2
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 615 | New Visitor | (direct) | (none) | Chrome | tablet | 2020-07-15 | 1 |
| 7334 | New Visitor | | organic | Chrome | mobile | 2020-07-24 | 1 |
| 9294 | New Visitor | | organic | Chrome | mobile | 2020-08-12 | 1 |
| 442 | New Visitor | (direct) | (none) | Chrome | mobile | 2020-08-01 | 1 |
| 6346 | New Visitor | | organic | Chrome | desktop | 2020-08-04 | 1 |
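
To see what the weights are doing, it can help to try them on a tiny made-up dataframe. In the sketch below, `df_demo` and its values are invented for illustration. Rows with a weight of zero are never selected, and weighting by the inverse of the count is one possible way to give rare categories a better chance, which is the opposite of what count weighting does.

```python
import pandas as pd

# Hypothetical dataframe: Chrome is common, Firefox is rare
df_demo = pd.DataFrame({
    "Browser": ["Chrome"] * 6 + ["Safari"] * 3 + ["Firefox"],
    "Pageviews": [5, 0, 3, 1, 0, 2, 4, 0, 1, 0],
})

# Numeric column as weights: rows with 0 pageviews can never be picked
by_pageviews = df_demo.sample(n=3, weights="Pageviews", random_state=0)

# Per-row count of each browser: 6 for Chrome rows, 3 for Safari, 1 for Firefox
browser_counts = df_demo.groupby("Browser")["Browser"].transform("count")

# Count weights favour rows from common browsers
by_browser = df_demo.sample(n=3, weights=browser_counts, random_state=0)

# Inverse-count weights give each browser the same total weight
balanced = df_demo.sample(n=3, weights=1 / browser_counts, random_state=0)

print(by_pageviews)
print(by_browser)
print(balanced)
```

The weights do not need to add up to one, since sample() normalises them for you.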
Matt Clarke, Sunday, November 27, 2022