How to use Pandas sample() to show a sample of data

Learn how to use the Pandas sample() function to show a sample of data, including the lesser known frac, weights, and replace parameters.


The Pandas sample() function is used to show a random sample of data from a dataframe. The sample() function is useful for quickly checking the data in a dataframe, and can be used to check that the data is being read in correctly, or check for potential issues.

The Pandas head() and tail() functions do a similar job, but only show the first and last rows of a dataframe. Data is often sorted, or changes part way through a file, so those rows may not be representative. Showing a random sample instead of the head() and tail() gives you a better idea of data quality and consistency across the whole dataframe.
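To illustrate the difference, here is a minimal sketch using a small made-up dataframe. The toy column names and values are assumptions for illustration only, not the dataset used later in this article. Because the rows are sorted by browser, head() only ever sees the first browser, while sample() draws from anywhere in the dataframe.

```python
import pandas as pd

# Hypothetical toy data, sorted by browser as exported files often are
toy = pd.DataFrame({
    "Browser": ["Chrome"] * 50 + ["Firefox"] * 30 + ["Safari"] * 20,
    "Pageviews": range(100),
})

first_rows = toy.head(10)                     # first 10 rows: all Chrome
random_rows = toy.sample(10, random_state=0)  # 10 rows drawn from anywhere

print(first_rows["Browser"].unique())
print(random_rows["Browser"].value_counts())
```

The random sample will usually include more than one browser, which is the kind of variation head() and tail() can hide.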

The Pandas sample() method

The Pandas sample() method is very simple to use. The main parameter, n, controls the number of rows to return, and there are several other optional parameters that control how the sample is drawn. These are listed below. We’ll go over how they work and how you can use them next.

Parameter: Description
n: The number of rows to return. If neither n nor frac is set, one row is returned, so calling df.sample() will show a single random row, and calling df.sample(5) will show five random rows. Cannot be combined with frac.
frac: The fraction of rows to return. This is set to None by default. Calling df.sample(frac=0.5) will show half of the rows in the dataframe. Cannot be combined with n.
replace: Whether to sample with or without replacement. This is set to False by default, so each row can appear at most once in the sample. If you want to allow the same row to be drawn more than once, set replace=True.
random_state: The seed used by the random number generator. This is set to None by default, so calling df.sample() will show a different random sample each time. If you want the same sample on every run, set random_state to an integer.
axis: Whether to sample rows or columns. Rows are sampled by default (axis=0). If you want to sample columns, set axis=1.
weights: The relative probability of each row being drawn. This is set to None by default, which gives every row an equal chance. If you want some rows to be more likely to be sampled than others, set weights to a column name, a Series, or a list of non-negative numbers.
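As a quick illustration of how n and frac behave, the sketch below uses a hypothetical ten-row dataframe (the name toy and its value column are made up for this example). It also shows that passing both n and frac at once raises a ValueError.

```python
import pandas as pd

# Hypothetical ten-row dataframe used only for this illustration
toy = pd.DataFrame({"value": range(10)})

one_row = toy.sample()        # no n or frac given: one row is returned
three_rows = toy.sample(n=3)  # exactly three rows
half = toy.sample(frac=0.5)   # 50% of 10 rows, so 5 rows

# n and frac are mutually exclusive, so passing both raises a ValueError
try:
    toy.sample(n=3, frac=0.5)
    both_allowed = True
except ValueError:
    both_allowed = False

print(len(one_row), len(three_rows), len(half), both_allowed)
```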

Import a Pandas dataframe

To get started, open a Jupyter notebook and import the Pandas library. Then, either import a dataset into Pandas, or create a dummy dataset of your own.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()
User Type Source Medium Browser Device Category Date Pageviews
0 New Visitor (direct) (none) Amazon Silk mobile 2020-07-31 3
1 New Visitor (direct) (none) Amazon Silk mobile 2020-07-14 1
2 New Visitor (direct) (none) Amazon Silk tablet 2020-07-14 1
3 New Visitor (direct) (none) Amazon Silk tablet 2020-08-07 1
4 New Visitor (direct) (none) Amazon Silk tablet 2020-08-12 1
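If you would rather create a dummy dataset than download one, something like the following sketch would work. The column names loosely mirror the Google Analytics file above, but every value here is randomly generated and made up for illustration, including the "Returning Visitor" label and the choice of browsers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the dummy data is reproducible
n_rows = 1000
dates = pd.date_range("2020-07-14", "2020-08-12")

# Every value below is invented; only the column names echo the real file
dummy = pd.DataFrame({
    "User Type": rng.choice(["New Visitor", "Returning Visitor"], size=n_rows),
    "Browser": rng.choice(["Chrome", "Safari", "Firefox"], size=n_rows),
    "Device Category": rng.choice(["desktop", "mobile", "tablet"], size=n_rows),
    "Date": dates[rng.integers(0, len(dates), size=n_rows)],
    "Pageviews": rng.integers(1, 6, size=n_rows),  # whole numbers from 1 to 5
})

print(dummy.head())
```

You can then call dummy.sample() in place of df.sample() in the examples that follow, although the output will differ from the output shown in this article.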

Show a random sample row

To use the sample() function you simply call it on the Pandas dataframe object. Since the n parameter is optional, by default it will return a single random row.

df.sample()
User Type Source Medium Browser Device Category Date Pageviews
3500 New Visitor duckduckgo organic Safari desktop 2020-08-02 1

Show a random sample of rows

To return a random sample of rows from the dataframe, you simply pass a number to the sample() method as the n parameter. This number represents the number of rows you want to return.

Since n is the first parameter, df.sample(5) does exactly the same thing as df.sample(n=5). You might still prefer the longer form to make your code clearer to read, especially when you’re adding in additional optional parameters.

df.sample(5)
User Type Source Medium Browser Device Category Date Pageviews
1925 New Visitor (direct) (none) Safari mobile 2020-08-08 1
1013 New Visitor (direct) (none) Internet Explorer desktop 2020-08-10 1
2494 New Visitor bing organic Amazon Silk tablet 2020-08-10 1
8632 New Visitor google organic Chrome mobile 2020-07-23 1
2246 New Visitor (direct) (none) Samsung Internet mobile 2020-07-26 1

Show a random sample column

By default, the sample() function will return a random sample of rows. To return a random sample of columns, use the axis parameter and pass in 1. Calling df.sample(axis=1) will return a single random column.

df.sample(axis=1)
Device Category
0 mobile
1 mobile
2 tablet
3 tablet
4 tablet
... ...
9995 mobile
9996 mobile
9997 mobile
9998 mobile
9999 mobile

10000 rows × 1 columns

Show a random sample of columns

To return a random sample of columns from the dataframe, you need to pass in a number to the n parameter and set axis to 1. For example, to return a random sample of three columns, you would use the following code:

df.sample(3, axis=1)
Medium User Type Device Category
0 (none) New Visitor mobile
1 (none) New Visitor mobile
2 (none) New Visitor tablet
3 (none) New Visitor tablet
4 (none) New Visitor tablet
... ... ... ...
9995 organic New Visitor mobile
9996 organic New Visitor mobile
9997 organic New Visitor mobile
9998 organic New Visitor mobile
9999 organic New Visitor mobile

10000 rows × 3 columns
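The same idea can be checked on a tiny made-up dataframe (the toy data below is an assumption for illustration). Sampling with axis=1 picks columns at random and keeps every row.

```python
import pandas as pd

# Hypothetical three-row dataframe with four columns
toy = pd.DataFrame({
    "Source": ["google", "bing", "(direct)"],
    "Medium": ["organic", "organic", "(none)"],
    "Browser": ["Chrome", "Safari", "Firefox"],
    "Pageviews": [3, 1, 2],
})

# Pick two of the four columns at random; all three rows are kept
two_columns = toy.sample(n=2, axis=1, random_state=0)

print(two_columns.columns.tolist())
print(two_columns.shape)
```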

Using random_state to get a consistent sample

The Pandas sample() function also supports a random_state parameter. This is much like the random_state parameter used when calling the scikit-learn train_test_split() function in that it controls whether you get a consistent, reproducible sample each time the function runs.

First, let’s see what happens when you don’t set a random_state and call the df.sample(1) method twice. You’ll notice that we get a different random row each time, rather than the same random row.

df.sample(1)
User Type Source Medium Browser Device Category Date Pageviews
4247 New Visitor google organic Chrome desktop 2020-07-30 1
df.sample(1)
User Type Source Medium Browser Device Category Date Pageviews
1984 New Visitor (direct) (none) Safari tablet 2020-08-10 1

To get a consistent and reproducible random sample, where the same random row appears every time you run the code, you simply pass an integer seed value to the random_state parameter. Setting a different number will give you a different random row.

df.sample(1, random_state=1)
User Type Source Medium Browser Device Category Date Pageviews
9953 New Visitor google organic Chrome mobile 2020-07-24 1
df.sample(1, random_state=1)
User Type Source Medium Browser Device Category Date Pageviews
9953 New Visitor google organic Chrome mobile 2020-07-24 1
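You can confirm this behaviour in code rather than by eye. The sketch below uses a hypothetical 100-row dataframe and compares two samples drawn with the same seed.

```python
import pandas as pd

# Hypothetical dataframe used only for this illustration
toy = pd.DataFrame({"value": range(100)})

first = toy.sample(n=5, random_state=1)
second = toy.sample(n=5, random_state=1)

# The same seed gives the same rows in the same order
same_sample = first.equals(second)
print(same_sample)  # True
```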

Using frac to get a fraction of sample data

Another really useful parameter of the sample() function is frac. This allows you to specify the fraction of the data you want to sample. For example, if you want to sample 30% of the data, you can use frac=0.3. It’s a great way to quickly create a random sample of data.

When using frac you don’t specify the n parameter, as the two cannot be combined. To get reproducible results between code runs, you can also optionally pass in the random_state.

df_validation = df.sample(frac=0.3)
df_validation.head()
User Type Source Medium Browser Device Category Date Pageviews
1625 New Visitor (direct) (none) Safari mobile 2020-08-10 1
7173 New Visitor google organic Chrome mobile 2020-07-18 2
7395 New Visitor google organic Chrome mobile 2020-08-08 4
5619 New Visitor google organic Chrome desktop 2020-08-10 1
1272 New Visitor (direct) (none) Safari desktop 2020-07-18 1
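One common use of frac, suggested by the df_validation name above, is splitting a dataframe into two non-overlapping parts. The sketch below is an assumption about how you might do that, not something this article’s dataset requires. It uses a hypothetical 100-row dataframe and relies on the index being unique.

```python
import pandas as pd

# Hypothetical dataframe with a unique default index
toy = pd.DataFrame({"value": range(100)})

# Draw 30% of the rows, then keep everything else as the remainder
validation = toy.sample(frac=0.3, random_state=42)
training = toy.drop(validation.index)

print(len(validation), len(training))  # 30 70
```

Dropping by index only works cleanly when index labels are not repeated, so reset the index first if yours contains duplicates.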

Use replace to allow repeated samples

By default, when you call sample() on a DataFrame, it will not allow repeated samples. Therefore, you’ll only get each row appearing once within the random sample of data returned. If you want to allow sample() to return repeated sample rows, you can use the replace parameter.

For example, to get back five random rows, where there’s a statistical chance that the same row can appear multiple times in the sample dataset, you’d call df.sample(n=5, replace=True). It doesn’t guarantee that any rows will appear more than once, but it doesn’t prevent them from reappearing as the default of replace=False does.

df.sample(n=5, replace=True)
User Type Source Medium Browser Device Category Date Pageviews
4697 New Visitor google organic Chrome desktop 2020-08-01 1
4151 New Visitor google organic Chrome desktop 2020-08-02 1
9626 New Visitor google organic Chrome mobile 2020-08-02 1
1100 New Visitor (direct) (none) Safari desktop 2020-07-19 1
1308 New Visitor (direct) (none) Safari desktop 2020-07-19 1
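The effect of replace is easiest to see on a very small dataframe. In this sketch, which uses made-up data, asking for five rows from a three-row dataframe fails without replacement, and is guaranteed to produce repeats with replacement.

```python
import pandas as pd

# Hypothetical three-row dataframe
toy = pd.DataFrame({"value": ["a", "b", "c"]})

# Asking for more rows than exist raises a ValueError without replacement
try:
    toy.sample(n=5)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

# With replacement, five draws from three rows must repeat at least one row
with_replacement = toy.sample(n=5, replace=True, random_state=0)
has_repeats = with_replacement.index.duplicated().any()

print(oversample_allowed, len(with_replacement), has_repeats)
```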

Setting weights so some rows are more likely to be sampled

The weights parameter can be used to set the probability of sampling a row. Basically, this means you can control the probability with which certain values get sampled, so some are more likely to appear in the sample dataset than others.

The weights themselves must be numeric, so this works slightly differently depending on whether the Pandas column you want to weight by has a categorical or a numeric dtype. On a numeric variable, you can pass the column name directly. Code like df.sample(n=3, weights='Pageviews') will favour those rows with a higher value in the Pageviews column. The weights don’t need to sum to 1, as Pandas normalises them for you.

If you’re running this on a categorical variable column, the easiest way to do it is to use a Pandas groupby() and the transform() function to calculate a value. For example, we can use df.groupby('Browser')['Browser'].transform('count') to calculate the number of occurrences, and then pass that to weights.

df2 = df.sample(n=5, weights=df.groupby('Browser')['Browser'].transform('count'))
df2
User Type Source Medium Browser Device Category Date Pageviews
615 New Visitor (direct) (none) Chrome tablet 2020-07-15 1
7334 New Visitor google organic Chrome mobile 2020-07-24 1
9294 New Visitor google organic Chrome mobile 2020-08-12 1
442 New Visitor (direct) (none) Chrome mobile 2020-08-01 1
6346 New Visitor google organic Chrome desktop 2020-08-04 1
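For the numeric case, a minimal sketch on made-up data looks like this. The toy dataframe and its values are assumptions for illustration. Rows with a higher Pageviews value are more likely to be drawn, and a row with a weight of zero is never drawn.

```python
import pandas as pd

# Hypothetical data: Edge has zero pageviews, so its weight is zero
toy = pd.DataFrame({
    "Browser": ["Chrome", "Safari", "Firefox", "Edge"],
    "Pageviews": [10, 5, 1, 0],
})

# Use the numeric Pageviews column as the sampling weights
weighted = toy.sample(n=2, weights="Pageviews", random_state=0)

print(weighted)
```

Passing the column name is equivalent to passing the column itself, so weights=toy["Pageviews"] would behave the same way.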

Matt Clarke, Sunday, November 27, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.