How to use Pandas sample() to show a sample of data

The Pandas sample() function is used to show a random sample of data from a dataframe. The sample() function is useful for quickly checking the data in a dataframe, and can be used to check that the data is being read in correctly, or check for potential issues.

The Pandas head() and tail() functions do a similar job, but they only show the first and last rows of a dataframe. Values can differ in the middle of a dataset, for example when the data is sorted or its format changes partway through, so a random sample often gives a better idea of data quality and consistency than head() and tail() alone.

The Pandas sample() method

The Pandas sample() method is very simple to use. The main parameter, n, controls the number of rows to show, and several other optional parameters give you more control over the sample. The table below lists them. We’ll go over how these work and how you can use them next.

Parameter	Description
`n`	The optional `n` parameter controls the number of rows returned in the sample. It defaults to `None`, which returns 1 row when `frac` is also unset, so calling `df.sample()` will show a single random row, and calling `df.sample(5)` will show five random rows. It cannot be used together with `frac`.
`frac`	The optional `frac` parameter controls the fraction of rows returned in the sample. This is set to `None` by default, so calling `df.sample()` will show a single random row, and calling `df.sample(frac=0.5)` will show half of the rows in the data frame. It cannot be used together with `n`.
`replace`	The optional `replace` parameter controls whether the sample is with or without replacement. This is set to `False` by default, so calling `df.sample()` will show a sample without replacement. If you want to sample with replacement, set `replace=True`.
`random_state`	The optional `random_state` parameter controls the seed used by the random number generator. This is set to `None` by default, so calling `df.sample()` will show a random sample each time. If you want to sample with a specific seed, set `random_state` to an integer.
`axis`	The optional `axis` parameter controls whether to sample rows or columns. This is set to `0` by default, so calling `df.sample()` will show a random sample of rows. If you want to sample columns, set `axis=1`.
`weights`	The optional `weights` parameter controls the probability distribution used when sampling. This is set to `None` by default, so calling `df.sample()` gives every row an equal chance of being picked. If you want to sample according to a specific probability distribution, set `weights` to a column name, a Series, or a list of values. Weights that don’t sum to 1 are normalised automatically.

Import a Pandas dataframe

To get started, open a Jupyter notebook and import the Pandas library. Then, either import a dataset into Pandas, or create a dummy dataset of your own.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
0	New Visitor	(direct)	(none)	Amazon Silk	mobile	2020-07-31	3
1	New Visitor	(direct)	(none)	Amazon Silk	mobile	2020-07-14	1
2	New Visitor	(direct)	(none)	Amazon Silk	tablet	2020-07-14	1
3	New Visitor	(direct)	(none)	Amazon Silk	tablet	2020-08-07	1
4	New Visitor	(direct)	(none)	Amazon Silk	tablet	2020-08-12	1
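
If you would rather not download a file, a dummy dataset works just as well for trying out sample(). The following is a minimal sketch: the column names mirror the Google Analytics export above, but the values, the `dummy_df` name, and the row count are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: random values, not real analytics figures.
rng = np.random.default_rng(42)
n_rows = 1000

# Random dates within a 30-day window starting on 2020-07-14.
dates = pd.Timestamp("2020-07-14") + pd.to_timedelta(
    rng.integers(0, 30, size=n_rows), unit="D"
)

dummy_df = pd.DataFrame({
    "User Type": rng.choice(["New Visitor", "Returning Visitor"], size=n_rows),
    "Browser": rng.choice(["Chrome", "Safari", "Firefox", "Edge"], size=n_rows),
    "Device Category": rng.choice(["desktop", "mobile", "tablet"], size=n_rows),
    "Date": dates,
    "Pageviews": rng.integers(1, 10, size=n_rows),  # integers from 1 to 9
})

print(dummy_df.head())
```

Any of the sample() calls in this article should work on `dummy_df` in the same way as on the downloaded dataframe.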

Show a random sample row

To use the sample() function you simply call it on the Pandas dataframe object. Since the n parameter is optional, it returns a single random row by default.

df.sample()

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
3500	New Visitor	duckduckgo	organic	Safari	desktop	2020-08-02	1

Show a random sample of rows

To return a random sample of rows from the dataframe, you simply add a number to the sample() method - the n parameter. This number represents the number of rows you want to return.

You don’t need to write df.sample(n=5), as df.sample(5) will do exactly the same thing. However, naming the parameter can make your code clearer to read, especially when you’re adding in additional optional parameters.

df.sample(5)

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
1925	New Visitor	(direct)	(none)	Safari	mobile	2020-08-08	1
1013	New Visitor	(direct)	(none)	Internet Explorer	desktop	2020-08-10	1
2494	New Visitor	bing	organic	Amazon Silk	tablet	2020-08-10	1
8632	New Visitor	google	organic	Chrome	mobile	2020-07-23	1
2246	New Visitor	(direct)	(none)	Samsung Internet	mobile	2020-07-26	1
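
Notice that the sampled rows keep their original index labels. As a rough sketch on a small made-up dataframe (the `small_df` name and its values are assumptions for illustration), you can confirm that the positional and keyword forms behave the same and, if you want a clean 0-based index, reset it afterwards.

```python
import pandas as pd

# Hypothetical six-row dataframe used only for this illustration.
small_df = pd.DataFrame({
    "Browser": ["Chrome", "Safari", "Firefox", "Edge", "Opera", "Chrome"],
    "Pageviews": [3, 1, 4, 1, 5, 9],
})

# With the same seed, the positional and keyword forms pick the same rows.
positional = small_df.sample(3, random_state=0)
keyword = small_df.sample(n=3, random_state=0)
print(positional.equals(keyword))  # True

# The sample keeps the original index labels; reset them if you prefer 0, 1, 2.
tidy = keyword.reset_index(drop=True)
print(tidy)
```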

Show a random sample column

By default, the sample() function will return a random sample of rows. To return a random sample of columns, use the axis parameter and pass in 1. Calling df.sample(axis=1) will return a single random column.

df.sample(axis=1)

	Device Category
0	mobile
1	mobile
2	tablet
3	tablet
4	tablet
...	...
9995	mobile
9996	mobile
9997	mobile
9998	mobile
9999	mobile

10000 rows × 1 columns

Show a random sample of columns

To return a random sample of columns from the dataframe, you need to pass in a number to the n parameter and set axis to 1. For example, to return a random sample of 3 columns, you would use the following code:

df.sample(3, axis=1)

	Medium	User Type	Device Category
0	(none)	New Visitor	mobile
1	(none)	New Visitor	mobile
2	(none)	New Visitor	tablet
3	(none)	New Visitor	tablet
4	(none)	New Visitor	tablet
...	...	...	...
9995	organic	New Visitor	mobile
9996	organic	New Visitor	mobile
9997	organic	New Visitor	mobile
9998	organic	New Visitor	mobile
9999	organic	New Visitor	mobile

10000 rows × 3 columns
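
One thing to watch when sampling columns is that, without replacement, you can’t ask for more columns than the dataframe has. The sketch below uses a small made-up dataframe (`wide_df` is a hypothetical name) to show both the normal case and the error you would get.

```python
import pandas as pd

# Hypothetical dataframe with four columns and three rows.
wide_df = pd.DataFrame({
    "Source": ["google", "bing", "(direct)"],
    "Medium": ["organic", "organic", "(none)"],
    "Browser": ["Chrome", "Safari", "Edge"],
    "Pageviews": [2, 1, 3],
})

# Pick two random columns; every row is kept.
two_cols = wide_df.sample(n=2, axis=1)
print(two_cols.shape)  # (3, 2)

# Asking for more columns than exist raises a ValueError when replace=False.
try:
    wide_df.sample(n=5, axis=1)
except ValueError as err:
    print(f"Could not sample: {err}")
```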

Using random_state to get a consistent sample

The Pandas sample() function also supports a random_state parameter. This is much like the random_state parameter used when calling the scikit-learn train_test_split() function in that it controls whether you get a consistent, reproducible sample each time the function runs.

First, let’s see what happens when you don’t set a random_state and call the df.sample(1) method twice. You’ll notice that we get a different random row each time, rather than the same random row.

df.sample(1)

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
4247	New Visitor	google	organic	Chrome	desktop	2020-07-30	1

df.sample(1)

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
1984	New Visitor	(direct)	(none)	Safari	tablet	2020-08-10	1

To get a consistent and reproducible random sample, where the same random row appears every time you run the code, you simply pass an integer seed value to the random_state parameter. Setting a different number will give you a different random row.

df.sample(1, random_state=1)

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
9953	New Visitor	google	organic	Chrome	mobile	2020-07-24	1

df.sample(1, random_state=1)

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
9953	New Visitor	google	organic	Chrome	mobile	2020-07-24	1
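
If you want to check reproducibility in code rather than by eye, a quick sketch like the one below compares two seeded samples. The `demo_df` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical five-row dataframe used only for this illustration.
demo_df = pd.DataFrame({
    "Browser": ["Chrome", "Safari", "Firefox", "Edge", "Opera"],
    "Pageviews": [3, 1, 4, 1, 5],
})

# Two calls with the same seed return identical rows in the same order.
first = demo_df.sample(n=2, random_state=1)
second = demo_df.sample(n=2, random_state=1)
print(first.equals(second))  # True
```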

Using frac to get a fraction of sample data

Another really useful parameter of the sample() function is frac. This allows you to specify the fraction of the data you want to sample. For example, if you want to sample 30% of the data, you can use frac=0.3. It’s a great way to quickly create a random sample of data.

When using frac you must not also specify the n parameter, as passing both raises a ValueError. To get reproducible results between code runs, you can also optionally pass in the random_state.

df_validation = df.sample(frac=0.3)
df_validation.head()

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
1625	New Visitor	(direct)	(none)	Safari	mobile	2020-08-10	1
7173	New Visitor	google	organic	Chrome	mobile	2020-07-18	2
7395	New Visitor	google	organic	Chrome	mobile	2020-08-08	4
5619	New Visitor	google	organic	Chrome	desktop	2020-08-10	1
1272	New Visitor	(direct)	(none)	Safari	desktop	2020-07-18	1
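
A common follow-up is to keep the rows that were not sampled, for example as a training set alongside a validation set. The sketch below shows one way to do that by dropping the sampled index labels. It assumes the index is unique, and the `rows_df`, `validation_df`, and `train_df` names are made up for illustration.

```python
import pandas as pd

# Hypothetical ten-row dataframe with a unique default index.
rows_df = pd.DataFrame({"Pageviews": range(10)})

# Take 30% of the rows as a validation set.
validation_df = rows_df.sample(frac=0.3, random_state=1)

# Everything not sampled becomes the training set.
train_df = rows_df.drop(validation_df.index)
print(len(validation_df), len(train_df))  # 3 7

# frac=1 returns every row in a random order, which is a handy way to shuffle.
shuffled_df = rows_df.sample(frac=1, random_state=1)
```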

Use replace to allow repeated samples

By default, when you call sample() on a DataFrame, it will not allow repeated samples. Therefore, you’ll only get each row appearing once within the random sample of data returned. If you want to allow sample() to return repeated sample rows, you can use the replace parameter.

For example, to get back five random rows, where there’s a statistical chance that the same row can appear multiple times in the sample dataset, you’d call df.sample(n=5, replace=True). It doesn’t guarantee that any rows will appear more than once, but it doesn’t prevent them from reappearing as the default of replace=False does.

df.sample(n=5, replace=True)

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
4697	New Visitor	google	organic	Chrome	desktop	2020-08-01	1
4151	New Visitor	google	organic	Chrome	desktop	2020-08-02	1
9626	New Visitor	google	organic	Chrome	mobile	2020-08-02	1
1100	New Visitor	(direct)	(none)	Safari	desktop	2020-07-19	1
1308	New Visitor	(direct)	(none)	Safari	desktop	2020-07-19	1
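
Sampling with replacement also lets you draw more rows than the dataframe contains, which is not possible with the default replace=False. Here is a minimal sketch on a made-up three-row dataframe (`tiny_df` is a hypothetical name).

```python
import pandas as pd

# Hypothetical three-row dataframe used only for this illustration.
tiny_df = pd.DataFrame({"Browser": ["Chrome", "Safari", "Edge"]})

# Six draws from three rows: at least one row has to repeat.
oversampled = tiny_df.sample(n=6, replace=True, random_state=0)
print(oversampled.index.duplicated().any())  # True

# Without replacement the same request fails.
try:
    tiny_df.sample(n=6)
except ValueError as err:
    print(f"Could not sample: {err}")
```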

Setting weights so some rows are more likely to be sampled

The weights parameter can be used to set the probability of sampling a row. Basically, this means you can control the probability with which certain values get sampled, so some are more likely to appear in the sample dataset than others.

This works slightly differently depending on whether the Pandas column values you want to weight are categorical dtypes or numeric dtypes. On a numeric variable, you’d use code like df.sample(n=3, weights='orders'), where orders is the name of a numeric column, and sample() will favour those rows with a higher value in that column. The dataset used here has no orders column, but Pageviews would work in the same way.

If you’re running this on a categorical variable column, the easiest way to do it is to use a Pandas groupby() and the transform() function to calculate a value. For example, we can use df.groupby('Browser')['Browser'].transform('count') to calculate the number of occurrences, and then pass that to weights.

df2 = df.sample(n=5, weights=df.groupby('Browser')['Browser'].transform('count'))
df2

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
615	New Visitor	(direct)	(none)	Chrome	tablet	2020-07-15	1
7334	New Visitor	google	organic	Chrome	mobile	2020-07-24	1
9294	New Visitor	google	organic	Chrome	mobile	2020-08-12	1
442	New Visitor	(direct)	(none)	Chrome	mobile	2020-08-01	1
6346	New Visitor	google	organic	Chrome	desktop	2020-08-04	1
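
To see both kinds of weighting side by side, here is a small sketch on made-up data (`views_df` and its values are assumptions for illustration). A row with a weight of zero can never be picked, which makes the effect of the numeric weighting easy to verify.

```python
import pandas as pd

# Hypothetical dataframe; the first row has zero pageviews.
views_df = pd.DataFrame({
    "Browser": ["Edge", "Chrome", "Chrome", "Chrome", "Safari"],
    "Pageviews": [0, 5, 1, 1, 3],
})

# Numeric weighting: pass the column name, and higher values are favoured.
by_views = views_df.sample(n=3, weights="Pageviews", random_state=0)
print(0 in by_views.index)  # False, because row 0 has a weight of zero

# Categorical weighting: weight each row by how often its browser appears.
browser_counts = views_df.groupby("Browser")["Browser"].transform("count")
by_browser = views_df.sample(n=3, weights=browser_counts, random_state=0)
print(by_browser)
```

The weights do not need to sum to 1, as Pandas normalises them before sampling.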

Matt Clarke, Sunday, November 27, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.