The SELECT
statement is the most simple of all SQL queries and allows you to retrieve the precise data you want from one or more tables, or even databases. In this simple tutorial I’ll explain how you can use SELECT
with the FROM
, WHERE
and AND
clauses to fetch a wide range of specific column values from the database tables in a MySQL database.
Before you can start, you’ll need to have access to a MySQL database that you can query. To make this easier to follow along, I’ve created a MySQL database dump file of the popular Online Retail Dataset that you can import into a Docker container.
To get started, you’ll first need to create a MySQL Docker container and then import the SQL database dump to create a new database. Although it sounds scary, the whole process should only take a few minutes and I’ve provided the code you need for every step.
To query the MySQL database on your Docker container we’ll need to install a couple of Python packages. To query a database with Pandas you’ll need an SQL driver - a bit of extra code that connects to the database and passes your SQL queries. We’ll be using sqlalchemy
and pymysql
, so import the packages into your Jupyter environment using the commands below.
!pip3 install sqlalchemy
!pip3 install pymysql
import pandas as pd
from sqlalchemy import create_engine
Next we need to configure our connection to the MySQL Docker container. This is covered in the tutorial on creating the database itself, but basically involves connecting to the IP address of the server, usually on port 3306, and then defining the ecommerce
database to which you want to connect, along with your username and password.
engine = create_engine('mysql+pymysql://root:**********@172.17.0.3:3306/ecommerce')
Now you hopefully have a working database connection, we’ll write the most simple of all SQL queries - the SELECT
statement. The statement SELECT * FROM customers
uses the asterisk wildcard to to MySQL to return all columns in the customers
table.
To execute the SQL statement or query, we’ll pass it to the Pandas read_sql_query()
function along with the engine
connection we created above. Pandas will use SQLAlchemy and PyMSQL to connect to the database, run the query, and return the results in a Pandas dataframe that we’re assigning to the variable df
.
Since that’s just a regular Pandas dataframe, we can manipulate it in the same way we would an other dataframe, for example, by using head()
to print the first five rows. The ecommerce
database contains four database tables called customers
, orders
, order_items
, and products
. You can select the data from these tables simply by changing the database table parameter after FROM
.
df = pd.read_sql_query("SELECT * FROM customers", con=engine)
df.head()
customer_id | revenue | orders | skus | units | first_order_date | last_order_date | |
---|---|---|---|---|---|---|---|
0 | 14646.0 | 279489.02 | 77 | 703 | 196719 | 2010-12-20 10:09:00 | 2011-12-08 12:12:00 |
1 | 18102.0 | 256438.49 | 62 | 151 | 64122 | 2010-12-07 16:42:00 | 2011-12-09 11:50:00 |
2 | 17450.0 | 187482.17 | 55 | 127 | 69029 | 2010-12-07 09:23:00 | 2011-12-01 13:29:00 |
3 | 14911.0 | 132572.62 | 248 | 1794 | 77180 | 2010-12-01 14:05:00 | 2011-12-08 15:54:00 |
4 | 12415.0 | 123725.45 | 26 | 444 | 77242 | 2011-01-06 11:12:00 | 2011-11-15 14:22:00 |
df = pd.read_sql_query("SELECT * FROM orders", con=engine)
df.head()
invoice | revenue | skus | items | |
---|---|---|---|---|
0 | 536365 | 139.12 | 7 | 40 |
1 | 536366 | 22.20 | 2 | 12 |
2 | 536367 | 278.73 | 12 | 83 |
3 | 536368 | 70.05 | 4 | 15 |
4 | 536369 | 17.85 | 1 | 3 |
df = pd.read_sql_query("SELECT * FROM order_items", con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
df = pd.read_sql_query("SELECT * FROM products", con=engine)
df.head()
sku | units | avg_units | revenue | orders | customers | |
---|---|---|---|---|---|---|
0 | 10002 | 1037 | 14.205479 | 759.89 | 73 | 40 |
1 | 10080 | 495 | 20.625000 | 119.09 | 24 | 19 |
2 | 10120 | 193 | 6.433333 | 40.53 | 29 | 25 |
3 | 10123C | -13 | -3.250000 | 3.25 | 4 | 3 |
4 | 10123G | -38 | -38.000000 | 0.00 | 1 | 0 |
The asterisk operator returns all columns from a database table, but you’ll often want to select only specific columns be returned by your query. To do this, you just pass a comma separated value list of column names instead of *
. The order of the columns dictates the order in which they’ll appear in the SQL result set or dataframe.
df = pd.read_sql_query("SELECT customer_id, revenue, orders FROM customers", con=engine)
df.head()
customer_id | revenue | orders | |
---|---|---|---|
0 | 14646.0 | 279489.02 | 77 |
1 | 18102.0 | 256438.49 | 62 |
2 | 17450.0 | 187482.17 | 55 |
3 | 14911.0 | 132572.62 | 248 |
4 | 12415.0 | 123725.45 | 26 |
The very simple SQL statement we wrote above doesn’t include any clauses to tell MySQL to filter the results. To select only specific rows from the database you need to use the WHERE
clause. This takes a column name, i.e. invoice
, and a column value to find, i.e. 536365
. Running the statement SELECT * FROM order_items WHERE invoice = 536365
will, therefore, return all the order_items
columns from invoice
536365
.
df = pd.read_sql_query("SELECT * FROM order_items WHERE invoice = 536365", con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
If you want to create a more complex database query, you’ll often need to combine multiple clauses using AND
. The below query will select all columns from the order_items
table WHERE
the country
is United Kingdom
and the order_date
is greater than 2010-10-01
and the qty
of units was greater than 10
.
query = """
SELECT *
FROM order_items
WHERE country = 'United Kingdom'
AND order_date > '2010-10-01'
AND qty > 10
"""
df = pd.read_sql_query(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2010-12-01 08:34:00 | 1.69 | 13047.0 | United Kingdom | 54.08 |
1 | 536371 | 22086 | PAPER CHAIN KIT 50'S CHRISTMAS | 80 | 2010-12-01 09:00:00 | 2.55 | 13748.0 | United Kingdom | 204.00 |
2 | 536374 | 21258 | VICTORIAN SEWING BOX LARGE | 32 | 2010-12-01 09:09:00 | 10.95 | 15100.0 | United Kingdom | 350.40 |
3 | 536376 | 22114 | HOT WATER BOTTLE TEA AND SYMPATHY | 48 | 2010-12-01 09:32:00 | 3.45 | 15291.0 | United Kingdom | 165.60 |
4 | 536376 | 21733 | RED HANGING HEART T-LIGHT HOLDER | 64 | 2010-12-01 09:32:00 | 2.55 | 15291.0 | United Kingdom | 163.20 |
Sometimes, you might want to perform a fuzzy match and return only items that match a partial string. For example, we might want to look at all the products in the order_items
table that contain the word white
in their product name held in the description
column. To do this we can use LIKE
along with the %
symbol. By putting a percentage symbol either side of the value we’ll return any products that contain the word white
.
query = """
SELECT *
FROM order_items
WHERE description LIKE '%white%'
"""
df = pd.read_sql_query(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
3 | 536373 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 09:02:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
4 | 536373 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 09:02:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
Wrapping two percentage symbols around a string will return all values that contain the string. However, sometimes you might only want to return values that start with a string. To do this, you simply place the percentage symbol at the end.
query = """
SELECT *
FROM order_items
WHERE description LIKE 'white%'
"""
df = pd.read_sql_query(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536373 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 09:02:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
3 | 536373 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 09:02:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
4 | 536375 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 09:32:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
Similarly, to select values that end with a value, you can use LIKE
and place the %
symbol at the start of the string. This will return only those items that end with the provided value.
query = """
SELECT *
FROM order_items
WHERE description LIKE '%white'
"""
df = pd.read_sql_query(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536378 | 85099C | JUMBO BAG BAROQUE BLACK WHITE | 10 | 2010-12-01 09:37:00 | 1.95 | 14688.0 | United Kingdom | 19.50 |
1 | 536386 | 85099C | JUMBO BAG BAROQUE BLACK WHITE | 100 | 2010-12-01 09:57:00 | 1.65 | 16029.0 | United Kingdom | 165.00 |
2 | 536464 | 85099C | JUMBO BAG BAROQUE BLACK WHITE | 1 | 2010-12-01 12:23:00 | 1.95 | 17968.0 | United Kingdom | 1.95 |
3 | 536488 | 22376 | AIRLINE BAG VINTAGE JET SET WHITE | 1 | 2010-12-01 12:31:00 | 4.25 | 17897.0 | United Kingdom | 4.25 |
4 | 536529 | 22222 | CAKE PLATE LOVEBIRD WHITE | 3 | 2010-12-01 13:20:00 | 4.95 | 14237.0 | United Kingdom | 14.85 |
To obtain items that do not contain a given value you can use NOT LIKE
. Here’s a simple example that returns all products that don’t contain the word white
in their product name. You can, of course, use the above approach and return items that start with a string, or end with a string, simply by using the percentage symbol at either end, rather than both ends.
query = """
SELECT *
FROM order_items
WHERE description NOT LIKE '%white%'
"""
df = pd.read_sql_query(query, con=engine)
df.head()
invoice | sku | description | qty | order_date | unit_price | customer_id | country | line_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
1 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
2 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850.0 | United Kingdom | 15.30 |
3 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.25 | 17850.0 | United Kingdom | 25.50 |
4 | 536366 | 22633 | HAND WARMER UNION JACK | 6 | 2010-12-01 08:28:00 | 1.85 | 17850.0 | United Kingdom | 11.10 |
Matt Clarke, Tuesday, April 12, 2022