For years, I used to spend much of my time performing Exploratory Data Analysis directly in SQL. Over time, the queries I wrote became very complicated, and it was often a struggle to make them run efficiently. Then I discovered that I could run SQL queries in Jupyter notebooks via Pandas and SQLAlchemy, and the way I worked changed forever.
Rather than writing one, slow, and hideously complex query in SQL, I found that I could write a series of shorter, faster, simpler queries in SQL and execute them independently via Pandas in a Jupyter notebook, allowing me to manipulate and join the selected data to a range of other sources.
Not only that, but the Jupyter notebook meant that I could add commentary to my analysis, generate useful plots, calculate things that are much harder to do in SQL, and create models and complex analyses. If you work with SQL regularly, I’d highly recommend switching to Jupyter and Pandas for your EDA work if you want to become more efficient. Here’s how it’s done.
Our aim is to run SQL queries via an SQLAlchemy connection using Pandas to execute the query and display the data in a dataframe. Open a Jupyter notebook and load the packages below. You’ll need to install the
pymysql packages, which you can do by typing
pip3 install package-name into your terminal.
I’m assuming that you’re using MySQL as your database, and that this is already set up and running. If you need to set up a local MySQL server, the easiest way to do this is to do this via Docker. If you prefer to connect to a remote MySQL server, you can do that via an SSH tunnel.
import pandas as pd from sqlalchemy import create_engine
create_engine() function in SQLAlchemy allows you to connect to a wide range of different database platforms, providing you have the necessary driver (like
pymysql) installed. To create a connection you need to configure a connection string like the one below.
I’m connecting to
mysql using the
pymysql driver, as the
root user with the password
password-here, on the IP address
172.17.0.3 and port
3306, and the database name
database-here. Once connected, we’ll store the connection in an object called
engine so we can pass it to each of the SQLAlchemy functions we run.
engine = create_engine('mysql+pymysql://root:firstname.lastname@example.org:3306/database-here')
Now we have a working database connection, we can create an SQL statement and execute it on the MySQL server using the Pandas
read_sql() function. We’ll pass in the
engine object above to the
con argument and return our results in a dataframe and display the first five rows using
df = pd.read_sql("SELECT * FROM brands", con=engine) df.head()
If you want to run a more complex SQL query which spans multiple lines, you can use triple quotes instead of single quotes. This allows you to format the query in a more readable manner, which will make your code easier to review and debug. For additional readability, it’s a good idea to place larger queries into variables, or even separate files and load them in.
query = """ SELECT shoporders.id AS order_id, shoporders.siteid AS site_id, shoporders.customerid AS customer_id, shoporders.amazon_orderid AS amazon_order_id, shoporders.ordertotal AS total_revenue, shoporders.datecreated AS date_created FROM shoporders LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid LEFT JOIN countries ON countries.id = shoporders.deliverycountry """
df = pd.read_sql(query, con=engine) df.head()
As with regular SQL queries, you can “bind” placeholder parameters to named variables to create dynamic queries. Here, we’ll create a bound parameter called
%s in our query to hold a
query = """ SELECT shoporders.id AS order_id, shoporders.siteid AS site_id, shoporders.customerid AS customer_id, shoporders.amazon_orderid AS amazon_order_id, shoporders.ordertotal AS total_revenue, shoporders.datecreated AS date_created FROM shoporders LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid LEFT JOIN countries ON countries.id = shoporders.deliverycountry WHERE shoporders.customerid = %s """
To pass the bound parameter to the query and replace
%s with our value, we’ll assign the value 166958 to
customer_id, and then pass this to the
params argument. This can take a
customer_id = "166958" df = pd.read_sql(query, con=engine, params=(['customer_id'])) df.tail()
Now you’ve extracted data from MySQL using a
SELECT query, you’re able to manipulate the dataframe just as you would one that’s been loaded from any other source. This means you can clean it, merge it, connect it to other tables or other data sources, and save it out to almost any file type you like, all using the functions built into Pandas.
Matt Clarke, Monday, March 08, 2021