There are hundreds of excellent Python data science libraries and packages that you’ll encounter when working on data science projects. However, there are four of them that you’ll probably use on a daily basis and in almost every project you work on.
The absolute number one Python data science library you need to learn is Pandas. Pandas, or the Python Data Analysis Library, is used for loading, manipulating, and analysing tabular data. It’s a bit like Microsoft Excel, but much, much more powerful.
It lets you load or import data from a wide range of different data sources, including CSV, Excel, JSON, SAS, SPSS, and databases such as MySQL, PostgreSQL, or BigQuery, and stores it in a table-like structure called a DataFrame. Once the data is loaded into a DataFrame, you can clean it, modify it, filter or search it, and join it to other data sets, so it can be visualised, analysed, or used within a model.
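To give a flavour of that workflow, here’s a minimal sketch using a small hypothetical data set built in code (in a real project you’d more likely start with something like `pd.read_csv()`):

```python
import pandas as pd

# A hypothetical data set of customer orders
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["alice", "bob", "alice", "carol"],
    "value": [25.0, 40.0, 15.0, 60.0],
})

# Filter rows, a bit like a WHERE clause or an Excel filter
big_orders = orders[orders["value"] > 20]

# Join to another data set on a shared column
regions = pd.DataFrame({
    "customer": ["alice", "bob", "carol"],
    "region": ["UK", "US", "DE"],
})
merged = orders.merge(regions, on="customer", how="left")

print(big_orders)
print(merged)
```

The column names and data here are invented for illustration, but the pattern of filtering with a boolean condition and joining with `merge()` is the same in almost every Pandas project.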
Most data scientists use Pandas in every project they create, so learning Pandas is without doubt the most important thing you’ll want to cover when starting out. It’s a really powerful tool, especially when combined with the Jupyter Notebook. Using these tools together will undoubtedly change the way you work forever.
Second to Pandas is NumPy, a numerical computing library for Python. NumPy, or Numerical Python, is very powerful and includes a massive array of features that make complex mathematical operations much easier. NumPy is actually a core component of Pandas, Matplotlib, Seaborn, and scikit-learn, so you’ll likely be using it unknowingly when you perform any data science task.
With Pandas being so powerful, and already providing access to NumPy code, I tend to find that I only rarely need to use raw NumPy, so learning the basics will likely give you the skills you need for most projects. A few basic tutorials on the key functions and the way NumPy arrays work should be enough to get you up and running.
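Those basics boil down to creating arrays and applying vectorised operations to them. A quick sketch, using made-up numbers:

```python
import numpy as np

# Vectorised arithmetic: operations apply element-wise, no loops needed
prices = np.array([10.0, 20.0, 30.0])
with_tax = prices * 1.2

# Broadcasting stretches a scalar across the whole array
discounted = prices - 5

# Aggregations work across the array in one call
total = prices.sum()
average = prices.mean()

print(with_tax)
print(total, average)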
One of the key stages in any data science project is called Exploratory Data Analysis or EDA. This involves digging into the data to understand the statistical distributions, examine correlations and relationships between the data, look for outliers, and identify potential new features to engineer. To do any of these, and for creating any data-led reports, you’ll need to use a data visualisation library.
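In practice, a lot of that exploration starts with a couple of Pandas one-liners. Here’s a minimal sketch on a hypothetical data set, using `describe()` for summary statistics and `corr()` for pairwise correlations:

```python
import pandas as pd

# A hypothetical data set for exploration
df = pd.DataFrame({
    "age": [23, 35, 45, 29, 52, 41],
    "income": [21000, 34000, 48000, 27000, 61000, 45000],
})

# Summary statistics: count, mean, std, quartiles, min and max
summary = df.describe()

# Pairwise correlations between the numeric columns
correlations = df.corr()

print(summary)
print(correlations)
```

On real data you’d follow these up with plots of the distributions and a closer look at any outliers the summary statistics reveal.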
Most of the Python data visualisation libraries are based on the Matplotlib framework. However, this is quite verbose, so Pandas includes some common functionality built-in, while other packages, such as Seaborn, let you create Matplotlib code with greater ease via helpful “wrapper” functions. This means you can just learn a few bits of basic Seaborn, and some basic Matplotlib, and you’ll still be able to create visually appealing charts, plots, and graphs.
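Here’s a rough sketch of that division of labour: one line of Seaborn does the heavy lifting, with a couple of lines of plain Matplotlib for the finishing touches. The data is invented for illustration, and the `Agg` backend is set so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A hypothetical data set to plot
df = pd.DataFrame({
    "age": [23, 35, 45, 29, 52, 41],
    "income": [21000, 34000, 48000, 27000, 61000, 45000],
})

# One line of Seaborn produces a styled Matplotlib scatter plot
ax = sns.scatterplot(data=df, x="age", y="income")

# Drop down to basic Matplotlib for titles and saving
ax.set_title("Income by age")
plt.savefig("income_by_age.png")
```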
Finally, there is scikit-learn, or sklearn as it’s also known. This is an amazing Python data science library which includes a massive range of pre-written algorithms covering almost every type of machine learning application you can imagine. All scikit-learn algorithms can be accessed and used in a standardised syntax, which means they’re all relatively simple to pick up, once you know the basics.
scikit-learn works alongside Pandas and NumPy, so you’ll likely use all three at once, performing the data loading, manipulation, and cleansing in Pandas, doing some numerical work in NumPy, and then running the model in scikit-learn.
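That standardised syntax is essentially `fit()` to train and `predict()` to score new data, whichever algorithm you choose. A minimal sketch with hypothetical data, using a simple linear regression:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: predict income from age
df = pd.DataFrame({
    "age": [23, 35, 45, 29, 52, 41, 38, 60],
    "income": [21000, 34000, 48000, 27000, 61000, 45000, 39000, 70000],
})

X = df[["age"]]   # features, prepared in Pandas
y = df["income"]  # target

# Hold back some data to evaluate the model on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Every scikit-learn estimator follows the same fit/predict pattern
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(model.score(X_test, y_test))  # R² on the held-out data
```

Swap `LinearRegression` for any other regressor and the rest of the code stays exactly the same, which is what makes scikit-learn so quick to pick up.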
This library is an incredible tool for data scientists, because it provides access to peer-reviewed algorithm code that you can use in your projects without the need to try to implement the algorithms yourself. It has allowed data scientists to employ powerful model selection steps where they can test a wide range of different algorithms on their data to find the most effective, something that just wouldn’t have been viable before.
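Because every estimator shares the same interface, that model selection step can be written as a simple loop. A sketch using synthetic data so the example is self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data generated by scikit-learn itself
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Trying several algorithms is just a loop over interchangeable estimators
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean R² across five cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(name, round(score, 3))
```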
While there are loads of other Python data science libraries, it’s Pandas, NumPy, Matplotlib and Seaborn, and scikit-learn that you’re likely to use almost every day. Master these four packages first, and you’ll be able to perform the vast majority of common data science tasks.
There are loads of great websites that provide excellent guidance on using these packages - Machine Learning Mastery is my particular favourite - as well as dozens of great Python data science courses. If you’re looking for a quick way to get to grips with the key things, I’d highly recommend the Kaggle course as a starting point.
Matt Clarke, Sunday, March 07, 2021