Data Engineering

12 articles and tutorials on data engineering

How to Dockerize a data science application

The phrase “It works on my machine” is heard in most offices with a data science department. It’s very common to write an application that works perfectly in...

How to back up a MySQL database using mysqldump, SSH and SCP

If you need to create a backup of a remote MySQL database, you can use the mysqldump command. The mysqldump application is a client utility and is installed alongside...

How to create a Google Service Account client secrets JSON key

The Google Cloud Platform offers a variety of ways for users, or applications, to authenticate themselves in order to gain access to data. For Python developers, one of the most...

The difference between data scientists and data engineers

The growing need for data engineers, as well as data scientists, is pushing salaries even higher, making the two positions among the most highly paid in...

How to use Google Secret Manager to improve data security

Google Cloud Functions make it easy to deploy Python data science applications and models in the cloud as serverless applications. Since it’s inevitable that these applications need to access sensitive...

How to import data into Google Data Studio using Python

Google Data Studio has native support for a range of platforms, but there’s no reliable means of pushing data in from Python without going via another data source. Google BigQuery...

How to import data into BigQuery using Pandas and MySQL

Google BigQuery is a “serverless” data warehouse platform hosted on the Google Cloud Platform. The serverless approach means you don’t have to maintain a server yourself and Google looks after...

How to create a BI platform using Apache Superset

Apache Superset is a new “enterprise-ready” web application for building business intelligence (BI) applications and dashboards. Developed at Airbnb using the Flask Python framework, React JS,...

How to use Apache Druid for real-time analytics data storage

Apache Druid is described as a high-performance, real-time analytics database and was developed at Metamarkets in 2011 for their internal analytics system. Unlike traditional relational databases, such as...

How to set up a Docker container for your MySQL server

Like most people who work in ecommerce data science, I regularly need to access data stored in a database - usually MySQL or MariaDB, but sometimes also MSSQL. Although it...

How to create ecommerce data pipelines in Apache Airflow

Like Apache Superset, Apache Airflow was developed by the engineering team at Airbnb and was open sourced in 2014. It’s a Python-based platform designed to make it easier to create,...

How to use Docker for your data science projects

How many times have you struggled to get Python packages like TensorFlow, Keras, or PyTorch working together? How many times have you downloaded code or shared yours with others only...