One of the most annoying aspects of working with GPU-accelerated data science software, such as NVIDIA RAPIDS, TensorFlow, PyTorch, and XGBoost, is how complicated and time-consuming it can be to get all of your drivers and packages working together properly.
Over the past year I’ve set up several Ubuntu data science workstations and have encountered various issues in getting everything set up correctly. I was hoping that NVIDIA might release its own Linux data science distro. However, I recently found that they’d already created the next best thing: the NVIDIA Data Science Stack.
The Data Science Stack is a shell script tool for Linux that can be used to set up the right NVIDIA GPU drivers and GPU-accelerated data science packages, and it removes much of the hassle from the manual approach. It can be used on Ubuntu 18.04 LTS, Ubuntu 20.04 LTS, and Red Hat Enterprise Linux (RHEL) 7.5+ or 8.x.
The Data Science Stack can be used to install and configure dozens of commonly used data science packages.
After recently taking delivery of a new Dell Precision 7750 portable data science workstation, I thought I’d give it a try. Here’s how you use it.
After creating a fresh install of Ubuntu Linux 20.04 LTS on my Precision 7750, I installed git and cloned the NVIDIA Data Science Stack repository to a directory called Data, which I placed in my home directory.
sudo apt install git
cd ~
mkdir Data
cd Data
git clone https://github.com/NVIDIA/data-science-stack
cd data-science-stack
To run the Data Science Stack installation script, enter ./data-science-stack and pass the setup-system argument. This will set up the correct NVIDIA graphics drivers, currently 455.23.04, and the CUDA 11.0.228 libraries that allow data science software to utilise NVIDIA GPUs.
./data-science-stack setup-system
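Once setup-system finishes, it’s worth checking that the driver actually loaded. Here’s a minimal sketch of how you might do that, assuming the nvidia-smi utility was installed alongside the driver (it normally ships with it). The report_driver function name is my own and isn’t part of the Data Science Stack.

```shell
# Print the GPU name and driver version, or a hint if the driver tools are missing.
# report_driver is a hypothetical helper name, not part of the Data Science Stack.
report_driver() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    else
        echo "nvidia-smi not found: the driver may not be installed, or a reboot may be needed"
        return 1
    fi
}

report_driver || true
```

If the version printed doesn’t match what the script said it installed, a reboot is usually the first thing to try.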
The next step is to set up your user account using the setup-user argument. After running this, running gnome-session-quit --no-prompt will log you out of your GNOME session without a confirmation prompt, so you can log straight back in and pick up the changes.
./data-science-stack setup-user
gnome-session-quit --no-prompt
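After logging back in, you can check whether your session picked up the change. This is only a sketch, and it assumes that setup-user works by adding your account to the docker group (which would explain why a fresh login is needed). The session_has_group helper is a name I’ve made up for illustration.

```shell
# Return success if the current login session belongs to the named group.
# session_has_group is a hypothetical helper, not part of the Data Science Stack.
session_has_group() {
    # "id -nG" lists the group names of the current process, one line each after tr.
    id -nG 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$1"
}

# Assumption: setup-user adds your account to the "docker" group.
if session_has_group docker; then
    echo "This session is in the docker group"
else
    echo "This session is not in the docker group yet: log out and back in, then retry"
fi
```

Group membership is read at login, which is why the check looks at the current session rather than at the system’s group file.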
Next we will build a containerised environment containing all the data science applications we need. To build a container, pass the build-container argument.
This will set up Docker CE and NVIDIA Docker 2 and install a metric shit-ton of packages that allow you to do GPU-accelerated stuff. If you prefer to use Conda, you can pass the build-conda-env argument instead.
./data-science-stack build-container
As the build-container process is very intensive, it takes a long time to complete, maybe 30-60 minutes. It runs without requiring user input, so you can go for a walk while it’s running.
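When you get back from your walk, you can confirm that the build produced an image. The sketch below uses docker image inspect, which exits with an error when the image doesn’t exist. The image_exists helper, the DSS_IMAGE variable, and the placeholder image name are all my own assumptions, so swap in the repository and tag that docker images shows on your machine.

```shell
# Return success if a Docker image with the given name:tag exists locally.
# image_exists is a hypothetical helper, not part of the Data Science Stack.
image_exists() {
    docker image inspect "$1" >/dev/null 2>&1
}

# Hypothetical placeholder: replace with the repository:tag listed by "docker images".
IMAGE="${DSS_IMAGE:-example-image:example-tag}"

if image_exists "$IMAGE"; then
    echo "Found image: $IMAGE"
else
    echo "Image not found: $IMAGE (run 'docker images' to list what was built)"
fi
```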
To run the NVIDIA Docker container, pass the run-container argument to the data-science-stack script. This will fire up the container and start a Jupyter Lab environment from which you can run your own Jupyter notebooks, or some of the built-in GPU-accelerated example scripts.
./data-science-stack run-container
To access Jupyter Lab running in the NVIDIA Docker container, visit http://localhost:8888/ in your web browser. The CLI may give you a URL such as http://a3682313c15e:8888/, but these often don’t work on Ubuntu, most likely because the hostname is the container’s ID, which the host can’t resolve.
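Jupyter Lab can take a few seconds to start, so the page may not load straight away. Here’s a rough sketch of a helper that polls the URL with curl until it responds. The wait_for_url name and its arguments are mine, and it assumes curl is installed, which isn’t guaranteed on a fresh Ubuntu install.

```shell
# Poll a URL until it responds or the attempts run out.
# Arguments: URL, number of attempts, seconds to wait between attempts.
# wait_for_url is a hypothetical helper, not part of the Data Science Stack.
wait_for_url() {
    url=$1
    attempts=$2
    delay=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f treats HTTP errors (4xx/5xx) as failures; --max-time bounds each try.
        if curl -s -f -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example: wait up to about a minute for Jupyter Lab, then report the result.
# wait_for_url http://localhost:8888/ 30 2 && echo "Jupyter Lab is up"
```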
To get the NVIDIA Data Science Stack to map your local notebooks directory to the one loaded by the Docker container, you’ll need to edit the data-science-stack shell script.
Look for the docker run command towards the bottom of the file and add -v ~/Data/notebooks:/notebooks just after docker run, where Data/notebooks is the directory in your home directory containing your Jupyter notebooks.
docker run -v ~/Data/notebooks:/notebooks --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 ${ENVIRONMENT_NAME}:${STACK_VERSION}
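One thing to watch for: if the host directory doesn’t exist when the container starts, Docker creates it for you, owned by root, which makes it awkward to save notebooks into later. The sketch below creates the directory first and prints the -v argument using an absolute path. The volume_arg helper is a name I’ve invented for illustration.

```shell
# Build a "-v host:container" bind-mount argument, creating the host directory first.
# volume_arg is a hypothetical helper, not part of the Data Science Stack.
volume_arg() {
    host_dir=$1
    container_dir=$2
    # Create the directory as the current user so Docker does not create it as root.
    mkdir -p "$host_dir" || return 1
    # Resolve to an absolute path so the mount does not rely on "~" expansion.
    abs_dir=$(cd "$host_dir" && pwd) || return 1
    printf '%s\n' "-v ${abs_dir}:${container_dir}"
}

# Example: print the argument to paste after "docker run" in the script.
# volume_arg "$HOME/Data/notebooks" /notebooks
```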
Finally, run ./data-science-stack run-container and the container will start up with your local notebooks folder mapped into the Docker container.
Canonical released its latest long-term support (LTS) release, Ubuntu 22.04 LTS, on April 21, 2022. Unfortunately, as of July 2022, the NVIDIA Data Science Stack still doesn’t officially support it, so users will need to remain (as I am) on the older Ubuntu 20.04 LTS release (Focal Fossa), or attempt to install packages themselves.
A number of NVIDIA Data Science Stack users have already requested that NVIDIA add support for Ubuntu 22.04 LTS, but it hasn’t been added yet.
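If you’re not sure whether a machine is on a supported release, you can check before cloning anything. This sketch reads /etc/os-release and compares it against the releases listed earlier in this post (Ubuntu 18.04 and 20.04, RHEL 7.5+ and 8.x). The is_supported_release helper is my own, and the list is hard-coded from this article rather than taken from NVIDIA’s script, so treat it as a rough guide.

```shell
# Return success if the distro ID and version match the releases listed in this post.
# is_supported_release is a hypothetical helper, not part of the Data Science Stack.
is_supported_release() {
    case "$1:$2" in
        ubuntu:18.04|ubuntu:20.04) return 0 ;;
        rhel:7.[5-9]|rhel:8|rhel:8.*) return 0 ;;
        *) return 1 ;;
    esac
}

# /etc/os-release defines ID and VERSION_ID on most modern Linux distributions.
if [ -r /etc/os-release ]; then
    . /etc/os-release
    if is_supported_release "${ID:-unknown}" "${VERSION_ID:-}"; then
        echo "${ID:-unknown} ${VERSION_ID:-} is on the supported list"
    else
        echo "${ID:-unknown} ${VERSION_ID:-} is not on the supported list"
    fi
fi
```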
Matt Clarke, Sunday, March 07, 2021