Loading Python and Virtual Environments
Overview
Teaching: 30 min
Exercises: 10 minQuestions
How do I run python on the cluster?
How do I install python packages?
Objectives
Be able to load a specific version of Python
Be able to create and activate virtual environments
Be able to install packages in a virtual environment
If you’ve learned python using Jupyter notebooks, you may have never run python on the
command line or seen a virtual environment before. Quite often packages like pandas
and
numpy
are “just there” when we use online notebook services like Colab.
On a cluster, we have many options for how we want to run Python. We have a system called “modules” (or “environment modules”, or it’s proper name “lmod”) that help us choose which Python version we want to use. On a cluster, it’s our responsibility to ensure that we have the packages we need to run our Python code.
This is what we will cover in this section.
Loading Python
When you first log into an HPC cluster, you will have python available to you, but it’s rarely the version you will want.
$ python --version
Python 3.7.7
Unlike your home computer, the Alliance clusters use a modular software stack that allows you to
choose what version of python you would like to use. To use this stack, most actions are initiated
by the module
command. The module load
command, followed by the name of the software you want
(and possibly the version) gives you access to that software. For example, we can load Python 3.11
with:
module load python/3.11
To find out the available version of python:
module spider python
Note: the module load command doesn’t do any permanent changes to your account, and the next time you log in things will be the same as before. You will need to load the python module again if you want to use it.
Python Packages
Once python is loaded, we need to ensure that we have the packages we need to run our python code.
We can see if we have the pandas
package installed by starting up the
Python console:
python
Now try to import pandas
:
import pandas
You likely got an error saying that pandas
couldn’t be found
(Ctrl-D
to exit the console). It’s our responsibility to get it.
Virtual environments provide a nice solution to keeping the packages you
need for any particular program isolated from your account and the system
install of python. If you have two different projects that use different
Python versions and different dependencies, no problem: you can
create two different virtual environments that you can turn on of off as
needed. The tool pip
is used to install packages that don’t come
with the main Python install (pip
stands for “pip
installs packages”).
Note: on most clusters, the Anaconda distribution of Python is not supported (nor does it work). If you would like to know why, check out this document:
https://docs.alliancecan.ca/wiki/Anaconda/en#Do_not_install_Anaconda_on_our_clusters
You create a virtualenvironment with:
virtualenv --no-download venv
(Here venv
is a name for the virtual environment, and will be created on disk as a folder.)
To use a virtual environment, you need to activate it. This is done with
source venv/bin/activate
Notice how your command prompt now has the name of the virtual environment. We can now use pip to start downloading packages. The first thing we should do is to upgrade pip.
pip install --upgrade pip
Later versions of pip
do a better job of handling package requirements and dependencies,
so this is why this step is important.
Some important considerations
The above is fine for your own computer, but in general:
- python packages from PyPI aren’t optimized for the cluster environment. They might be missing parallelization options, or may have been built without vectorization or other optimization flags.
- worker nodes on Alliance clusters almost never have access to the internet to reach PyPI.
For this reason, the Alliance has it’s own wheelhouse that is accessible to all
cluster nodes and has mostly wheels that were built for clusters. To use this instead of
PyPI, we can use the --no-index
flag for pip
, e.g.,
pip install --no-index --upgrade pip
pip install --no-index pandas
If you neglect to include --no-index
when installing with pip
you can run into
real problems where pip
tries to access PyPI but can’t due to lack of internet
access. Your install command might hang forever without completing.
Now start up the Python console and try import pandas
. Did it work?
To see all of the wheels that are in the Alliance wheelhouse, visit this page: https://docs.alliancecan.ca/wiki/Available_Python_wheels
Please note that if there is a python package that you need that you find is not in the Alliance wheelhouse, you can contact support to request that it be added: support@tech.alliancecan.ca.
If the missing package is “pure python”, there is also a chance that you can download a wheel from PyPI (on a login node) and install it directly into your virual environment from your account.
A warning
If you don’t have a virtual environment enabled, pip
will attempt to install packages so they
are available to your entire account. This almost always leads to problems, so it is recommended that
you always have a virtual environment activated when you install packages with pip
.
If you do make the mistake and install in your account, not in a virtual environment,
you can usually find the software installed in the .local
folder in your home.
There are some environment variables that you can use to prevent this (Google for
PYTHONNOUSERSITE
or PIP_REQUIRE_VIRTUALENV
if you are interested).
pip and versions
You’ll notice that when we ran pip install --no-index pandas
,
we didn’t specify a version.
If we want to install a specific version of a package we can do so by using the package name, double equals signs, and the version number. By sure to keep all of this inside of quotation marks (I prefer single quotes):
pip install --no-index 'flask==1.1.2'
Checking which packages are installed
This is done with the pip freeze
command.
pip freeze
numpy==1.25.2+computecanada
pandas==2.1.0+computecanada
python-dateutil==2.8.2+computecanada
pytz==2023.3.post1+computecanada
six==1.16.0+computecanada
tzdata==2023.3+computecanada
Repeatability through requirements
Sometimes you want to ensure that you use the same versions of your packages each time you run your python code, on whatever cluster we are running on.
We can use the output from pip freeze
and send the output to a file
(the convention is to call this file requirements.txt
, but the file could
be named anything you want):
pip freeze > requirements.txt
Now the next time we create a virtual environment, we can use this file
to populate the packages with the -r
flag to pip.
pip install --no-index -r requirements.txt
Deactivating a virtual environment
When you are done using a virtual environment, or you want to activate a different one, run
deactivate
Notice how your command prompt changes.
Note: you can’t have more than one virtual environment active at a time.
requirements.txt
example
Let’s create a second virtual environment called venv2
:
module load python/3.11
virtualenv --no-download venv2
source venv2/bin/activate
pip install --upgrade --no-index pip
pip install --no-index -r requirements.txt
# check if it works
deactivate
scipy-stack
: An alternative to a virtual environment
If you are using some common data science packages, there is a module in the Alliance
software stack that contains many of them: scipy-stack
.
We can load the scipy-stack
as follows:
module load python/3.11
module load scipy-stack
As of this writing, the version of scipy-stack
that gets loaded is 2023b
.
You can be explicit about this:
module load python/3.11
module load scipy-stack/2023b
If you would like to know which packages are loaded, check out module spider scipy-stack/2023b
or do a pip freeze
(although you will also see some packages that come with the python
module).
Unloading a module
One you’ve deactivated a virtual environment, you might decide you want to work with a different version of Python. You can unload the Python module with:
module unload python/3.11
Note that the following also works:
module unload python
Sometimes you realize that you want to reset all of the software modules back to the defaults. One way to do this is to log out and back into the cluster. More efficient though:
module reset
Your turn …
Creating virtual environments
Now that we have a clean setup (virtual environments are deactivated and modules are reset), try the following on your own:
- Load Python 3.10
- Create a virtual environment and activate it (careful to choose a new name or work in a new directory, since we have already used
venv
andvenv2
)- Upgrade
pip
- Install the packages
dask
anddistributed
(version1.28.1
).- Create a requirements file (e.g.,
requirements2.txt
).- Deactivate your virtual environment.
- Create a second virtual environment (and activate it!) and use your requirements file to populate it.
Note: Each previous step should be done to do the next step:
Solution
# First virtual environment module load python/3.10 virtualenv --no-download venv3 source venv3/bin/activate pip install --no-index --upgrade pip pip install --no-index dask 'distributed==1.28.1' pip freeze > requirements2.txt deactivate # Second virtual environment module load python/3.10 virtualenv --no-download venv4 source venv4/bin/activate pip install --no-index --upgrade pip pip install --no-index -r requirements2.txt
Key Points
Python is loaded on the cluster through modules
Virtual environments store a version of python and some packages
Package requirements ensure consistency for versions