This is the second part of a multi-part guide on GPU cloud computing for Deep Learning:
- Set Up Amazon Elastic Compute Cloud (EC2)
- Theano on Amazon Web Services for Deep Learning
- Set up Microsoft Azure for CUDA Cloud
This entry demonstrates how you can offload computational tasks to an Amazon Elastic Compute Cloud (EC2) instance through Amazon Web Services (AWS). The guide focuses on CUDA support for Theano.
Requirements
- Can set up an EC2 Instance - see part one
- Familiarity with Linux and Bash, e.g. sudo, wget, export
- Familiarity with Ubuntu's apt-get
Contents
Connect
Connect to the Instance through SSH. Assuming you followed part one, this is just
ssh ubuntu@[DNS]
Load Software
See the references for the sources of these instructions; this code is almost identical, with a few tweaks.
Note that you will have to do this each time you start a new Instance.
You can download this code as install.sh
# update software
sudo apt-get update
sudo apt-get -y dist-upgrade
# install dependencies
sudo apt-get install -y gcc g++ gfortran build-essential \
    git wget linux-image-generic libopenblas-dev \
    python-dev python-pip ipython python-nose \
    python-numpy python-scipy \
    linux-image-extra-virtual \
    gnuplot-qt # a lot quicker than matplotlib for runtime plots
# install bleeding edge theano
sudo pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
# get CUDA
sudo wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
# depackage and install CUDA
sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install -y cuda
# update PATH variables
{
echo -e "\nexport PATH=/usr/local/cuda/bin:\$PATH";
echo -e "export LD_LIBRARY_PATH=/usr/local/cuda/lib64";
} >> ~/.bashrc
# reboot for CUDA
sudo reboot
After waiting about a minute for the reboot, ssh back into the Instance.
You can download this code as cuda_check.sh
# install included samples and test cuda
ver=8.0 # version number -- will get a more robust method in a later edit
echo "CUDA version: ${ver}"
/usr/local/cuda/bin/cuda-install-samples-${ver}.sh ~/
cd NVIDIA_CUDA-${ver}_Samples/1_Utilities/deviceQuery
make
./deviceQuery
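The hard-coded ver variable above can be made more robust by parsing the banner that nvcc --version prints, rather than typing the version in by hand. A minimal sketch (the exact banner wording is an assumption based on the CUDA 8.0 release; the regex only relies on the "release X.Y" fragment):

```python
import re

def parse_cuda_version(banner):
    """Extract 'X.Y' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", banner)
    return match.group(1) if match else None

# example banner line from CUDA 8.0 (wording assumed from that release)
sample = "Cuda compilation tools, release 8.0, V8.0.44"
print(parse_cuda_version(sample))  # 8.0

# on the instance itself you could feed in the real banner:
# import subprocess
# banner = subprocess.check_output(
#     ["/usr/local/cuda/bin/nvcc", "--version"]).decode()
# ver = parse_cuda_version(banner)
```

If the regex finds no "release X.Y" fragment the function returns None, so a missing or broken CUDA install fails loudly rather than silently using the wrong samples directory.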
Make sure the test shows that a GPU exists - common errors are listed in the Common Errors section below. If you don't have a GPU, skip the next step or use a GPU EC2 Instance.
# set up the theano config file to use gpu by default
{
echo -e "\n[global]\nfloatX=float32\ndevice=gpu";
echo -e "mode=FAST_RUN";
echo -e "\n[nvcc]";
echo -e "fastmath=True";
echo -e "\n[cuda]";
echo -e "root=/usr/local/cuda";
} >> ~/.theanorc
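Theano reads ~/.theanorc as an INI-style file, so you can sanity-check the settings with Python's configparser. A self-contained sketch (it parses an equivalent string rather than the file itself; on the instance you would point config.read at ~/.theanorc):

```python
import configparser

# the same settings the shell block above appends to ~/.theanorc
theanorc = """
[global]
floatX=float32
device=gpu
mode=FAST_RUN

[nvcc]
fastmath=True

[cuda]
root=/usr/local/cuda
"""

config = configparser.ConfigParser()
config.read_string(theanorc)

print(config["global"]["device"])  # gpu
print(config["cuda"]["root"])      # /usr/local/cuda
```

This catches typos like a misplaced section header before Theano silently falls back to the CPU.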
Install any other dependencies you may require.
Get CuDNN
To obtain CuDNN you must register with the NVIDIA developer programme here. The download page for CuDNN is here, and it's simplest to download the latest Library for Linux to your local machine and scp it over to EC2 as follows
scp -r ~/Downloads/cudnn-8.0-linux-x64-v5.1.tar ubuntu@[DNS]:/home/ubuntu/
where the [DNS] needs to be entered and the filename will differ as the software is updated. Once the scp has transferred the file, move to the active ssh session on the EC2 Instance and do the following to install CuDNN
tar -xvf cudnn-8.0-linux-x64-v5.1.tar
# use tar -xzf cudnn-8.0-linux-x64-v5.1.tgz if the extension is .tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/
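To confirm which CuDNN version actually ended up under /usr/local/cuda, you can read the version macros out of the copied cudnn.h header. A sketch (the CUDNN_MAJOR/CUDNN_MINOR/CUDNN_PATCHLEVEL macro names come from the CuDNN 5 header; the sample text below is an assumed excerpt, not the full file):

```python
import re

def cudnn_version(header_text):
    """Pull (major, minor, patch) from cudnn.h's version #defines."""
    vals = {}
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(r"#define\s+%s\s+(\d+)" % key, header_text)
        vals[key] = int(m.group(1)) if m else None
    return (vals["CUDNN_MAJOR"], vals["CUDNN_MINOR"],
            vals["CUDNN_PATCHLEVEL"])

# sample #define lines as they appear in a CuDNN 5.1 header (assumed)
sample = """
#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 5
"""
print(cudnn_version(sample))  # (5, 1, 5)

# on the instance:
# with open("/usr/local/cuda/include/cudnn.h") as f:
#     print(cudnn_version(f.read()))
```

A (None, None, None) result would mean the header never made it into the include directory.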
Now add the following to enable CNMeM
echo -e "\n[lib]\ncnmem=0.5" >> ~/.theanorc
A value between 0 and 1 allocates that fraction of the GPU memory to theano, so here we allocate 50% so as not to be stingy.
Now check that theano is configured properly by opening ipython and running
import theano.sandbox.cuda
theano.sandbox.cuda.use("gpu0")
which gave me the output
Using gpu device 0: Tesla K80 (CNMeM is enabled with initial size: 50.0% of memory, cuDNN 5105)
Run Code
Transfer the relevant code across to the Cloud, e.g.
- Pull from an existing git repository
- scp files across
If you are running code in a Spot Instance, I would recommend saving results at runtime
and passing them back to your local machine. It is sensible to pickle
the state of the neural net
at runtime so that you can easily continue the training process from a saved state rather than
having to run again from scratch.
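The checkpointing idea above can be sketched as follows. This is a minimal, framework-agnostic example under stated assumptions: params stands in for whatever state your network exposes, and checkpoint.pkl is a hypothetical filename you would scp back to your local machine periodically.

```python
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # hypothetical path; scp this back periodically

def save_state(params, epoch, path=CHECKPOINT):
    """Pickle the training state so a Spot interruption isn't fatal."""
    with open(path, "wb") as f:
        pickle.dump({"params": params, "epoch": epoch}, f)

def load_state(path=CHECKPOINT):
    """Resume from the last checkpoint if one exists, else return None."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None

# toy usage: resume (or start fresh), train, checkpoint after each epoch
state = load_state() or {"params": [0.0], "epoch": 0}
for epoch in range(state["epoch"], state["epoch"] + 3):
    state["params"][0] += 1.0  # stand-in for a real parameter update
    save_state(state["params"], epoch + 1)
```

Because the epoch counter is saved alongside the parameters, re-running the script after an interruption picks up where the last completed epoch left off rather than starting from scratch.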
Close
Don’t forget to Stop or Close the instance once it has completed the task!
Make sure that you check the instance has been closed, in addition to the Spot request, in the dashboard! I received a 31 hour bill for an unclosed GPU Compute instance that I thought I had closed, which was rather annoying.
In theory this can be automated by running the following as root after code has been executed
shutdown -h now
but I don’t particularly trust the methodology in practice.
Common Errors
CUDA Failures
A few common errors encountered with installing CUDA
No GPU
If no GPU exists you will receive the following output
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
NVIDIA: no NVIDIA devices found
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
The resolution is to cancel the instance and get a GPU instance if you require CUDA support.
Unknown symbol in module
This is a slightly more complicated issue that arose since CUDA 7.5
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
modprobe: ERROR: could not insert 'nvidia_361_uvm': Unknown symbol in module, or unknown parameter (see dmesg)
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
This error means that you didn’t install linux-image-extra-virtual as above, probably because you followed one of the guides in the references, which are now out of date. The resolution is fairly simple: run the following
# install the required library
sudo apt-get install linux-image-extra-virtual
# restart instance
sudo reboot
then wait a minute or so for the restart, ssh back in, and run the CUDA check again, which should give the following at the end of the output
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K80
Result = PASS
References
- Amazon Web Services
- Theano Docs
- Ubuntu
- Installing CUDA, OpenCL, and PyOpenCL on AWS EC2
- How to install Theano on Amazon EC2 GPU instances for deep learning
- StackOverflow: Self-Terminating AWS EC2 Instance
- StackOverflow: ERROR: could not insert ‘nvidia_361_uvm’: Unknown symbol in module
- NVIDIA Forums: Error installing nvidia drivers on x86_64 amazon ec2 gpu cluster (T20 GPU)
- NVIDIA Developer Programme
- CuDNN Download page
- Theano Docs: CNMeM