Getting Started

This section shows how to get started with SmallTrain.

Getting started with SmallTrain using Docker on NVIDIA DGX Station.

Getting Started with SmallTrain v0.1.2 using Docker on NVIDIA DGX Station with Ubuntu 18.04

(It is assumed that NVIDIA Docker is already installed.)

This guide shows how to install and set up SmallTrain on a DGX Station. Adjust the commands and paths below to match your own environment.

Check docker-compose

on the host

$ docker-compose -v
docker-compose version 1.22.0, build f46880fe

Install docker-compose if it is not already installed

on the host, as a sudo user

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose

Clone SmallTrain repository

on the host

$ mkdir -p ~/gitlab/geek-guild/
$ cd ~/gitlab/geek-guild/
# Authenticate with your GeekGuild GitLab account
$ git clone -b develop/v0.1.2 https://gitlab.geek-guild.net/geek-guild/smalltrain.git

# Or upload the SmallTrain source from your local machine instead:
# $ mkdir -p ~/gitlab/geek-guild/smalltrain/
# $ rsync -avz --delete -e "ssh -i $SSH_KEY_PATH" ~/gitlab/geek-guild/smalltrain/ $USER_NAME@$INS_IP_ADDR:/home/$USER_NAME/gitlab/geek-guild/smalltrain/

Clone GGUtils repository

on the host

$ mkdir -p ~/github/geek-guild/
$ cd ~/github/geek-guild/
# Authenticate with your GitHub account that has access to the github.com/geek-guild repository
$ git clone -b release/v0.0.3 https://github.com/geek-guild/ggutils.git

(Start the Docker daemon if it is not already running)

on the host, as a sudo user

$ sudo service docker start
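
To confirm that the Docker daemon is actually up before proceeding, you can query its status (the output format differs slightly between environments; look for "active (running)"):

$ sudo service docker status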

Build and run the Docker image

on the host

# SmallTrain
$ cd ~/gitlab/geek-guild/smalltrain/docker/
$ docker-compose up -d

Building smalltrain
Step 1/18 : FROM nvcr.io/nvidia/tensorflow:19.10-py3
...
Creating smalltrain ... done
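
If you later change the Dockerfile or docker-compose.yml, the same command with the standard --build flag rebuilds the image and recreates the containers (this is plain docker-compose behaviour, not specific to SmallTrain):

$ docker-compose up -d --build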

Check that the new SmallTrain container is running and note its CONTAINER ID

on the host

$ docker ps -a

CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                                              NAMES
YYYYYYYYYYYY        docker_smalltrain-redis   "docker-entrypoint.s…"   15 minutes ago      Up 15 minutes       0.0.0.0:6379->6379/tcp, 0.0.0.0:16379->16379/tcp   smalltrain-redis
XXXXXXXXXXXX        docker_smalltrain         "/usr/local/bin/entr…"   15 minutes ago      Up 15 minutes       0.0.0.0:6006->6006/tcp                             smalltrain
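
Rather than copying the CONTAINER ID from the output above by hand, you can capture it with docker ps. A minimal sketch, assuming the container is named smalltrain as shown in the NAMES column:

$ CONTAINER_ID=$(docker ps --format '{{.ID}} {{.Names}}' | awk '$2 == "smalltrain" {print $1}')
$ echo $CONTAINER_ID
XXXXXXXXXXXX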

Check the log of the running SmallTrain container

on the host

$ CONTAINER_ID=XXXXXXXXXXXX
$ docker logs $CONTAINER_ID

...
Exec operation id: IR_2D_CNN_V2_l49-c64_20200109-TRAIN
nohup: appending output to 'nohup.out'
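
To keep watching the training log as it is written instead of dumping it once, docker logs can follow it with its standard -f flag (stop with Ctrl+C):

$ docker logs -f $CONTAINER_ID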

Check GPU usage

on the host

$ watch -n 1 nvidia-smi

Every 1.0s: nvidia-smi          gg-sta-20200116-volta: Tue Jan 21 10:42:27 2020

Tue Jan 21 10:42:27 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   32C    P0    37W / 300W |    316MiB / 32475MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   33C    P0    35W / 300W |      1MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   34C    P0   104W / 300W |   2893MiB / 32478MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    36W / 300W |      1MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6903      G   /usr/lib/xorg/Xorg                            40MiB |
|    0      7058      G   /usr/bin/gnome-shell                         148MiB |
|    0     40041      G   /usr/lib/xorg/Xorg                            39MiB |
|    0     40082      G   /usr/bin/gnome-shell                          86MiB |
|    2      1441      C   python                                      2879MiB |
+-----------------------------------------------------------------------------+
  • Check that the GPU device specified by the environment variable (e.g. NVIDIA_VISIBLE_DEVICES=2) is the one being used.
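
One way to double-check which GPU was handed to the container is to inspect its environment and list the devices it can see. A minimal sketch, run on the host, assuming CONTAINER_ID is set as above and that nvidia-smi is available inside the container (it normally is when the container runs with the NVIDIA runtime):

# Confirm which GPU was assigned to the container
$ docker exec $CONTAINER_ID env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES=2

# List the devices actually visible inside the container
$ docker exec $CONTAINER_ID nvidia-smi -L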

Log in to the SmallTrain container

on the host

$ docker exec -it $CONTAINER_ID /bin/bash

Check the log of the tutorial operation

on the container

# check log
$ less /var/smalltrain/logs/IR_2D_CNN_V2_l49-c64_20200109-TRAIN.log

2020-01-20 14:34:45.125276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
...
========================================
step 49, training loss 0.103139
========================================
test cross entropy 0.44906
save model to save_file_path:/var/model/image_recognition/tutorials/tensorflow/model/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/model-nn_lr-0.001_bs-128.ckpt
DONE train data
====================

Run TensorBoard

on the container

$ nohup tensorboard --logdir /var/model/image_recognition/tutorials/tensorflow/logs/ &
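
The smalltrain container maps port 6006 to the host (see the docker ps output above), so TensorBoard should be reachable from a browser at http://<host>:6006. As a quick sanity check from the host, an HTTP 200 response indicates that TensorBoard is serving:

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:6006
200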

Check the result of the tutorial operation

on the container

# Report directory
$ ls -l /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/
total 424
-rw-r--r-- 1 root root  38074 Jan 20 15:12 all_variables_names.csv
-rw-r--r-- 1 root root  77687 Jan 20 15:13 prediction_e49_all.csv
-rw-r--r-- 1 root root  77687 Jan 20 15:13 prediction_e9_all.csv
-rw-r--r-- 1 root root     28 Jan 20 15:12 summary_layers_9.json
-rw-r--r-- 1 root root 109286 Jan 20 15:12 test_plot__.png
-rw-r--r-- 1 root root  55458 Jan 20 15:13 test_plot_e49_all.png
-rw-r--r-- 1 root root  54259 Jan 20 15:13 test_plot_e9_all.png
-rw-r--r-- 1 root root   6406 Jan 20 15:12 trainable_variables_names.csv
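
The report directory also contains plot images (the test_plot_*.png files listed above). To view one on the host, it can be copied out of the container with docker cp (run this on the host, not in the container):

$ docker cp $CONTAINER_ID:/var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/test_plot_e49_all.png .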

# Predictions after 49 steps of training
$ less /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv

DateTime,Estimated,MaskedEstimated,True
/var/data/cifar-10-image/test_batch/test_batch_i9_c1.png_0,1,0.0,1
/var/data/cifar-10-image/test_batch/test_batch_i90_c0.png_0,0,0.0,0
/var/data/cifar-10-image/test_batch/test_batch_i91_c3.png_0,6,0.0,3
/var/data/cifar-10-image/test_batch/test_batch_i92_c8.png_0,8,0.0,8
  • You can read the results as follows:
    each line shows “DateTime” (here, the image file path), “Estimated” (the predicted class), “MaskedEstimated”, and “True” (the true label). Comparing “Estimated” with “True” tells you whether an image was classified correctly. Therefore,

    this image is classified incorrectly:
    /var/data/cifar-10-image/test_batch/test_batch_i91_c3.png is estimated as class 6 but its true label is 3 (incorrect),

    this image is classified correctly:
    /var/data/cifar-10-image/test_batch/test_batch_i92_c8.png is estimated as class 8 and its true label is also 8 (correct)
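
To get a rough overall accuracy from this file, the Estimated and True columns can be compared line by line. A minimal sketch with awk, run on the container against the same CSV as above:

# Count lines where the estimated class (column 2) equals the true label (column 4)
$ awk -F, 'NR > 1 { total++; if ($2 == $4) correct++ } END { printf "accuracy: %.3f (%d/%d)\n", correct/total, correct, total }' /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv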

You're done!