Getting Started

Learn how to install and set up SmallTrain, then try a simple training demonstration using CIFAR-10 data.

How to run SmallTrain on a Linux server using Docker

SmallTrain trains small data on Linux server Docker

This example shows how to install and set up SmallTrain on an NVIDIA DGX Station, accessed from a macOS local machine, and how to run a training demo on CIFAR-10 data. Adjust the settings to suit your own environment, such as your Linux server.
Environment example: Linux server (NVIDIA DGX Station on Ubuntu 18.04), local machine (macOS)
(NVIDIA Docker is already installed)

Check docker-compose

on the host

$ docker-compose -v
docker-compose version 1.22.0, build f46880fe

Install docker-compose if it is not already installed

on the host, as a user with sudo privileges

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
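
The check and the install decision above can be combined into one step. This is only a sketch: have_cmd is a hypothetical helper name introduced here for illustration and is not part of SmallTrain or Docker.

```shell
# have_cmd (hypothetical helper): succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd docker-compose; then
  # Already installed: print the version, as in the check step
  docker-compose -v
else
  echo "docker-compose not found; install it with the curl and chmod commands above"
fi
```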

Clone SmallTrain repository

on the host

$ mkdir -p ~/github/geek-guild/
$ cd ~/github/geek-guild/
$ git clone https://github.com/geek-guild/smalltrain.git

Clone GGUtils repository

on the host

$ mkdir -p ~/github/geek-guild/
$ cd ~/github/geek-guild/
# Authenticate with your GitHub account for the github.com/geek-guild repository
$ git clone https://github.com/geek-guild/ggutils.git

If Docker is not running, start it. This requires sudo privileges.

on the host, as a user with sudo privileges

$ sudo service docker start

Create a Docker bridge network for SmallTrain

In Docker’s bridge network, containers connected on the same bridge network can communicate with each other.

  • Bridge network name: smalltrain_network
  • Subnet: 172.28.0.0/24
  • Gateway: 172.28.0.1

on the host

$ docker network create -d bridge smalltrain_network --gateway=172.28.0.1 --subnet=172.28.0.0/24
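
Running the create command a second time fails because the network already exists. The following sketch creates the network only when it is missing and then prints the subnet and gateway that Docker recorded. It assumes the Docker daemon is reachable by the current user, and it skips everything otherwise.

```shell
NETWORK=smalltrain_network

# Proceed only if the Docker daemon is installed and reachable
if docker info >/dev/null 2>&1; then
  # Create the bridge network only if it does not already exist
  docker network inspect "$NETWORK" >/dev/null 2>&1 ||
    docker network create -d bridge --subnet=172.28.0.0/24 --gateway=172.28.0.1 "$NETWORK"

  # Print the configured subnet and gateway for verification
  docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}} {{.Gateway}}{{end}}' "$NETWORK"
fi
```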

Build and run the Docker image

Run docker-compose to build the Docker image and start the containers. (This sets up SmallTrain in Docker.)

on the host

# SmallTrain
$ cd ~/github/geek-guild/smalltrain/docker/
$ docker-compose up -d

Building smalltrain
Step 1/18 : FROM nvcr.io/nvidia/tensorflow:19.10-py3
...
Creating smalltrain ... done

Check that the new SmallTrain container is running and note its CONTAINER ID

on the host

$ docker ps -a

CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                                              NAMES
YYYYYYYYYYYY        docker_smalltrain-redis   "docker-entrypoint.s…"   15 minutes ago      Up 15 minutes       0.0.0.0:6379->6379/tcp, 0.0.0.0:16379->16379/tcp   smalltrain-redis
XXXXXXXXXXXX        docker_smalltrain         "/usr/local/bin/entr…"   15 minutes ago      Up 15 minutes       0.0.0.0:6006->6006/tcp                             smalltrain

Check the log of the running SmallTrain container

on the host

$ CONTAINER_ID=XXXXXXXXXXXX
$ docker logs $CONTAINER_ID

...
Exec operation id: IR_2D_CNN_V2_l49-c64_20200109-TRAIN
nohup: appending output to 'nohup.out'
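
Instead of copying the CONTAINER ID by hand, it can be looked up by container name. This is a minimal sketch, assuming the container is named smalltrain as in the docker ps output above. pick_container_id is a hypothetical helper introduced here. It matches the name exactly so that smalltrain-redis is not picked up.

```shell
# pick_container_id (hypothetical helper): reads "ID NAME" lines on stdin
# and prints the ID of the line whose NAME equals the first argument.
pick_container_id() {
  awk -v name="$1" '$2 == name { print $1 }'
}

# Proceed only if the Docker daemon is installed and reachable
if docker info >/dev/null 2>&1; then
  CONTAINER_ID=$(docker ps --format '{{.ID}} {{.Names}}' | pick_container_id smalltrain)
  echo "smalltrain container: ${CONTAINER_ID:-not running}"
fi
```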

Check GPU usage

on the host

$ watch -n 1 nvidia-smi

Every 1.0s: nvidia-smi                                                                                                                                                                       gg-sta-20200116-volta: Tue Jan 21 10:42:27 2020

Tue Jan 21 10:42:27 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   32C    P0    37W / 300W |    316MiB / 32475MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   33C    P0    35W / 300W |      1MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   34C    P0   104W / 300W |   2893MiB / 32478MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    36W / 300W |      1MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6903      G   /usr/lib/xorg/Xorg                            40MiB |
|    0      7058      G   /usr/bin/gnome-shell                         148MiB |
|    0     40041      G   /usr/lib/xorg/Xorg                            39MiB |
|    0     40082      G   /usr/bin/gnome-shell                          86MiB |
|    2      1441      C   python                                      2879MiB |
+-----------------------------------------------------------------------------+
  • Check that the GPU device selected by the environment variable (e.g. NVIDIA_VISIBLE_DEVICES=2) is in use.
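
For a non-interactive check, nvidia-smi can print selected fields as CSV instead of the full table. In the sketch below, gpus_in_use is a hypothetical helper introduced for illustration. It treats a GPU as in use when its reported utilization is above 0%, which is a simplifying assumption.

```shell
# gpus_in_use (hypothetical helper): reads "index, utilization, memory" CSV
# lines on stdin and prints the index of each GPU with utilization above 0.
gpus_in_use() {
  awk -F',' '$2 + 0 > 0 { print $1 + 0 }'
}

# Query the GPUs only if nvidia-smi is available on this machine
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
             --format=csv,noheader,nounits | gpus_in_use
fi
```

With the values shown in the table above, the helper would print 2, matching NVIDIA_VISIBLE_DEVICES=2.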

Log in to the SmallTrain container

on the host

$ docker exec -it $CONTAINER_ID /bin/bash

Check the log of the tutorial operation

on the container

# check log
$ less /var/smalltrain/logs/IR_2D_CNN_V2_l49-c64_20200109-TRAIN.log

2020-01-20 14:34:45.125276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
...
========================================
step 49, training loss 0.103139
========================================
test cross entropy 0.44906
save model to save_file_path:/var/model/image_recognition/tutorials/tensorflow/model/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/model-nn_lr-0.001_bs-128.ckpt
DONE train data
====================

Run TensorBoard

on the container

$ nohup tensorboard --logdir /var/model/image_recognition/tutorials/tensorflow/logs/ &

Check the result of the tutorial operation

on the container

# Report directory
$ ls -l /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/
total 424
-rw-r--r-- 1 root root  38074 Jan 20 15:12 all_variables_names.csv
-rw-r--r-- 1 root root  77687 Jan 20 15:13 prediction_e49_all.csv
-rw-r--r-- 1 root root  77687 Jan 20 15:13 prediction_e9_all.csv
-rw-r--r-- 1 root root     28 Jan 20 15:12 summary_layers_9.json
-rw-r--r-- 1 root root 109286 Jan 20 15:12 test_plot__.png
-rw-r--r-- 1 root root  55458 Jan 20 15:13 test_plot_e49_all.png
-rw-r--r-- 1 root root  54259 Jan 20 15:13 test_plot_e9_all.png
-rw-r--r-- 1 root root   6406 Jan 20 15:12 trainable_variables_names.csv

# Predictions after 49 steps of training
$ less /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv

DateTime,Estimated,MaskedEstimated,True
/var/data/cifar-10-image/test_batch/test_batch_i9_c1.png_0,1,0.0,1
/var/data/cifar-10-image/test_batch/test_batch_i90_c0.png_0,0,0.0,0
/var/data/cifar-10-image/test_batch/test_batch_i91_c3.png_0,6,0.0,3
/var/data/cifar-10-image/test_batch/test_batch_i92_c8.png_0,8,0.0,8
  • Read the result as follows:
    each line has four comma-separated fields: “DateTime”, “Estimated”, “MaskedEstimated”, and “True”. In this tutorial the “DateTime” field holds the path of the test image (with a _0 suffix), “Estimated” is the predicted class, and “True” is the true label. Therefore,

    this prediction is incorrect:
    /var/data/cifar-10-image/test_batch/test_batch_i91_c3.png is estimated as 6 but the true label is 3 (incorrect),

    this prediction is correct:
    /var/data/cifar-10-image/test_batch/test_batch_i92_c8.png is estimated as 8 and the true label is also 8 (correct)
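
The same comparison can be applied to every line of the file to count correct predictions. This is a rough sketch, assuming the CSV layout shown above (one header line, then Estimated in the second field and True in the fourth) and no commas inside the image paths. accuracy_from_csv is a hypothetical helper name, not a SmallTrain command.

```shell
# accuracy_from_csv (hypothetical helper): counts the data lines in which
# the Estimated field (2nd) equals the True field (4th), skipping the header.
accuracy_from_csv() {
  awk -F',' 'NR > 1 { total++; if ($2 == $4) correct++ }
             END { if (total > 0) printf "%d/%d correct\n", correct, total }' "$1"
}

REPORT=/var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv

# Run only where the report file exists (i.e. on the container)
if [ -f "$REPORT" ]; then
  accuracy_from_csv "$REPORT"
fi
```

For the four sample lines listed above, the helper would report 3/4 correct.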

Done!