How to Prepare a Data Set

This section shows how to prepare a data set for training or prediction.

You can provide your own data set for training or prediction on SmallTrain.

Overview

In the operation file, you can check and edit the settings for the data set:

{
...
    "data_dir_path": "/var/data/cifar-10-image/",
    "data_set_def_path": "/var/data/cifar-10-image/data_set_def/train_cifar10_classification.csv",
    "cache_data_set_id": "train_cifar10_classification",
...
}

All you have to do is:

  1. Put the data files in the data directory, which is the local directory given by data_dir_path.
  2. Create a data set definition in CSV format at the local path given by data_set_def_path.
  3. Set cache_data_set_id to identify your data set.
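
Before starting a run, it can help to confirm that the three settings point at something real. The following is a minimal sketch, not part of SmallTrain: it assumes the operation file is plain JSON, and the helper names load_operation and check_data_set_settings are hypothetical.

```python
import json
from pathlib import Path

# The three data set settings described above.
REQUIRED_KEYS = ("data_dir_path", "data_set_def_path", "cache_data_set_id")


def load_operation(operation_path):
    """Read an operation file, assuming it is plain JSON."""
    with open(operation_path, encoding="utf-8") as f:
        return json.load(f)


def check_data_set_settings(operation):
    """Return a list of problems; an empty list means the settings look usable."""
    problems = [f"missing setting: {key}" for key in REQUIRED_KEYS
                if not operation.get(key)]
    if problems:
        return problems
    if not Path(operation["data_dir_path"]).is_dir():
        problems.append("data_dir_path is not an existing directory")
    if not Path(operation["data_set_def_path"]).is_file():
        problems.append("data_set_def_path is not an existing file")
    return problems
```

Calling check_data_set_settings(load_operation(path)) with the path of your operation file returns an empty list when all three settings are present and both paths exist.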

Data Set Structure

As an example, suppose you are accessing the server after running Getting Started (CIFAR-10 image classification). Your local directory /var/data/cifar-10-image/ has the following structure:

/var/data/cifar-10-image/
├── data_batch_1  // One of the training data directories
├── ...
├── data_batch_5  // One of the training data directories
├── data_set_def  // Data set definition directory
│   └── train_cifar10_classification.csv // Data set definition file for training and prediction
└── test_batch    // The testing data directory

The data set definition file (in this case /var/data/cifar-10-image/data_set_def/train_cifar10_classification.csv) is:

data_set_id,label,sub_label,test,group
/var/data/cifar-10-image/data_batch_1/data_batch_1_i0_c6.png,6,6,0,TRAIN
/var/data/cifar-10-image/data_batch_1/data_batch_1_i1_c9.png,9,9,0,TRAIN
/var/data/cifar-10-image/data_batch_1/data_batch_1_i2_c9.png,9,9,0,TRAIN
...
/var/data/cifar-10-image/data_batch_5/data_batch_5_i9999_c1.png,1,1,0,TRAIN
/var/data/cifar-10-image/test_batch/test_batch_i0_c3.png,3,3,1,TRAIN
...
/var/data/cifar-10-image/test_batch/test_batch_i9997_c5.png,5,5,1,TRAIN
/var/data/cifar-10-image/test_batch/test_batch_i9998_c1.png,1,1,1,TRAIN
/var/data/cifar-10-image/test_batch/test_batch_i9999_c7.png,7,7,1,TRAIN
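
A definition file like this one can be generated rather than written by hand. The sketch below is an illustration only and makes two assumptions drawn from the example rows, not from any SmallTrain rule: the class is encoded in the file name as _c<label>.png, and sub_label equals label. The names definition_row and build_definition are hypothetical.

```python
import csv
import io
import re

# Assumption: the class is encoded at the end of the file name, e.g. "_c6.png".
LABEL_PATTERN = re.compile(r"_c(\d+)\.png$")
HEADER = ["data_set_id", "label", "sub_label", "test", "group"]


def definition_row(file_path, test, group="TRAIN"):
    """Build one row; sub_label is set equal to label, as in the example rows."""
    match = LABEL_PATTERN.search(file_path)
    if match is None:
        raise ValueError(f"cannot read a label from {file_path!r}")
    label = int(match.group(1))
    return [file_path, label, label, test, group]


def build_definition(train_paths, test_paths):
    """Return the CSV text for the given training and testing file paths."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for path in train_paths:
        writer.writerow(definition_row(path, test=0))
    for path in test_paths:
        writer.writerow(definition_row(path, test=1))
    return out.getvalue()
```

The returned text can then be written to the path given by data_set_def_path.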

If you want to add a new data file /var/data/cifar-10-image/data_batch_6/data_batch_6_i10000_c9.png as training data labeled with class 9:

  1. Put the new file at the path: /var/data/cifar-10-image/data_batch_6/data_batch_6_i10000_c9.png
  2. Add the following row to the data set definition file.
/var/data/cifar-10-image/data_batch_6/data_batch_6_i10000_c9.png,9,9,0,TRAIN

As another example, if you want to add a new data file /var/data/cifar-10-image/test_batch/test_batch_i10000_c0.png as testing data labeled with class 0:

  1. Put the new file at the path: /var/data/cifar-10-image/test_batch/test_batch_i10000_c0.png
  2. Add the following row to the data set definition file.
/var/data/cifar-10-image/test_batch/test_batch_i10000_c0.png,0,0,1,TRAIN
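
Step 2 of either example can also be scripted. The following is a minimal sketch, assuming the definition file already exists with its header row; append_definition_row is a hypothetical helper, not a SmallTrain function. It refuses to add a data_set_id that is already listed, since that column serves as the unique id.

```python
import csv
from pathlib import Path


def append_definition_row(def_path, data_set_id, label, test,
                          group="TRAIN", sub_label=None):
    """Append one row to an existing data set definition file."""
    def_path = Path(def_path)
    text = def_path.read_text(encoding="utf-8")
    existing = {row["data_set_id"] for row in csv.DictReader(text.splitlines())}
    if data_set_id in existing:
        raise ValueError(f"{data_set_id} is already listed")
    if sub_label is None:
        sub_label = label  # same value as label, as in the example rows
    with def_path.open("a", encoding="utf-8", newline="") as f:
        # Avoid gluing the new row onto a last line that lacks a newline.
        if text and not text.endswith("\n"):
            f.write("\n")
        csv.writer(f, lineterminator="\n").writerow(
            [data_set_id, label, sub_label, test, group])
```

For the testing example above, the call would be append_definition_row(def_path, "/var/data/cifar-10-image/test_batch/test_batch_i10000_c0.png", label=0, test=1), where def_path is the value of data_set_def_path.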

Data Set Specifications

  • operation file

    • data_dir_path: String, the directory path which contains data files.
    • data_set_def_path: String, the file path of data set definition file.
    • cache_data_set_id: String, the identifier of the data set.
    • target_group: String, the identifier of the group to use as the data set (see also group in the data set definition).
  • data set definition file

    • format: csv
    • columns:
      • data_set_id: String, the file path of the data file. It also works as the unique id that represents the data file.
      • label: Integer, the label that represents the class of the data.
      • sub_label: Integer, the sub label, which is used if you want to label with a combination of label and sub_label.
      • test: Integer, the flag that indicates whether the data is used as testing data. If 1, the data is used as testing data.
      • group: String, the group identifier. To exclude a data file, set its group to a value different from target_group in the operation file.
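
The way the test and group columns interact can be illustrated with a short sketch. This is only one reading of the specification above, not SmallTrain's actual loader; split_definition is a hypothetical name, and treating every test value other than 1 as training data is an assumption.

```python
import csv
import io


def split_definition(csv_text, target_group):
    """Split definition rows into (train, test), skipping other groups."""
    train, test = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["group"] != target_group:
            continue  # excluded: group differs from target_group
        record = {
            "data_set_id": row["data_set_id"],
            "label": int(row["label"]),
            "sub_label": int(row["sub_label"]),
        }
        # Assumption: test == 1 marks testing data, anything else is training.
        (test if int(row["test"]) == 1 else train).append(record)
    return train, test
```

With target_group set to TRAIN, a row whose group is any other string is left out of both lists.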