Versioning

The following ingredients need to be versioned:

  1. Code: Code is the main ingredient. It carries out the pre-processing steps and trains and validates the models. Each change to the code, whether in pre-processing or in modeling, produces a new version that needs to be tracked.
  2. Data (training and validation): Changes in training data may significantly affect model performance, so training data needs to be versioned. Validation data is important for detecting and preventing overfitting, so it needs to be versioned as well.
  3. Parameters: Hyperparameters, usually found during parameter tuning, strongly influence the result. A change in hyperparameters may improve a model or ruin it, so hyperparameters need to be versioned.
  4. Models: A model is the result of all the ingredients above, so it can be reproduced if they are versioned. A model can also be saved as a file and versioned, and that file can then be used for deployment.
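
As a rough sketch of how the four ingredients can be tied together, the Python snippet below records one content hash per ingredient in a manifest. The function names (fingerprint, build_manifest), the manifest keys, and the toy byte strings are assumptions made for illustration; they are not part of any particular tool.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying one exact version of `content`."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(code: bytes, train: bytes, valid: bytes,
                   params: dict, model: bytes) -> dict:
    """Collect one fingerprint per ingredient so a model can be traced to its inputs."""
    # Serialize parameters with sorted keys so equal dicts always hash the same.
    params_bytes = json.dumps(params, sort_keys=True).encode("utf-8")
    return {
        "code": fingerprint(code),
        "train_data": fingerprint(train),
        "validation_data": fingerprint(valid),
        "params": fingerprint(params_bytes),
        "model": fingerprint(model),
    }


# Toy stand-ins for real file contents (hypothetical values).
manifest = build_manifest(
    code=b"def preprocess(x): return x",
    train=b"x,y\n1,2\n3,4\n",
    valid=b"x,y\n5,6\n",
    params={"learning_rate": 0.01, "max_depth": 6},
    model=b"serialized-model-bytes",
)
print(json.dumps(manifest, indent=2))
```

If any single ingredient changes, only its entry in the manifest changes, which makes it easy to see what differs between two model versions.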

Hyperparameters and the parameters that control the training/validation split can be versioned together with code. Training and validation data may be versioned together.
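
One possible way to version parameters with code is to keep them in a source file in the same repository, so every commit captures both. The sketch below assumes a hypothetical PARAMS dictionary and a hypothetical split_indices helper; the specific parameter names and values are illustrative only.

```python
import random

# Hypothetical parameters, committed alongside the code that uses them.
PARAMS = {
    "learning_rate": 0.01,
    "max_depth": 6,
    # Splitting parameters: they fix which rows go to training vs. validation.
    "split": {"validation_fraction": 0.2, "seed": 42},
}


def split_indices(n_rows: int, validation_fraction: float, seed: int):
    """Return (train_indices, validation_indices) for a reproducible random split."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_valid = int(round(n_rows * validation_fraction))
    # The first n_valid shuffled rows become validation data; the rest are for training.
    return sorted(indices[n_valid:]), sorted(indices[:n_valid])


train_idx, valid_idx = split_indices(10, **PARAMS["split"])
print("train rows:", train_idx)
print("validation rows:", valid_idx)
```

Because the seed and the validation fraction live in versioned code, checking out an old commit and running it on the same versioned data yields the same split.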

Software for code versioning includes Git, with repositories hosted on Bitbucket (available at stash.uconn.edu) and a UI such as SourceTree. Software for data versioning includes DVC (https://dvc.org/).