Forking for code versioning

The fork workflow is intended to streamline how a Data Science team manages its code.

At the beginning of a Data Science project, the Lead Data Scientist (LDS) creates an empty remote public repository in Bitbucket (https://stash.uconn.edu). By default, this repository is referred to as Origin and has a single master branch. The LDS then creates an associated local repository and pushes the first copy of the code to the remote repository.

When a Data Scientist joins a Data Science project, they fork Origin/master to create their own remote public repository. The Data Scientist does this through the web interface at stash.uconn.edu. It is recommended that the name of the new remote public repository be suffixed with the Data Scientist’s name. The Data Scientist should also disable the “Allow forks” setting on the new repository so that others cannot fork it, since everyone should fork from the main public repository.

Next, the Data Scientist clones their own public repository to a local folder on their computer. SourceTree is the recommended Git UI, though many prefer using Git on the command line.
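For those working on the command line, the clone step might look like the sketch below. It is an illustration under stated assumptions rather than a prescribed procedure: two local bare repositories in a temporary directory stand in for the main repository and the Data Scientist's fork on stash.uconn.edu, and the names main.git, testapp-jdoe.git and the remote name upstream are hypothetical.

```shell
#!/bin/sh
set -e

# Stand-ins for the remote repositories (hypothetical names); on the real
# server these would be URLs under https://stash.uconn.edu.
WORK=$(mktemp -d)
git init --quiet --bare "$WORK/main.git"
git -C "$WORK/main.git" symbolic-ref HEAD refs/heads/master

# Setup only: seed the main repository with one commit, as the LDS would.
git init --quiet "$WORK/seed"
cd "$WORK/seed"
git symbolic-ref HEAD refs/heads/master
echo "print('hello')" > model.py
git add --all
git -c user.name="Lead" -c user.email="lead@example.com" \
    commit --quiet -m "Initial Commit"
git remote add origin "$WORK/main.git"
git push --quiet -u origin master

# Setup only: the fork is created in the web interface; a bare clone
# imitates it here.
git clone --quiet --bare "$WORK/main.git" "$WORK/testapp-jdoe.git"

# The step described above: clone the fork to a local folder.
git clone --quiet "$WORK/testapp-jdoe.git" "$WORK/local"
cd "$WORK/local"

# Assumption: register the main repository under the name "upstream"
# so that changes from other Data Scientists can be pulled later.
git remote add upstream "$WORK/main.git"
git remote -v
```

After cloning, origin refers to the Data Scientist's own fork. Adding the main repository as a second remote is one common way to prepare for pulling from it later.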

In their day-to-day work, Data Scientists:

  1. Stage and commit changes to the local repository. Note that after this step, the changes still exist only on the local machine.
  2. Push changes to their own public repository. This ensures that the changes are safely stored in a remote location in case something happens to the local machine.
  3. Open a pull request. This allows the Lead Data Scientist to review the changes and discuss modifications and improvements. The request may be approved or rejected.
  4. Pull changes from the main repository that the LDS is in charge of, which contains changes merged from other Data Scientists’ pull requests. This keeps the local repository up to date with the main public repository.
    • Note: In some cases, pulling from one's own remote public repository is needed as well (e.g. when setting up a new computer).
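The daily cycle above can be sketched on the command line as follows. This is a minimal sketch under assumptions, not the team's mandated commands: local bare repositories stand in for the main repository and the personal fork, and the names main.git, testapp-jdoe.git, upstream, and the file contents are hypothetical. Step 3 (the pull request) happens in the web interface and has no git command.

```shell
#!/bin/sh
set -e

# Setup only: local bare repositories stand in for the main repository
# and the personal fork (hypothetical names).
WORK=$(mktemp -d)
git init --quiet --bare "$WORK/main.git"
git -C "$WORK/main.git" symbolic-ref HEAD refs/heads/master
git init --quiet "$WORK/lead"
cd "$WORK/lead"
git symbolic-ref HEAD refs/heads/master
git config user.name "Lead"
git config user.email "lead@example.com"
echo "layers = 3" > model.py
git add --all
git commit --quiet -m "Initial Commit"
git remote add origin "$WORK/main.git"
git push --quiet -u origin master
git clone --quiet --bare "$WORK/main.git" "$WORK/testapp-jdoe.git"
git clone --quiet "$WORK/testapp-jdoe.git" "$WORK/local"
git -C "$WORK/local" remote add upstream "$WORK/main.git"

# Setup only: a colleague's change lands in the main repository.
echo "notes" > README.txt
git add README.txt
git commit --quiet -m "Add README"
git push --quiet origin master

cd "$WORK/local"
git config user.name "Data Scientist"
git config user.email "ds@example.com"

# 1. Stage and commit: the change exists only in the local repository.
echo "layers = 5" > model.py
git add model.py
git commit --quiet -m "Increase number of layers"

# 2. Push to the personal public repository (the fork).
git push --quiet origin master

# 3. Open the pull request in the web interface (no git command).

# 4. Pull from the main repository to pick up merged changes.
#    --no-rebase merges explicitly; --no-edit accepts the default message.
git pull --quiet --no-rebase --no-edit upstream master
git --no-pager log --oneline
```

Naming the remote explicitly in each push and pull makes it clear which repository is written to (the fork) and which is read from (the main repository).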

The branching strategy for git is parameter-based rather than feature-based as in Software Development: a branch typically corresponds to a change in model parameters or architecture, e.g. additional layers in a Deep Learning model.
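As an illustration of parameter-based branching, the sketch below keeps a baseline configuration on master and records a variant on its own branch. The file name params.txt, the branch name layers-5, and the parameter values are hypothetical, chosen only to mirror the "additional layers" example.

```shell
#!/bin/sh
set -e

# Throwaway repository for the illustration.
WORK=$(mktemp -d)
git init --quiet "$WORK/project"
cd "$WORK/project"
git symbolic-ref HEAD refs/heads/master
git config user.name "Data Scientist"
git config user.email "ds@example.com"

# Baseline model configuration on master (hypothetical file and value).
echo "hidden_layers = 3" > params.txt
git add params.txt
git commit --quiet -m "Baseline: 3 hidden layers"

# One branch per parameter variant rather than per feature.
git checkout --quiet -b layers-5
echo "hidden_layers = 5" > params.txt
git commit --quiet -am "Experiment: 5 hidden layers"

# Compare the variant against the baseline.
git --no-pager diff master layers-5 -- params.txt
```

Each variant stays isolated on its branch, so the baseline on master is untouched until a variant is accepted through a pull request.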

(Image source: dvc.org)

A list of file extensions should be added to .gitignore so that git does not track the matching files, typically *.csv, *.xls, *.xlsx, *.h5, *.p, *.pkl, *.rds.

The following git commands add, commit, and push code to the master branch of the remote public repository (origin):

cd C:\new\ibi\apps\testapp
git init
git add --all
git commit -m "Initial Commit"
git remote add origin https://stash.uconn.edu/scm/srpm/testapp.git
git push -u origin master

Many file types are pushed using git, such as:

  • Python code: *.py, *.ipynb
  • R code: *.Rmd, *.nb.html, *.R
  • WebFocus code: *.mas, *.acx, *.fex
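To see how the ignore list interacts with the file types above, the following sketch writes the recommended patterns to .gitignore and asks git which files it would skip. The file names train.csv and model.py are hypothetical examples; the patterns are the ones listed earlier.

```shell
#!/bin/sh
set -e

# Throwaway repository for the illustration.
WORK=$(mktemp -d)
git init --quiet "$WORK/project"
cd "$WORK/project"

# Ignore data and serialized-object files, one pattern per line.
cat > .gitignore <<'EOF'
*.csv
*.xls
*.xlsx
*.h5
*.p
*.pkl
*.rds
EOF

# Hypothetical files: one data file and one source file.
touch train.csv model.py

# check-ignore exits 0 for an ignored path and 1 otherwise.
if git check-ignore --quiet train.csv; then
    echo "train.csv is ignored"
fi
if ! git check-ignore --quiet model.py; then
    echo "model.py is not ignored"
fi
```

With this file in place, git add --all stages model.py and .gitignore itself but leaves train.csv out of the commit.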