Our Data Science Lifecycle is based on Microsoft Azure standards, with added features to accommodate additional requirements. It describes the goals, tasks, and deliverables of each stage. The lifecycle is divided into the following stages:
- Business understanding
- Data acquisition and understanding
- Modeling
- Deployment
- Customer acceptance
Business understanding
- Goals:
- Specify the key variables that are to serve as the model targets and whose related metrics are used to determine the success of the project.
- Identify the relevant data sources that the business has access to or needs to obtain.
- How to do it: There are two main tasks addressed in this stage:
- Define objectives: Work with your customer and other stakeholders to understand and identify the business problems.
- A target variable needs to be specified. The target variable is the one that you want to predict or understand more deeply (e.g., whether a student will enroll).
- Formulate questions that define the business goals that the data science techniques can target.
- How much or how many? (regression)
- Which category? (classification)
- Which group? (clustering)
- Is this weird? (anomaly detection)
- Which option should be taken? (recommendation)
- Define the success metrics: for example, Root Mean Square Error (RMSE) or Mean Absolute Error (MAE) for a regression problem, or accuracy for a classification problem.
- Identify data sources: Find the relevant data that helps you answer the questions that define the objectives of the project.
- Artifacts:
- Charter document: The charter document is a living document. You update the template throughout the project as you make new discoveries and as business requirements change. The key is to iterate upon this document, adding more detail, as you progress through the discovery process. Keep the customer and other stakeholders involved in making the changes and clearly communicate the reasons for the changes to them.
- Data sources: The Raw data sources section of the Data definitions report contains the data sources. This section specifies the original and destination locations for the raw data. In later stages, you fill in additional details like the scripts to move the data to your analytic environment.
- Data dictionaries: This document provides descriptions of the data that’s provided by the client. These descriptions include information about the schema (the data types and information on the validation rules, if any) and the entity-relation diagrams, if available.
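The success metrics named in the "Define the success metrics" task can be written down precisely before any modeling starts. The following is a minimal sketch using only the Python standard library; the function names and the sample values are illustrative assumptions, not part of the lifecycle itself.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: squares each error, so large misses weigh more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the error, in the target's own units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Share of predictions that match the actual class label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical regression target ("how much or how many?").
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]
regression_rmse = rmse(actual, predicted)  # sqrt(1.25), about 1.118
regression_mae = mae(actual, predicted)    # 0.75

# Hypothetical classification target ("which category?"), e.g. enrolled = 1.
enrolled_actual = [1, 0, 1, 1, 0]
enrolled_predicted = [1, 0, 0, 1, 0]
classification_accuracy = accuracy(enrolled_actual, enrolled_predicted)  # 0.8
```

RMSE and MAE differ here because the single error of 2 is squared in RMSE; which one to record in the charter document depends on how costly large errors are for the business.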
Data acquisition and understanding
- Goals
- Produce a clean, high-quality data set whose relationship to the target variables is understood. Locate the data set in the appropriate analytics environment so you are ready to model.
- Develop a solution architecture of the data pipeline that refreshes and scores the data regularly.
- How to do it
- Ingest the data into the target analytic environment.
- Explore the data to determine if the data quality is adequate to answer the question.
- Set up a data pipeline to score new or regularly refreshed data (online prediction).
- Artifacts: A data quality report that includes data summaries (count, mean, std, min, 25%, 50%, 75%, max), the relationship between each attribute and the target, and a variable ranking.
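As a rough illustration of the data quality report, the sketch below computes the listed summary statistics for each attribute and ranks attributes by the strength of their linear relationship with the target. It uses only the Python standard library. The column names, the sample values, and the choice of Pearson correlation as the ranking criterion are assumptions made for this example; other measures may suit a given project better.

```python
import math
import statistics

def summarize(values):
    """Summary statistics listed in the data quality report."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }

def pearson(xs, ys):
    """Pearson correlation between one attribute and the target."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_variables(columns, target):
    """Order attributes by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical attributes and target (enrolled = 1).
columns = {
    "campus_visits": [0, 1, 2, 3, 4],
    "distance_km": [10, 50, 20, 40, 30],
}
enrolled = [0, 0, 1, 1, 1]

report = {name: summarize(values) for name, values in columns.items()}
ranking = rank_variables(columns, enrolled)

for name, score in ranking:
    print(f"{name}: correlation with target = {score:.3f}")
```

Correlation only captures linear relationships, so a low score in the ranking is a prompt for further exploration rather than proof that an attribute is irrelevant.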
Modeling
- Goals
- Determine the optimal data features for the machine-learning model.
- Create an informative machine-learning model that predicts the target most accurately.
- Create a machine-learning model that’s suitable for production.
- How to do it
- Feature engineering: Create data features from the raw data to facilitate model training.
- Feature engineering involves the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis. If you want insight into what is driving a model, then you need to understand how the features relate to each other and how the machine-learning algorithms are to use those features.
- This step requires a creative combination of domain expertise and the insights obtained from the data exploration step. Unrelated variables introduce unnecessary noise into the model and should be avoided.
- Model training: Find the model that answers the question most accurately by comparing the success metrics of the candidate models.
- Split the input data randomly for modeling into a training data set and a test data set.
- Build the models by using the training data set.
- Evaluate the models on the training and the test data sets. Use a series of competing machine-learning algorithms, along with their associated tuning parameters, that are geared toward answering the question of interest with the current data.
- Determine the “best” solution to answer the question by comparing the success metrics between alternative methods.
- Determine if your model is suitable for production.
- Artifacts
- Raw data sources, the processed/transformed data, and feature sets
- Model report for each model that’s tried
- What algorithm
- Which parameters
- Accuracy, precision, recall, ROC, AUC
- Variable importance
- Discussion of overfitting if applicable
- Checkpoint decision: Evaluate whether the model performs well enough for production. Some key questions to ask are:
- Does the model answer the question with sufficient confidence given the test data?
- Should you try any alternative approaches? Should you collect additional data, do more feature engineering, or experiment with other algorithms?
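The model training steps and the checkpoint decision can be sketched end to end as follows. This is a deliberately small, standard-library-only illustration, not a recommended modeling stack: the two candidate models (a majority-class baseline and a one-feature decision stump), the sample data, and the `MIN_TEST_ACCURACY` threshold are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly split rows into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_stump(train):
    """One-feature decision stump: pick the cutoff with the best training accuracy."""
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in train}):
        acc = sum(int(x >= cut) == y for x, y in train) / len(train)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return lambda x: int(x >= best_cut)

def accuracy(model, rows):
    """Success metric: share of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Hypothetical data: (campus_visits, enrolled) pairs with labels 0 or 1.
rows = [(x, int(x >= 3)) for x in range(8)] * 3
train, test = train_test_split(rows)

# Build each competing model on the training set, then evaluate on both sets.
candidates = {"majority_baseline": fit_majority, "decision_stump": fit_stump}
report = {}
for name, fit in candidates.items():
    model = fit(train)
    report[name] = {"train": accuracy(model, train), "test": accuracy(model, test)}

# Pick the "best" model by test accuracy, then apply the checkpoint decision.
best_name = max(report, key=lambda name: report[name]["test"])
MIN_TEST_ACCURACY = 0.8  # assumed acceptance threshold, agreed with the customer
ready_for_production = report[best_name]["test"] >= MIN_TEST_ACCURACY

print(report, best_name, ready_for_production)
```

Recording both the training and the test score in the model report makes overfitting visible: a model that scores much higher on the training set than on the test set has likely memorized the training data.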
Deployment
- Goal: Deploy models with a data pipeline to a production or production-like environment for final user acceptance.
- How to do it: Deploy the model and pipeline to a production or production-like environment for application consumption.
- Artifacts:
- A status dashboard that displays the key metrics
- A scoring script that reads data and scores it
- Reports, charts, and dashboards that consume the model
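A scoring script of the kind listed above might look like the following minimal sketch. It reads rows from CSV, appends a prediction to each row, and writes the result back out as CSV. The column names, the `load_model` stand-in, and the scoring rule inside it are hypothetical; a real script would load the trained model produced in the modeling stage and read from the project's actual data locations.

```python
import csv
import io

def load_model():
    """Stand-in for loading the trained model; returns a row-scoring function."""
    # Hypothetical rule used only for illustration.
    return lambda row: 1 if float(row["campus_visits"]) >= 3 else 0

def score(source, target, model):
    """Read CSV rows from source, add a prediction column, write CSV to target.

    Returns the number of rows scored, which a status dashboard could display.
    """
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["prediction"]
    writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    scored = 0
    for row in reader:
        row["prediction"] = model(row)
        writer.writerow(row)
        scored += 1
    return scored

# In-memory streams stand in for the real input and output files.
incoming = io.StringIO("student_id,campus_visits\n101,0\n102,4\n103,3\n")
outgoing = io.StringIO()
n_scored = score(incoming, outgoing, load_model())

print(outgoing.getvalue())
```

Because `score` accepts any file-like objects, the same function can be pointed at files opened with `open(...)` in a scheduled pipeline without changing the scoring logic.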
Customer acceptance
- Goal
- Finalize project deliverables: Confirm that the pipeline, the model, and their deployment in a production environment satisfy the customer’s objectives.
- How to do it
- System validation: Confirm that the deployed model and pipeline meet the customer’s needs.
- Project hand-off: Hand the project off to the entity that’s going to run the system in production.
- The customer should validate that the system meets their business needs and answers the questions with acceptable accuracy before it is deployed to production for use by their client’s application or reports and charts.
- The team provides the Exit report of the project for the customer. This technical report contains all the details of the project that are useful for learning how to operate the system.
- Artifacts
- Exit report of the project for the customer.