Data Leakage

In a machine learning project, the order in which certain steps are performed can seriously affect model quality. A common mistake is to run the steps in an order that leaks information from the test data into the model, which makes validation and test scores look better than the model really is.

A typical example is performing feature preprocessing and feature selection on the whole dataset first, and only then doing the train/validation/test split, followed by model training and validation. Imputing missing values with a mean or median computed on the whole dataset is a mistake, because that mean or median contains knowledge from the validation and test sets. The same applies to min-max feature scaling, where the min and max would carry knowledge from the validation and test sets, and to standardization, where the mean and standard deviation leak information from the validation and test sets to the model. Likewise, eigenvectors and eigenvalues (as used in PCA, for example) computed on the whole dataset differ from those computed on the training set alone, which means they carry additional knowledge from the test set. The correct way is to fit these steps on the training data only and then apply the fitted transformations to the validation and test data, which is easiest to get right by chaining the steps into a pipeline. With cross-validation, feature preprocessing, feature selection, and model training are then fitted only on the training folds of each split.
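
The difference between the leaky and the correct order can be sketched with standardization on a single feature. This is only a minimal illustration using the Python standard library: the data values are made up, and the helper names (fit_standardizer, standardize, k_fold_indices) are hypothetical, not part of any library.

```python
import statistics

def fit_standardizer(values):
    """Learn the mean and (population) standard deviation from `values` only."""
    return statistics.mean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]

def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) for k contiguous folds over n_rows rows."""
    fold_size = n_rows // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_rows if fold == k - 1 else start + fold_size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, val_idx

# Toy feature column (made-up numbers). Split BEFORE computing any statistics.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, 30.0, 40.0]
train, test = data[:8], data[8:]

# Leaky: the statistics see the test rows.
leaky_mean, leaky_std = fit_standardizer(data)

# Correct: fit on the training rows, then reuse those statistics on the test rows.
train_mean, train_std = fit_standardizer(train)
test_scaled = standardize(test, train_mean, train_std)

print(f"mean from train only: {train_mean}, mean from all rows: {leaky_mean}")

# Cross-validation: refit the standardizer inside every fold,
# using only that fold's training rows.
fold_stats = []
for train_idx, val_idx in k_fold_indices(len(train), k=4):
    fold_train = [train[i] for i in train_idx]
    fold_val = [train[i] for i in val_idx]
    fold_mean, fold_std = fit_standardizer(fold_train)
    fold_train_scaled = standardize(fold_train, fold_mean, fold_std)
    fold_val_scaled = standardize(fold_val, fold_mean, fold_std)
    fold_stats.append((fold_mean, fold_std))
    # A model would be trained on fold_train_scaled and scored on fold_val_scaled here.
```

With these toy numbers the training-only mean is 5.0, while the mean over all rows is 11.0, so a standardizer fitted on the whole dataset would already encode the two large test values. Libraries such as scikit-learn package the same fit-on-train, apply-elsewhere pattern as a Pipeline object that can be passed to cross-validation helpers, so that every step is refitted inside each fold.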