Imbalanced Data

Typically, student-at-risk data is imbalanced (skewed): the proportion of students labeled at-risk is very small compared to the number of students who are doing fine. Training a model on imbalanced data usually requires some form of resampling, such as oversampling the minority class or undersampling the majority class, to balance the data.
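
For concreteness, here is a minimal sketch of what such a skew looks like (the 950/50 split below is a made-up illustration, not real data):

from collections import Counter

# y holds one label per student: 1 = at-risk, 0 = doing fine
y = [0] * 950 + [1] * 50

print(Counter(y))              # Counter({0: 950, 1: 50}) -- a 19:1 skew
print(Counter(y)[1] / len(y))  # 0.05 -- only 5% of students are at-risk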

(From the imbalanced-learn documentation) Both SMOTE and ADASYN use the same algorithm to generate new samples. Considering a sample x_i, a new sample x_{new} is generated from its k nearest neighbors (corresponding to the k_neighbors parameter). One of these nearest neighbors, x_{zi}, is selected, and a new sample is generated as follows:

x_{new} = x_i + \lambda \times (x_{zi} - x_i)

where \lambda is a random number in the range [0, 1]. This interpolation creates a new sample on the line segment between x_i and x_{zi}.
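
The interpolation itself is a one-liner. A minimal NumPy sketch (the two points and the neighbor choice are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

x_i = np.array([1.0, 2.0])   # the minority-class sample being oversampled
x_zi = np.array([3.0, 1.0])  # one of its k nearest neighbors

lam = rng.uniform(0, 1)            # lambda drawn uniformly from [0, 1]
x_new = x_i + lam * (x_zi - x_i)   # lands on the segment between x_i and x_zi
print(x_new)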

In the following example, the training data is resampled with SMOTE to balance the classes:

from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training set, so no synthetic
# samples leak into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train)
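
Since SMOTE only synthesizes minority-class samples, the resampled labels should come out balanced. A quick sanity check, assuming the same y_train and y_resampled as above:

from collections import Counter

print(Counter(y_train))      # skewed, e.g. many more 0s than 1s
print(Counter(y_resampled))  # both classes now equal in size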

In Azure, AutoML has class balancing detection enabled by default. The criterion used to decide whether a classification dataset is imbalanced is:

(size of smallest class / size of entire dataset) / (1 / number of classes) <= 0.1, or the size of the smallest class is less than 5. In other words, the dataset is flagged when the smallest class holds at most 10% of the samples it would hold under a perfectly balanced split, or when it has fewer than 5 samples.
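
A minimal sketch of this check (looks_imbalanced is a hypothetical helper, not Azure's actual implementation; y is assumed to be a list or array of class labels):

from collections import Counter

def looks_imbalanced(y, threshold=0.1, min_class_size=5):
    counts = Counter(y)              # size of each class
    smallest = min(counts.values())  # size of the smallest class
    # (smallest / total) normalized by the perfectly balanced share (1 / k)
    balance_ratio = (smallest / len(y)) / (1 / len(counts))
    return balance_ratio <= threshold or smallest < min_class_size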