Typically, student-at-risk data is imbalanced (skewed): the proportion of students labeled at-risk is very small compared to the number of students who are doing fine. Training a model on imbalanced data usually requires some manipulation, such as oversampling the minority class or undersampling the majority class, to balance the data.
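As a minimal sketch of the two approaches, using imbalanced-learn's RandomOverSampler and RandomUnderSampler on a synthetic dataset (the dataset and its 95/5 split are made up for illustration):

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: ~5% of samples in the "at-risk" class (1)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
print(Counter(y))  # e.g. roughly Counter({0: 1900, 1: 100})

# Oversampling: duplicate minority-class samples until the classes match
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_over))  # both classes now equal in size

# Undersampling: drop majority-class samples until the classes match
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))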
(From the imbalanced-learn documentation) Both SMOTE and ADASYN use the same algorithm to generate new samples. Considering a sample $x_i$, a new sample $x_{new}$ will be generated considering its k nearest neighbors (corresponding to `k_neighbors`). For instance, the 3 nearest neighbors are included in the blue circle as illustrated in the figure below. Then, one of these nearest neighbors, $x_{zi}$, is selected and a sample is generated as follows:

$$x_{new} = x_i + \lambda \times (x_{zi} - x_i)$$

where $\lambda$ is a random number in the range $[0, 1]$. This interpolation will create a sample on the line between $x_i$ and $x_{zi}$, as illustrated in the image below:
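To make the interpolation concrete, here is a small NumPy sketch of the generation step above (the sample values are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# x_i: a minority-class sample; x_zi: one of its k nearest minority neighbors
x_i = np.array([2.0, 3.0])
x_zi = np.array([4.0, 5.0])

# lambda is drawn uniformly from [0, 1], so x_new lands on the segment
# between x_i and x_zi
lam = rng.uniform(0, 1)
x_new = x_i + lam * (x_zi - x_i)
print(x_new)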
In the following example, the training data is resampled with SMOTE to become balanced:
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.model_selection import train_test_split

# Split first, then resample only the training set so the test set
# keeps the original class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train)
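To verify the effect, you can compare class counts before and after resampling (a small follow-up sketch; `y_train` and the imports come from the block above):

from collections import Counter

print(Counter(y_train))      # imbalanced, e.g. far fewer at-risk labels
print(Counter(y_resampled))  # classes equalized by SMOTE

# ADASYN is used the same way, but concentrates generation on
# harder-to-learn samples (those with more majority-class neighbors)
X_ada, y_ada = ADASYN(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_ada))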
In Azure, AutoML has class balancing detection enabled by default. The criterion used to decide whether a classification dataset is imbalanced is the following:
(size of smallest class / size of entire dataset) / (1 / number of classes) <= 0.1, or the size of the smallest class is less than 5 samples.
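As a hypothetical sketch of that check (the function `is_imbalanced` is my own illustration of the criterion above, not part of the Azure SDK):

from collections import Counter

def is_imbalanced(y):
    """Apply the imbalance criterion described above to a label sequence."""
    counts = Counter(y)
    smallest = min(counts.values())
    n_classes = len(counts)
    total = sum(counts.values())
    # Smallest class's share of the data, relative to its share under a
    # perfectly balanced distribution (1 / number of classes)
    relative_share = (smallest / total) / (1 / n_classes)
    return relative_share <= 0.1 or smallest < 5

# Example: 3 classes, smallest holds 1% of the data (balanced share: ~33%)
labels = [0] * 490 + [1] * 500 + [2] * 10
print(is_imbalanced(labels))  # True: (10/1000) / (1/3) = 0.03 <= 0.1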