Improving Model Accuracy

If the training accuracy is low, it is an indication that the model did not capture the complexity of the data. One remedy is to adjust the training parameters, such as the maximum number of iterations or the number of variables. Another is to try different algorithms.
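As a minimal sketch of both remedies, the snippet below raises the iteration cap of a linear model and then tries a more flexible algorithm on the same data. The dataset and model choices here are illustrative, not taken from the text.

```python
# Sketch: two remedies for low training accuracy --
# (1) allow more optimizer iterations, (2) try a different algorithm.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Raise max_iter so the solver has room to converge.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"logistic regression train accuracy: {clf.score(X, y):.3f}")

# Try a more flexible algorithm on the same data.
rfc = RandomForestClassifier(random_state=0).fit(X, y)
print(f"random forest train accuracy: {rfc.score(X, y):.3f}")
```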

If the validation accuracy is low, the model may have overfitted: it learned details specific to the training set that do not generalize to other examples. For Lasso, Ridge, ElasticNet, or deep learning models, we may want to reduce the number of training iterations to prevent the model from learning too much about the training data. For a random forest, we may want to reduce the maximum depth of the trees, so that nodes are not expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
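The random-forest remedy can be sketched as follows: compare cross-validated accuracy for unconstrained trees against trees capped at a small max_depth. The data and the depth value of 5 are illustrative assumptions.

```python
# Sketch: capping max_depth to curb overfitting in a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative synthetic data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Unconstrained trees grow until leaves are (nearly) pure;
# limiting max_depth stops expansion earlier, trading training
# fit for better generalization.
for name, model in [
    ("unconstrained", RandomForestClassifier(random_state=0)),
    ("max_depth=5", RandomForestClassifier(max_depth=5, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```

Whether the shallow forest wins depends on the data; the point is to tune max_depth against validation accuracy rather than training accuracy.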

If the evaluation accuracy on the test data is lower than the training and validation accuracy, there is some difference between the training data and the test data used for evaluation. In this case we may need more data of the kind found in the test set for training. The following code plots an example learning curve. An upward trend in the curve for the test data suggests that test accuracy may improve as more data is used for training.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores_rfc, test_scores_rfc = learning_curve(
    rfc_test, X_train, y_train, cv=5, shuffle=True, scoring='accuracy',
    n_jobs=-1, verbose=0, random_state=123,
    train_sizes=np.array([0.1, 0.33, 0.55, 0.78, 1.0]))

plt.plot(train_sizes, test_scores_rfc.mean(1), 'o-', color="r", label="Test")
plt.plot(train_sizes, train_scores_rfc.mean(1), 'o-', color="g", label="Training")
plt.xlabel("Train size")
plt.ylabel("Accuracy")
plt.title('Learning curves')
plt.legend(loc="best")
plt.show()

If simple random sampling was used for splitting, switching to a stratified sampling approach may improve test accuracy.
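A minimal sketch of a stratified split with scikit-learn's train_test_split, which preserves the class proportions in both partitions; the imbalanced synthetic data here is an illustrative assumption.

```python
# Sketch: stratified train/test split preserving class proportions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data (~80% class 0, ~20% class 1).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# stratify=y keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Class proportions in train and test now match the full data set.
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```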

Statistical Analysis with Survey