Monday, December 30, 2019

My Failures in Machine Learning


Important:
This is an incomplete post and I may or may not edit it again. My knowledge is still limited, but I learned quite a bit from the mistakes I made. Please feel free to correct me if my understanding is wrong or I have made mistakes, and to offer any suggestions.
1. Not scaling data properly and then trying to train but getting poor results. Scaling only the training data and not the test data. Also scaling the train and test data separately: in scikit-learn I did standardization by fitting a scaler on train and transforming train, then fitting another scaler on test and transforming test with that.

The right way is to compute the mean and standard deviation from the training set only and then transform both the train and test sets with them. In scikit-learn, a StandardScaler is fit on the training data and that same scaler is used to transform both train and test.

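from sklearn import preprocessing

# Fit on the training data only, then reuse the same scaler for both sets.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)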

2. Getting very good results (or at least I thought so at the time!) and then not being able to get the same results again. In scikit-learn, setting the random_state parameter to a fixed integer helps get the same results across different runs. In PyTorch I am able to get the same result across multiple runs with:

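import random

import numpy as np
import torch

# Seed every random number generator that training may touch.
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
# Ask cuDNN for deterministic behavior (this can be slower).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False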
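
For the scikit-learn side, here is a rough sketch of what fixing random_state looks like. The synthetic dataset, the random forest and the run_once helper are all made up for illustration; the point is only that seeding both the split and the model gives the same score on every run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; random_state fixes which samples are generated.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def run_once(seed):
    """Hypothetical helper: split, train and score with every random step seeded."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


first = run_once(0)
second = run_once(0)
print(first, second)  # same seed, so the two scores are identical
```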

3. Even somewhat simple training taking a long time to run. I was using 1 core to run machine learning algorithms that support multiple CPU cores. This was because n_jobs in scikit-learn was 1 instead of the number of cores, or -1 to use all of them.
4. Trying to choose between TensorFlow, PyTorch, and Keras. Of these, PyTorch seems the easiest to set up on a local machine with Anaconda, with no need to install anything GPU-related separately. For tensorflow-gpu, several things must be downloaded in addition to the pip install. It can also be done by creating a conda environment and then installing TensorFlow with pip in that environment.
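
A minimal sketch of the n_jobs setting, assuming a random forest on synthetic data (both are my choice for illustration; not every scikit-learn estimator accepts n_jobs):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available CPU cores.
# Leaving n_jobs unset normally means a single core.
model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
model.fit(X, y)
train_accuracy = model.score(X, y)
print(train_accuracy)
```
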

Keras with the TensorFlow backend is recommended on their website. Of the three, Keras seems the easiest to understand, and there are a lot of code examples for Keras on Kaggle. The PyTorch basics tutorials also seem easy to follow.

5. Not using Google Colab to learn faster, and instead wasting time setting up libraries on a local machine. I modified the PyTorch CNN image classification tutorial, which someone had converted to run on GPU (I cannot remember the link to credit them), by adding another layer, dropout, batch normalization, and leaky ReLU.

Using this I was able to reach 77% accuracy in 10 epochs, though since I modified some parameters afterwards, this code will give different results. I saw a tutorial mention that, based on a paper, ReLU6 works better on CIFAR-10. This code can be run directly on Google Colab.


6. Not knowing how much it matters that real-world data is often highly imbalanced. I used accuracy as the metric for a binary classification problem where 90% of the data belonged to class 1 and the remaining 10% to class 2. This gave very good accuracy on both training and test data. But the confusion matrix made it clear that the model only put the majority samples in the correct class, while the minority samples were classified into the wrong class. So this is a useless system.

To remedy this, other metrics such as the ROC curve or the precision-recall (PR) curve may help. scikit-learn also provides balanced accuracy as well as class weight adjustment to handle this. Stratified K-fold cross validation is another approach. Finally, an imbalanced dataset can be balanced by undersampling the majority class (which loses some information), by adding synthetic examples to the minority class using algorithms such as SMOTE, or maybe by a combination of the two.
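
To make the accuracy trap concrete, here is a small sketch on a synthetic 90/10 dataset. The dataset, the DummyClassifier baseline and the logistic regression are assumptions for illustration, not what I originally ran. A model that always predicts the majority class gets high accuracy, but its balanced accuracy is only 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Roughly 90% of samples in class 0 and 10% in class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline that always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
majority_pred = majority.predict(X_test)
majority_acc = accuracy_score(y_test, majority_pred)
majority_bal_acc = balanced_accuracy_score(y_test, majority_pred)

# class_weight="balanced" weights each class inversely to its frequency.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X_train, y_train)
weighted_bal_acc = balanced_accuracy_score(y_test, weighted.predict(X_test))

print(majority_acc, majority_bal_acc, weighted_bal_acc)
```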

7. Not knowing about Type 1 (false positive) and Type 2 (false negative) errors early. Which one is worse depends on the situation. For example, if an innocent person is found guilty by the system (false positive), that is worse than a criminal being found innocent (false negative). In cases where it is not possible to reduce both, reducing the Type 1 error may make such a system more reliable.
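
Both error types can be read off a confusion matrix. A tiny sketch with made-up labels, where 1 means "guilty" (positive) and 0 means "innocent" (negative):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for this example.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0]

# With labels=[0, 1], ravel() gives tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

false_positive_rate = fp / (fp + tn)  # Type 1 error rate
false_negative_rate = fn / (fn + tp)  # Type 2 error rate
print(false_positive_rate, false_negative_rate)
```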

8. Not using feature selection to reduce the dataset's dimensions. The more features a dataset has, the longer it takes to train. scikit-learn provides various feature selection algorithms, and others can be implemented by hand.
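
One of the simpler options in scikit-learn is SelectKBest. This is only a sketch on synthetic data, with k=5 and the f_classif score picked arbitrarily; on a real problem the selector should be fit on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 30 features, of which only a few carry signal.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

# Keep the 5 features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X.shape, X_reduced.shape)
print(selector.get_support(indices=True))  # indices of the kept columns
```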

9. Data leakage from creating features manually while looking at the whole dataset (train, validation, test), and from performing feature selection on the whole dataset including the held-out test set. Also, tuning while looking at the results on both train and test data, and then manually moving data from the train set to the test set to get good results on both.

Another big one is having the same sample in both the train and test sets. These issues can be resolved by not looking at the test set at all, using cross validation on the training data to improve results, and evaluating on the test set only once the cross validation results are good.
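
One way to keep preprocessing and feature selection from seeing held-out data is to put them in a scikit-learn Pipeline, so they are refit inside every cross validation fold. A sketch, with the dataset, the steps and k=10 all chosen just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Put the test set aside first and do not touch it while tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling and selection are fit only on the training part of each fold.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),
    LogisticRegression(max_iter=1000),
)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(cv_scores.mean())

# Only once the cross validation results look good: one final test evaluation.
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(test_score)
```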

10. Wasting time by running too many epochs and overfitting on the training data. Also not using a proper validation scheme: using only train and test sets with no validation set. K-fold cross validation with random shuffling, or stratified K-fold in the case of imbalanced data, seems better than a fixed percentage split between train, validation, and test.
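
A small sketch of what stratification does, using a toy label array with 90 majority and 10 minority samples (invented for this example). Each of the 5 validation folds ends up with the same share of the minority class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the minority samples that land in each validation fold.
minority_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(minority_per_fold)
```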

11. Losing valuable information by undersampling the whole dataset (train, validation, test) until the classes have equal representation, and only then splitting it into train and test for k-fold cross validation.

The test set should be separated beforehand, and sampling must be performed per fold, on the training data only, in k-fold cross validation.
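
Roughly, that order of operations looks like the sketch below. The undersample helper is hypothetical and written with plain NumPy for this example (libraries such as imbalanced-learn offer ready-made samplers); the data and model are also just placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split


def undersample(X, y, rng):
    """Hypothetical helper: randomly keep as many samples per class as the smallest class has."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False) for c in classes]
    )
    return X[keep], y[keep]


X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 1. Separate the test set first; it keeps its original class balance.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Undersample inside each fold, and only the training part of the fold.
rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in skf.split(X_dev, y_dev):
    X_bal, y_bal = undersample(X_dev[train_idx], y_dev[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    pred = model.predict(X_dev[val_idx])  # validation fold stays imbalanced
    fold_scores.append(balanced_accuracy_score(y_dev[val_idx], pred))

print(np.mean(fold_scores))
```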

12. Overfitting, or memorizing the training data, by not using regularization. There are methods such as L1 and L2 regularization and, for neural networks, dropout.
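
As a small illustration of L1 and L2 on a linear model, assuming synthetic regression data and arbitrary alpha values: Ridge (L2) shrinks the coefficients, while Lasso (L1) tends to push some of them exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# 50 features, only 5 of which actually influence the target.
X, y = make_regression(
    n_samples=100, n_features=50, n_informative=5, noise=10.0, random_state=0
)

plain = LinearRegression().fit(X, y)  # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)    # L1 penalty

plain_norm = np.linalg.norm(plain.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(plain_norm, ridge_norm)  # the ridge coefficients have the smaller norm
print(n_zero_lasso)            # number of coefficients Lasso set to zero
```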

13. Using a very simple model where a more complex one is required. For example, trying to separate points with a straight line when the classes sit close to each other in a circular arrangement. A visual example of this can be found on TensorFlow Playground. Other examples are using very few convolution layers, or very few layers in a multi-layer perceptron.
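
The circular case can be reproduced with scikit-learn's make_circles. In this sketch the two models are my own picks for illustration: logistic regression as the "straight line" and an RBF-kernel SVM as the more flexible model.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer circle.
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear decision boundary cannot separate concentric circles.
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# A non-linear model can.
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(linear_acc, rbf_acc)
```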
