Part 4: Splitting
I wanted to split the dataset into a training set, a test/validation set, and a held-out set. Google's Machine Learning Crash Course covers splitting if you are not sure how to do it.
There are two common approaches: hold-out and cross-validation. Since I was unsure about the data I had collected, I decided to use both. I split the complete dataset into two piles: Pile A was used for the hold-out method, and Pile B was used for cross-validation.
The hold-out method uses 80% of the data for training and the remaining 20% for testing. The model is trained on the training set, and the test set is used to see how well the model performs on unseen data.
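Here is a minimal sketch of an 80/20 hold-out split using scikit-learn's train_test_split. The feature matrix and labels are random placeholders standing in for Pile A, not my actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for Pile A; swap in the real features and labels.
X = np.random.rand(100, 4)        # 100 samples, 4 features
y = np.random.randint(0, 2, 100)  # binary labels

# 80/20 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```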
Cross-validation randomly splits the dataset into 'k' groups. One group is used as the test set, and the rest are used as the training set. The model is trained on the training set and scored on the test set, and the process is repeated until each group has been used as the test set once.
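A sketch of k-fold cross-validation with scikit-learn's KFold, again using placeholder data in place of Pile B and a simple logistic regression as a stand-in model:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data standing in for Pile B; swap in the real features and labels.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # k = 5 groups
scores = []

for train_idx, test_idx in kf.split(X):
    # Each group takes one turn as the test set; the other k-1 groups form the training set.
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], preds))

print(f"Mean accuracy across {kf.get_n_splits()} folds: {np.mean(scores):.3f}")
```

Averaging the score across folds gives a more stable estimate of performance than a single hold-out split, which is why I used it on the pile I was less sure about.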
The hold-out method is suitable when you have a large dataset or are just starting to build an initial model.