Training and Test Sets
When training a model, it is crucial to set aside a small subset of your data and keep it untouched. A model should never be allowed to train on examples that also appear in the test data. If that happens, we can never trust the measured predictive power of the model, and the model will most likely fail to generalize (i.e. to correctly predict on previously unseen data) because the inflated test score hides overfitting.
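As a concrete illustration, here is a minimal sketch of holding out a test set with scikit-learn's `train_test_split`. The synthetic data and the 80/20 ratio are just assumptions for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration: 1,000 examples, 10 features.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Hold out 20% of the data; the model must never see X_test during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```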
Good performance on the test set is a useful indicator that the model is likely to do well on new, unseen data. For this, make sure that:
- The test set is large enough to be representative of the data set
- The training set does not contain the same examples as the test set (this can happen if there are duplicates in the dataset; see the sketch after this list)
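One simple way to guard against that second point is to deduplicate before splitting. A quick sketch with pandas; the dataframe and column names here are made up for the example:

```python
import pandas as pd

# Hypothetical dataframe; the column names are assumptions for illustration.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 2.0, 3.0],
    "feature_b": [0.5, 0.1, 0.1, 0.9],
    "label":     [0, 1, 1, 0],
})

# Drop exact duplicate rows *before* splitting, so the same example
# cannot end up in both the training and test sets.
df = df.drop_duplicates().reset_index(drop=True)
```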
The entire cycle of testing and training would look something like this:
Partitioning the dataset into training and test sets lets us train on one set of examples and then evaluate the model against a separate set it has never seen (the test data). After evaluation, the model is tweaked a bit and the process repeats. At the end of the process, we get a model that works great on the test set as well. This is it, right? Um… no.
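Here is a sketch of that cycle, continuing from the split above and assuming a simple hyperparameter sweep as the "tweak" step. Notice the pitfall: the test set is consulted on every iteration.

```python
from sklearn.linear_model import LogisticRegression

# Naive cycle: each tweak is judged against the test set,
# so the test set gradually stops being truly "unseen" data.
best_score, best_model = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:  # the "tweak the model" step
    model = LogisticRegression(C=C).fit(X_train, y_train)
    score = model.score(X_test, y_test)  # test set used for every decision!
    if score > best_score:
        best_score, best_model = score, model
```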
Validation Set
Chances are that, through the iterative process above, you have overfit to the test set. So how can you tell whether that happened? Simple: divide your data set into three parts, not two.
- Use the validation set to evaluate results from the training set. Only once you are satisfied with the quality of the model should you double-check your results against the test set
- Pick the model with the best performance on the validation set and then move on to the test set for the final check, as in the sketch below
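A minimal sketch of this three-way split, again using `train_test_split` (applied twice) and the variables from the first example; the roughly 60/20/20 proportions are just an assumption:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# First carve off the test set, then split the remainder into
# training and validation sets (roughly 60/20/20 overall).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2
)

# Tune against the validation set only...
best_score, best_model = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model

# ...and touch the test set exactly once, for the final check.
final_score = best_model.score(X_test, y_test)
```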
Source: Machine Learning Crash Course by Google