Questions and discussion for this lecture live here. Hit Reply below to ask a question.
Thank you, Tim, for your description of how to assess model quality and how to split the available data into training and test sets. But I have a question.
Having identified an approach that produces an effective model, which will then be applied to truly “unseen” data for which the results are not known (rather than to test data held out from the available data), would it not be sensible to retrain the model on all of the available data, i.e. the data previously split into training and test sets? Would this not give an even better model, especially when the available dataset is small? Maybe it is so obvious that it is already done!
This is a really interesting point. I had to think about it. On the face of it, yes: once you have validated the approach, train with more data. More data is better, right?
I guess the answer is yes. But if including the test data in the training made a big difference, I would be worried. Not only would I worry that my original test wasn’t fair, but I would also worry about my model’s sensitivity to including a small percentage more of my data.
If the latter were the case, I would be doing more investigation anyway. In a way, this is what k-fold cross-validation seeks to address!
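Here is a minimal sketch of that idea, assuming scikit-learn is available; the logistic regression and the built-in toy dataset are just stand-ins for whatever model and data you actually validated:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder dataset and model; substitute your own.
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: each fold trains on ~80% of the data and
# tests on the remaining ~20%, so the spread of scores indicates how
# sensitive the model is to which subset of the data it sees.
scores = cross_val_score(model, X, y, cv=5)
print(f"fold accuracies: {np.round(scores, 3)}")
print(f"mean {scores.mean():.3f} +/- {scores.std():.3f}")

# If the fold scores are stable, refit on ALL the data to get the
# final model to apply to truly unseen examples.
final_model = LogisticRegression(max_iter=5000).fit(X, y)
```

If the fold scores vary a lot, that is exactly the sensitivity worry above, and it would be worth investigating before trusting a model refit on everything.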