
K-Fold Cross Validation - Intro to Machine Learning


Showing Revision 4 created 05/25/2016 by Udacity Robot.

  1. So Katie, you told everybody about training and
  2. test sets, and I hope people exercised it quite a bit.
  3. Is that correct? >> Yes, that's right.
  4. >> So now I'm going to talk about something that slightly generalizes this
  5. called cross validation.
  6. And to get into cross validation, let's first talk about problems with
  7. splitting a data set into training and testing data.
  8. Suppose this is your data.
  9. By doing what Katie told you,
  10. you now have to say what fraction of data is testing and what is training.
  11. And the dilemma you're running into is you like to maximize both of the sets.
  12. You want to have as many data points in the training sets to
  13. get the best learning results, and you want the maximum number of data items in
  14. your test set to get the best validation.
  15. But obviously, there's an inherent trade-off here, which is every data point you
  16. take out of the training set into the test is lost for the training set.
  17. So we have to resolve this trade-off.
  18. And this is where cross validation comes into the picture.
  19. The basic idea is that you partition the data set into k bins of equal size.
  20. So for example, if you have 200 data points.
  21. And you have ten bins.
  22. Very quickly.
  23. What's the number of data points per bin?
  24. Quite obviously, it's 20.
  25. So you will have 20 data points in each of the 10 bins.
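The bin arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the course's actual code (in practice you would typically use a helper such as scikit-learn's KFold), and the function name is hypothetical:

```python
# Partition n data points into k contiguous bins of equal size,
# as described above: 200 points and 10 bins gives 20 points per bin.

def make_bins(n_points, k):
    """Split indices 0..n_points-1 into k equal-size bins."""
    bin_size = n_points // k
    return [list(range(i * bin_size, (i + 1) * bin_size)) for i in range(k)]

bins = make_bins(200, 10)
print(len(bins))     # 10 bins
print(len(bins[0]))  # 20 data points in each bin
```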
  26. So here's the picture.
  27. Whereas in the work that Katie showed you, you just pick one of those bins as
  28. a testing bin and the rest as your training data.
  29. In k-fold cross validation, you run k separate learning experiments.
  30. In each of those, you pick one of those k subsets as your testing set.
  31. The remaining k minus one bins are put together into the training set,
  32. then you train your machine learning algorithm and
  33. just like before, you'll test the performance on the testing set.
  34. The key thing in cross validation is you run this multiple times.
  35. In this case ten times, and then you average the testing set
  36. performances over the ten different hold-out sets; that is,
  37. you average the test results from those k experiments.
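The k learning experiments just described can be sketched in plain Python. This is a minimal sketch assuming a toy "model" that simply predicts the mean training label (a hypothetical stand-in for illustration, not the course's classifier); the structure of the loop is the point:

```python
# k-fold cross validation: each bin serves once as the testing set,
# the remaining k-1 bins together form the training set, and the
# k test scores are averaged at the end.

def k_fold_score(labels, k):
    """Average mean-squared error over k hold-out folds."""
    n = len(labels)
    fold = n // k
    scores = []
    for i in range(k):
        test_idx = set(range(i * fold, (i + 1) * fold))
        train_idx = [j for j in range(n) if j not in test_idx]
        # "Train": this toy model just memorizes the mean training label.
        mean_label = sum(labels[j] for j in train_idx) / len(train_idx)
        # "Test": mean squared error on the held-out bin.
        mse = sum((labels[j] - mean_label) ** 2 for j in test_idx) / fold
        scores.append(mse)
    # Average the test results from the k separate experiments.
    return sum(scores) / k

labels = [float(i % 5) for i in range(20)]  # 20 toy data points
avg = k_fold_score(labels, k=4)
```

Note that every data point ends up in a test set exactly once and in a training set k-1 times, which is why cross validation uses all the data for both roles.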
  38. So obviously, this takes more compute time because you now have to run
  39. k separate learning experiments, but
  40. the assessment of the learning algorithm will be more accurate.
  41. And in a way, you've kind of used all your data for
  42. training and all your data for testing, which is kind of cool.
  43. Let me just ask one question.
  44. Suppose you have a choice: use the static train/test methodology that Katie
  45. told you about, or use, say, 10-fold cross validation (CV).
  46. Consider three goals: minimizing training time,
  47. minimizing run time after training (the time your trained machine learning
  48. algorithm needs to produce output), and maximizing accuracy.
  49. In each of these three situations, you might pick either train/test or
  50. 10-fold cross validation.
  51. Give me your best guess.
  52. Which one would you pick?
  53. So for each goal, such as minimum training time,
  54. pick one of the two over here on the right side.